WO2023098222A1 - Multi-service scenario identification method and decision forest model training method - Google Patents

Info

Publication number
WO2023098222A1
Authority
WO
WIPO (PCT)
Prior art keywords
forest model
data flow
decision
decision forest
flow characteristics
Prior art date
Application number
PCT/CN2022/118249
Other languages
French (fr)
Chinese (zh)
Inventor
王子晟
张耀东
刘昕颖
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023098222A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows

Definitions

  • Embodiments of the present invention relate to the technical field of wireless local area networks, and in particular to a multi-service scene identification method and a decision forest model training method.
  • Business scene recognition algorithms allow Wireless Local Area Network (WLAN) access points and devices to perceive the applications, services and scenes that the current user is accessing, such as games, voice, video, and live broadcast. Then, according to the characteristics of those applications, services, and scenarios, the access point provides users with different WLAN parameter configurations and services, which can maximize the user's network access experience. For example, game packets are sent with priority to give users a low-latency, low-stutter gaming experience. The business scene recognition algorithm is the cornerstone of this entire process, and its performance is crucial.
  • The current business scene recognition algorithms include the following:
  • Packet-based business scenario identification: in an ad hoc network, traffic packets containing business scenario type information are sent between devices. By interpreting the contents of these traffic packets, business scenario identification can be realized.
  • the main purpose of the embodiments of the present invention is to propose a multi-service scene recognition method and a decision forest model training method, which can realize multi-service scene recognition and improve user experience.
  • An embodiment of the present invention provides a method for identifying multiple business scenarios, including: acquiring the data flow characteristics of message data of the business scenario to be identified; inputting the data flow characteristics into a pre-trained decision forest model, wherein the decision forest model includes N decision trees, the decision trees are used to identify the business scenarios of data flow characteristics, N is a natural number greater than 0, and the training samples of the decision forest model include data flow characteristics of multiple business scenarios; and obtaining the business scenario of the message data according to the identification results of the N decision trees.
  • An embodiment of the present invention also provides a training method for a decision forest model, including: obtaining training samples, wherein the training samples include data flow characteristics of multiple business scenarios; and training an initial decision forest model according to the training samples to obtain a trained decision forest model, wherein the decision forest model includes N decision trees, the decision trees are used to identify the business scenarios of data flow characteristics, and N is a natural number greater than 0.
  • An embodiment of the present invention also provides an identification device for multiple business scenarios, including: a feature acquisition module, configured to acquire the data flow characteristics of message data of the business scenario to be identified; an input module, configured to input the data flow characteristics into a pre-trained decision forest model, wherein the decision forest model includes N decision trees, the decision trees are used to identify the business scenarios of data flow characteristics, N is a natural number greater than 0, and the training samples of the decision forest model include data flow characteristics of multiple business scenarios; and a scenario acquisition module, configured to acquire the business scenario of the message data according to the identification results of the N decision trees.
  • An embodiment of the present invention also provides a training device for a decision forest model, including: a sample acquisition module, configured to acquire training samples, wherein the training samples include data flow characteristics of multiple business scenarios; and a training module, configured to train an initial decision forest model according to the training samples to obtain a trained decision forest model, wherein the decision forest model includes N decision trees, the decision trees are used to identify the business scenarios of data flow characteristics, and N is a natural number greater than 0.
  • An embodiment of the present invention further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above-mentioned multi-service scene recognition method or the above-mentioned decision forest model training method.
  • An embodiment of the present invention also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above-mentioned multi-service scene identification method or the above-mentioned decision forest model training method is implemented.
  • The method for identifying multiple business scenarios proposed by the present invention first obtains the data flow characteristics of the message data of the business scenario to be identified and inputs the data flow characteristics into a pre-trained decision forest model. The decision forest model includes N decision trees, which are used to identify the business scenarios of data flow characteristics, where N is a natural number greater than 0, and the training samples of the decision forest model include data flow characteristics of multiple business scenarios. The business scenario of the message data is then obtained according to the identification results of the N decision trees.
  • Because the training samples of the decision forest model include the data flow characteristics of multiple business scenarios, the business scenarios covered are wide-ranging. The obtained data flow characteristics of the message data are input into the trained decision forest model, and the business scenario of the message data is obtained from the identification results of the decision trees in the model. This realizes the identification of multiple business scenarios, so that corresponding WLAN parameter configurations and services can be provided according to the business scenario of the message data, effectively improving the user experience.
  • FIG. 1 is a system architecture diagram provided according to an embodiment of the present invention
  • FIG. 2 is a structural diagram of an access point device provided according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a method for identifying multiple business scenarios according to an embodiment of the present invention
  • FIG. 4 is a structural diagram of a decision forest model provided according to an embodiment of the present invention.
  • FIG. 5 is a flow chart of a method for identifying multiple business scenarios according to another embodiment of the present invention.
  • FIG. 6 is a flowchart of a training method for a decision forest model provided according to an embodiment of the present invention.
  • Fig. 7 is a schematic diagram of an identification device for a multi-service scenario provided according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a training device for a decision forest model provided according to an embodiment of the present invention.
  • Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • An embodiment of the present invention relates to a multi-service scenario identification method, which is applied to an access point device.
  • The application scenario of the embodiment of the present invention may include, but is not limited to, the system architecture shown in FIG. 1, which includes an access point device and multiple user equipments.
  • A WLAN has at least one access point device; the link between the user equipment and the access point device is not limited to wireless, and the access point device provides wireless data access services for the user equipment.
  • An access point device can provide wireless data access services for one or more user equipments at the same time. The user equipment may be, but is not limited to, a mobile phone, an Internet of Things terminal, or another access point device. Communication between the access point device and the user equipment may be based on, but is not limited to, the 802.11 family of protocols.
  • The structure of the access point device may be as shown in FIG. 2, specifically including: a radio frequency module, a physical link layer, a media access control layer, and a multi-service scenario identification module.
  • The radio frequency module and the physical link layer are configured to demodulate the wireless signal of the user equipment and send the demodulated signal to the media access control layer.
  • The media access control layer is configured to acquire the message data, of the service scenario to be identified, that the user equipment sends to the Internet.
  • The multi-service scenario identification module is configured to: obtain the data flow characteristics of the message data of the business scenario to be identified; input the data flow characteristics into the pre-trained decision forest model, wherein the decision forest model includes N decision trees, the decision trees are used to identify the business scenarios of data flow characteristics, N is a natural number greater than 0, and the training samples of the decision forest model include data flow characteristics of multiple business scenarios; and obtain the business scenario of the message data according to the identification results of the N decision trees.
  • Step 301, acquire the data flow characteristics of the message data of the business scenario to be identified.
  • Step 302, input the data flow characteristics into the pre-trained decision forest model.
  • Step 303, obtain the business scenario of the message data according to the identification results of the decision trees.
  • The decision forest model includes N decision trees, which are used to identify the business scenarios of data flow characteristics, where N is a natural number greater than 0, and the training samples of the decision forest model include data flow characteristics of multiple business scenarios. The business scenario of the message data is obtained according to the identification results of the N decision trees.
  • Because the training samples of the decision forest model include the data flow characteristics of multiple business scenarios, the business scenarios covered are wide-ranging. The obtained data flow characteristics of the message data are input into the trained decision forest model, and the business scenario of the message data is obtained from the identification results of the decision trees in the model. This realizes the identification of multiple business scenarios, so that corresponding WLAN parameter configurations and services can be provided according to the business scenario of the message data, effectively improving the user experience.
  • the access point device identifies and counts the packet data sent by the user equipment, so as to obtain the data flow characteristics of the packet data.
  • The data flow characteristics include one or any combination of the following: source IP, source port, destination IP, destination port, protocol type, maximum packet size, minimum packet size, average packet size, variance of packet size, maximum message exchange time, minimum message exchange time, and average message exchange time.
  • The source IP is the Internet Protocol address of the data flow sender; the source port is the port address of the data flow sender; the destination IP is the Internet Protocol address of the data flow receiver; the destination port is the port address of the data flow receiver.
  • The protocol type is the Internet protocol type, for example, Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol (UDP), or Internet Control Message Protocol (ICMP).
  • The maximum packet size and the minimum packet size are obtained directly from the media access control layer in the access point device; the average packet size is calculated according to the maximum packet size and the minimum packet size; the variance of the packet size is calculated according to the maximum packet size, the minimum packet size and the average packet size; the maximum message exchange time and the minimum message exchange time are obtained by recording the access point device timer; the average message exchange time is calculated based on the maximum message exchange time.
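  • The feature set listed above can be illustrated with a minimal sketch. It is not the patent's implementation: the `FlowFeatures` and `extract_features` names are hypothetical, the statistics are computed from every observed packet rather than from the maximum and minimum only, and "message exchange time" is assumed here to mean the gap between consecutive packets of one flow.

```python
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass
class FlowFeatures:
    """Hypothetical container for the twelve data flow characteristics."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    max_size: int
    min_size: int
    avg_size: float
    var_size: float
    max_gap: float
    min_gap: float
    avg_gap: float


def extract_features(src_ip, src_port, dst_ip, dst_port, protocol, packets):
    """Summarize one flow.

    packets: list of (timestamp_seconds, size_bytes) tuples in arrival order.
    Assumption: exchange time is the gap between consecutive packets.
    """
    if len(packets) < 2:
        raise ValueError("need at least two packets to measure exchange times")
    sizes = [size for _, size in packets]
    times = [timestamp for timestamp, _ in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return FlowFeatures(
        src_ip=src_ip, src_port=src_port, dst_ip=dst_ip, dst_port=dst_port,
        protocol=protocol,
        max_size=max(sizes), min_size=min(sizes),
        avg_size=mean(sizes), var_size=pvariance(sizes),
        max_gap=max(gaps), min_gap=min(gaps), avg_gap=mean(gaps),
    )
```

  • For example, `extract_features("10.0.0.2", 50000, "93.184.216.34", 443, "TCP", [(0.0, 100), (0.5, 300), (2.0, 200)])` summarizes a three-packet flow; the addresses and packet values are made up for illustration.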
  • In step 302, the access point device inputs the acquired data flow characteristics into a pre-trained decision forest model.
  • The decision forest model is a data structure obtained by training with the gradient boosting decision tree method.
  • The specific structure of the decision forest model is shown in FIG. 4; it includes N decision trees, where N is a natural number greater than 0.
  • Each decision tree is composed of nodes and leaf nodes.
  • Each node contains a logical judgment on one data flow characteristic. When the logical judgment is true, the corresponding subtree is entered and the node logic of that subtree is evaluated, and so on until a leaf node is reached.
  • The number of leaf nodes of each decision tree is the same as the number of business scenario categories.
  • In this way, the recognition results of the N decision trees can be obtained.
  • The recognition result of a decision tree includes the recognized business scenario and the weight of the recognized business scenario.
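  • The tree structure described above can be sketched as follows. This is a hedged illustration only: the `Node`, `Leaf` and `evaluate_tree` names are hypothetical, and the logical judgment is assumed to be a simple "feature value <= threshold" test, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    scenario: str   # recognized business scenario
    weight: float   # weight of that scenario for the current input


@dataclass
class Node:
    feature: str                     # name of the data flow characteristic tested
    threshold: float                 # assumed form of the logical judgment
    if_true: Union["Node", Leaf]     # subtree entered when the judgment is true
    if_false: Union["Node", Leaf]    # subtree entered otherwise


def evaluate_tree(tree, features):
    """Walk from the root to a leaf and return (scenario, weight)."""
    current = tree
    while isinstance(current, Node):
        judgment = features[current.feature] <= current.threshold
        current = current.if_true if judgment else current.if_false
    return current.scenario, current.weight
```

  • With a one-node tree such as `Node("avg_size", 200.0, Leaf("voice", 0.7), Leaf("video", 0.4))`, a flow whose `avg_size` is 120 reaches the "voice" leaf; the scenario names, threshold and weights here are invented for the example.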
  • When the data flow characteristics include the source IP and/or destination IP, the address of the source IP and/or destination IP is converted into binary data, the binary data is converted into an unsigned integer, and decimal precision normalization is then performed on the unsigned integer to obtain the normalized source IP and/or destination IP used as input to the decision forest model.
  • K_ip is the normalization coefficient of the source IP and/or destination IP; in one example, K_ip is 9.
  • When the data flow characteristics include the source port and/or destination port, decimal precision normalization is performed on the source port and/or destination port to obtain the normalized source port and/or destination port used as input to the decision forest model, so as to improve the performance of business scenario recognition.
  • K_port is the normalization coefficient of the source port and/or destination port. In one example, K_port is 5.
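  • The normalization formulas themselves are not reproduced in this text, so the sketch below rests on an assumption: that decimal precision normalization means dividing the value by 10 raised to the normalization coefficient (decimal scaling). The function names are hypothetical, only IPv4 addresses are handled, and the coefficients 9 and 5 are the example values given above.

```python
import ipaddress

K_IP = 9    # example normalization coefficient for IP addresses
K_PORT = 5  # example normalization coefficient for ports


def normalize_ip(address: str) -> float:
    """Convert a dotted IPv4 address to an unsigned integer, then scale it.

    Assumption: normalization is value / 10**K, which the source does not confirm.
    """
    as_unsigned = int(ipaddress.IPv4Address(address))
    return as_unsigned / 10 ** K_IP


def normalize_port(port: int) -> float:
    """Scale a 16-bit port number under the same assumed rule."""
    if not 0 <= port <= 65535:
        raise ValueError("port out of range")
    return port / 10 ** K_PORT
```

  • Under this assumption the largest port, 65535, maps to 0.65535, and every IPv4 address maps to a value below 4.3, so both features land in a small numeric range before they reach the model.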
  • When the data flow characteristics include the protocol (Protocol) type, after the data flow characteristics of the message data of the business scenario to be identified are obtained and before they are input into the pre-trained decision forest model, the currently obtained protocol type is mapped to the corresponding integer according to a preset mapping relationship between protocol types and integers, and the corresponding integer is used as the protocol type input into the decision forest model, so as to improve the performance of business scenario recognition.
  • In the mapping function used, Protocol is the protocol type name in the data flow characteristics.
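  • Since the mapping function is not reproduced in this text, the following is only one possible mapping, chosen for illustration: it reuses the IANA-assigned protocol numbers (ICMP is 1, TCP is 6, UDP is 17) and returns -1 for names it does not know. The table, the fallback value and the `map_protocol` name are assumptions, not the patent's mapping.

```python
# Assumed mapping: IANA protocol numbers, not necessarily the patent's table.
PROTOCOL_TO_INT = {"ICMP": 1, "TCP": 6, "UDP": 17}
UNKNOWN_PROTOCOL = -1  # hypothetical fallback for unlisted protocol names


def map_protocol(protocol: str) -> int:
    """Map a protocol type name to the integer fed to the decision forest."""
    return PROTOCOL_TO_INT.get(protocol.strip().upper(), UNKNOWN_PROTOCOL)
```

  • Any fixed, injective mapping would serve the same purpose; what matters is that the same table is applied at training time and at identification time.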
  • The above methods of obtaining the normalized source IP and/or destination IP and source port and/or destination port are only examples; the specific normalization method is not unique.
  • In step 303, the access point device obtains the business scenario of the message data according to the obtained identification results of the N decision trees.
  • The value of a leaf node represents the weight of the business scenario category corresponding to that leaf node for the current input message data. Therefore, according to the recognition results of the N decision trees, the weights of identified business scenarios of the same category are accumulated to obtain the weight accumulation value of each recognized business scenario, and the business scenario with the highest weight accumulation value is taken as the business scenario of the message data.
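  • The accumulation step described above can be sketched as follows. The `pick_scenario` name and the list-of-pairs input format are hypothetical; the logic simply sums the weights per scenario category and returns the category with the largest total.

```python
from collections import defaultdict


def pick_scenario(tree_results):
    """Combine per-tree results into one business scenario.

    tree_results: iterable of (scenario, weight) pairs, one per decision tree.
    Returns the winning scenario and the accumulated weight of every scenario.
    """
    totals = defaultdict(float)
    for scenario, weight in tree_results:
        totals[scenario] += weight  # accumulate weights of the same category
    if not totals:
        raise ValueError("no decision tree results to combine")
    best = max(totals, key=totals.get)
    return best, dict(totals)
```

  • For instance, three trees reporting `("video", 0.6)`, `("voice", 0.9)` and `("video", 0.5)` yield "video", because its accumulated weight exceeds the single larger vote for "voice"; these values are made up for the example.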
  • The method for identifying multiple business scenarios in the embodiment of the present invention identifies multiple business scenarios based on the decision forest model, and its recognition accuracy is relatively high.
  • The business scenarios of the message data are divided into the following 15 categories: network access, system, download, web browsing, voice, mail, streaming, social media, chat, remote writing, music, cloud storage, software upgrades, video and others.
  • Each category has between 2000 and 6000 data flow samples, and the detection probability of a scenario category is defined as:
  • Table 1 shows the detection probability of the multi-service scene identification method of the embodiment of the present invention for each service scene category:
  • Class name       | Detection probability (%) | Number of training samples | Number of test samples
    All applications | 87.14 | 280555 | 70052
    Network access   | 99.18 | 23852  | 6148
    System           | 99.11 | 24129  | 5871
    Download         | 97.64 | 23940  | 6060
    Web page         | 96.94 | 23975  | 6025
    Other            | 95.64 | 24057  | 5943
  • The identification accuracy of the multi-service scene identification method in the embodiment of the present invention reaches 87.14%.
  • The detection probability of 6 categories exceeds 95%, and the detection probability of 11 categories exceeds 80%.
  • A high overall detection probability is thus ensured.
  • FIG. 5 is a flow chart of the method for identifying multiple business scenarios described in this embodiment, specifically including:
  • Step 501, update the data flow characteristics within the latest preset time period into the training samples of the decision forest model.
  • The access point device requests the user equipment to feed back the services it used within the latest preset time period, takes the data flow characteristics obtained within the latest preset time period as the data flow characteristics corresponding to those services, and updates the data flow characteristics within the latest preset time period into the training samples of the decision forest model.
  • The data flow characteristics within the latest preset time period are labeled with business scenarios, and their data volume is the same as the data volume of the training samples before the update, so as to avoid excessive reliance on the data flow characteristics of the latest preset time period.
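  • One way to keep the recent data at the same volume as the earlier training samples is sketched below. The `merge_training_samples` name, the random down-sampling, and the behavior when fewer recent samples are available (all of them are used) are assumptions made for illustration, not details taken from the source.

```python
import random


def merge_training_samples(existing, recent, seed=0):
    """Append recent labeled samples without letting them outweigh old ones.

    existing, recent: lists of (features, scenario_label) pairs.
    At most len(existing) recent samples are drawn at random (assumption).
    """
    rng = random.Random(seed)
    count = min(len(existing), len(recent))
    picked = rng.sample(recent, count)
    return existing + picked
```

  • The cap of `len(existing)` mirrors the stated goal of not relying excessively on the latest preset time period; a deployment could equally weight samples instead of discarding them.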
  • Labeling the data flow characteristics with business scenarios can be implemented in the following manner.
  • With the permission of the user equipment, the access point device uploads the original message data fed back by the user equipment to a server, and the server labels it using deep packet inspection.
  • Step 502, update the decision forest model according to the updated training samples.
  • The access point device uses the data flow characteristics within the latest preset time period to add a leaf node under an original decision tree node of the decision forest model, which updates the decision tree and thereby the decision forest model.
  • Step 503, acquire the data flow characteristics of the message data of the business scenario to be identified.
  • Step 503 is substantially the same as step 301 and will not be repeated here.
  • Step 504, input the data flow characteristics into the updated decision forest model.
  • Step 505, obtain the business scenario of the message data according to the identification results of the decision trees.
  • By updating the data flow characteristics within the latest preset time period into the training samples of the decision forest model and updating the decision forest model according to the updated training samples, the decision forest model can be optimized, which further improves the recognition performance for multiple business scenarios.
  • FIG. 6 is a flowchart of the training method of the decision forest model described in this embodiment.
  • Step 601, acquire training samples.
  • The access point device obtains data flow characteristics from the user equipment as the initial training samples of the decision forest model. There may be one or more user equipments, and the training samples include the data flow characteristics of multiple business scenarios.
  • The data flow characteristics include one or any combination of the following: source IP, source port, destination IP, destination port, protocol type, maximum packet size, minimum packet size, average packet size, variance of packet size, maximum message exchange time, minimum message exchange time, and average message exchange time.
  • The source IP and/or destination IP is the source IP and/or destination IP after conversion to binary data and an unsigned integer and after decimal precision normalization; the source port and/or destination port is the source port and/or destination port after decimal precision normalization; and the protocol type is the integer obtained from the preset mapping relationship between protocol types and integers, so as to improve the recognition performance of the decision forest model.
  • The specific test process adopted is the same as the test process in the first embodiment and will not be repeated here.
  • The difference is that, before the initial decision forest model is trained, the above-mentioned data processing is not performed on the acquired data flow characteristics. After multiple business scenarios are then identified, the detection probability of each business scenario category is as shown in Table 2:
  • The overall recognition probability is reduced from 87.14% in Table 1 to 83.83%.
  • The number of business categories with a detection probability above 95% is reduced from 6 to 4, and the number of business categories with a detection probability above 80% is reduced from 11 to 8.
  • For individual services, such as social media, the detection probability drops by as much as 7.24%.
  • Using the processed data flow characteristics as training samples can therefore effectively improve business recognition performance.
  • Step 602, train the initial decision forest model according to the training samples to obtain a trained decision forest model.
  • the access point device uses the gradient boosting decision tree method to train the initial decision forest model according to the obtained training samples, so as to obtain a trained decision forest model.
  • the decision forest model includes N decision trees, which are used to identify business scenarios of data flow characteristics; N is a natural number greater than 0.
  • The specific training process is as follows: first define x_i as the i-th training data, define y_i as the business scenario category corresponding to the i-th training data, and define \hat{y}_i as the prediction result of the model for the i-th training data. T is the number of decision trees and f_t is the function of the t-th decision tree, t = 1, ..., T; the aim is to determine the training parameters of the decision trees for which the recognition accuracy of the decision forest model is the highest.
  • Define \theta_t as the parameter of f_t; that is, when the recognition accuracy of the decision forest model is the highest, the training parameters of the decision trees can be obtained by the following calculation formula
  • The training samples are iterated over T times. In each iteration, on the basis of the training parameters used in the previous iteration, a new tree is added that makes the objective function decrease, and \hat{y}_i^{(t)} is defined as the prediction result, for the i-th training data, of the model updated in the t-th iteration.
  • the recognition accuracy of the decision forest model can be improved by obtaining the optimal training parameters, that is, the training parameters of the decision tree, for example, the number of decision trees, during the training of the decision forest model.
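  • The iterative idea described above, in which each round adds a tree that lowers the objective, can be illustrated with a deliberately simplified sketch. It is not the patent's training procedure: it uses a single numeric feature, a squared-error objective, one-split trees (stumps) and a fixed learning rate, all of which are assumptions, and the `fit_stump`, `predict` and `train_forest` names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree to the residuals by least squares.

    Returns (threshold, left_value, right_value).
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    if best is None:  # every x is identical, so no split is possible
        average = sum(residuals) / len(residuals)
        return xs[0], average, average
    return best[1:]


def predict(trees, x, learning_rate):
    """Sum the scaled outputs of all trees added so far."""
    return sum(learning_rate * (left if x <= threshold else right)
               for threshold, left, right in trees)


def train_forest(xs, ys, num_trees, learning_rate=0.5):
    """Add num_trees stumps, each fitted to the current residuals.

    Returns the trees and the mean squared error measured before each addition.
    """
    trees, losses = [], []
    for _ in range(num_trees):
        predictions = [predict(trees, x, learning_rate) for x in xs]
        residuals = [y - p for y, p in zip(ys, predictions)]
        losses.append(sum(r * r for r in residuals) / len(xs))
        trees.append(fit_stump(xs, residuals))
    return trees, losses
```

  • For squared error, the residuals are the negative gradient of the objective, so fitting each new stump to them is what makes the recorded loss non-increasing from one iteration to the next. A multi-class model such as the one described here would use a classification objective and multi-feature trees instead.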
  • The division of the above methods into steps is only for clarity of description. During implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, they are all within the scope of protection of this patent. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or process without changing its core design is also within the scope of protection of this patent.
  • FIG. 7 is a schematic diagram of an apparatus for identifying multiple business scenarios described in this embodiment, including: a feature acquisition module 701, an input module 702 and a scenario acquisition module 703.
  • the feature acquisition module 701 is configured to acquire the data flow features of the message data of the service scenario to be identified.
  • The input module 702 is configured to input the data flow characteristics into a pre-trained decision forest model, wherein the decision forest model includes N decision trees, the decision trees are used to identify the business scenarios of data flow characteristics, N is a natural number greater than 0, and the training samples of the decision forest model include data flow characteristics of multiple business scenarios.
  • The input module 702 is also configured to, when the data flow characteristics include the source IP and/or destination IP: convert the address of the source IP and/or destination IP into binary data; convert the binary data into an unsigned integer; perform decimal precision normalization on the unsigned integer to obtain the normalized source IP and/or destination IP; and input the normalized source IP and/or destination IP into the pre-trained decision forest model.
  • The input module 702 is also configured to, when the data flow characteristics include the source port and/or destination port, perform decimal precision normalization on the source port and/or destination port to obtain the normalized source port and/or destination port, and input the normalized source port and/or destination port into the pre-trained decision forest model.
  • The input module 702 is also configured to, when the data flow characteristics include the protocol type, map the currently obtained protocol type to the corresponding integer according to the preset mapping relationship between protocol types and integers, and input the corresponding integer into the pre-trained decision forest model.
  • the scenario acquisition module 703 is configured to acquire the business scenario of the message data according to the identification results of the N decision trees.
  • The scenario acquisition module 703 is also configured to accumulate the weights of identified business scenarios of the same category according to the identification results of the N decision trees to obtain the weight accumulation value of each identified business scenario, and to take the business scenario with the highest weight accumulation value as the business scenario of the message data.
  • FIG. 8 is a schematic diagram of a training device for the decision forest model described in this embodiment, including: a sample acquisition module 801 and a training module 802.
  • the sample acquisition module 801 is configured to acquire training samples, wherein the training samples include data flow characteristics of multiple business scenarios.
  • the training module 802 is set to train the initial decision forest model according to the training samples to obtain the trained decision forest model; wherein, the decision forest model includes N decision trees, and the decision trees are used to identify business scenarios of data flow characteristics; N is a natural number greater than 0.
  • this embodiment is an apparatus embodiment corresponding to the above embodiment of the method for identifying multiple service scenarios, and this embodiment can be implemented in cooperation with the above method embodiment.
  • the relevant technical details and technical effects mentioned in the above embodiments are still valid in this embodiment, and will not be repeated here to reduce repetition.
  • the relevant technical details mentioned in this embodiment can also be applied in the above embodiments.
  • a logical unit can be a physical unit, a part of a physical unit, or a combination of multiple physical units.
  • this embodiment does not introduce units that are not closely related to solving the technical problems raised by the embodiment of the present invention, but this does not mean that there are no other units in this embodiment.
  • Another embodiment of the present invention relates to an electronic device, as shown in FIG. 9, including: at least one processor 901; and a memory 902 communicatively connected to the at least one processor 901; wherein the memory 902 stores instructions executable by the at least one processor 901, and the instructions are executed by the at least one processor 901 so that the at least one processor 901 can execute the multi-service scenario identification method and the decision forest model training method.
  • the memory and the processor are connected by a bus
  • the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors and various circuits of the memory together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor is transmitted over the wireless medium through the antenna; the antenna also receives data and passes it to the processor.
  • the processor manages the bus and general processing, and can also provide various functions including timing, peripheral interfacing, voltage regulation, power management, and other control functions. The memory can be used to store data that the processor uses when performing operations.
  • Another embodiment of the present invention relates to a computer-readable storage medium storing a computer program.
  • the above method embodiments are implemented when the computer program is executed by the processor.
  • the program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present invention.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present invention relate to the technical field of wireless local area networks. Disclosed are a multi-service scenario identification method and a decision forest model training method. The multi-service scenario identification method comprises: obtaining a data stream feature of packet data of a service scenario to be identified; inputting the data stream feature into a pre-trained decision forest model, the decision forest model comprising N decision trees, the decision trees being used for identifying service scenarios corresponding to data stream features, N being a natural number greater than 0, and a training sample of the decision forest model comprising data stream features of a plurality of service scenarios; and obtaining, according to identification results of the N decision trees, a service scenario corresponding to the packet data. According to the multi-service scenario identification method provided in the embodiments of the present invention, multi-service scenario identification can be achieved, and the use experience of a user is improved.

Description

Multi-service scenario identification method and decision forest model training method
Technical Field
Embodiments of the present invention relate to the technical field of wireless local area networks, and in particular to a multi-service scenario identification method and a decision forest model training method.
Background
A service scenario identification algorithm allows wireless local area network (WLAN) access points and devices to perceive the applications, services, and scenarios that a user is currently accessing, such as gaming, voice, video, and live streaming. Based on the characteristics of those applications, services, and scenarios, the access point then provides the user with different WLAN parameter configurations and services to maximize the user's network access experience. For example, game packets are sent with priority to give the user a low-latency, low-stutter gaming experience. The service scenario identification algorithm is the cornerstone of this whole process, so its performance is crucial.
Current service scenario identification algorithms include the following:
(1) Packet-based service scenario identification: in a private network, devices exchange traffic packets that contain service scenario type information. Service scenarios can be identified by interpreting the contents of these packets.
(2) Access-feature-based service scenario identification: specific scenarios, such as web browsing or gaming, access specific ports and IP addresses. Service scenarios are identified by building a one-to-one mapping library between access features and service scenarios.
(3) Service scenario identification based on deep packet inspection: this method extracts the specific feature fields that each scenario carries in its traffic, then compares the traffic content with a pre-established feature library to determine the service scenario of the current traffic.
However, the packet-based, access-feature-based, and deep-packet-inspection-based service scenario identification algorithms described above can only identify specific service scenarios and cannot identify multiple service scenarios.
Summary of the Invention
The main purpose of the embodiments of the present invention is to provide a multi-service scenario identification method and a decision forest model training method that can identify multiple service scenarios and improve the user experience.
To achieve at least the above purpose, an embodiment of the present invention provides a multi-service scenario identification method, including: acquiring data flow features of packet data of a service scenario to be identified; inputting the data flow features into a pre-trained decision forest model, where the decision forest model includes N decision trees, the decision trees are used to identify the service scenario of the data flow features, N is a natural number greater than 0, and the training samples of the decision forest model include data flow features of multiple service scenarios; and obtaining the service scenario of the packet data according to the identification results of the N decision trees.
To achieve at least the above purpose, an embodiment of the present invention also provides a decision forest model training method, including: acquiring training samples, where the training samples include data flow features of multiple service scenarios; and training an initial decision forest model according to the training samples to obtain a trained decision forest model, where the decision forest model includes N decision trees, the decision trees are used to identify the service scenario of the data flow features, and N is a natural number greater than 0.
To achieve at least the above purpose, an embodiment of the present invention also provides a multi-service scenario identification device, including: a feature acquisition module configured to acquire data flow features of packet data of a service scenario to be identified; an input module configured to input the data flow features into a pre-trained decision forest model, where the decision forest model includes N decision trees, the decision trees are used to identify the service scenario of the data flow features, N is a natural number greater than 0, and the training samples of the decision forest model include data flow features of multiple service scenarios; and a scenario acquisition module configured to obtain the service scenario of the packet data according to the identification results of the N decision trees.
To achieve at least the above purpose, an embodiment of the present invention also provides a decision forest model training device, including: a sample acquisition module configured to acquire training samples, where the training samples include data flow features of multiple service scenarios; and a training module configured to train an initial decision forest model according to the training samples to obtain a trained decision forest model, where the decision forest model includes N decision trees, the decision trees are used to identify the service scenario of the data flow features, and N is a natural number greater than 0.
To achieve at least the above purpose, an embodiment of the present invention also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above multi-service scenario identification method or the above decision forest model training method.
To achieve at least the above purpose, an embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above multi-service scenario identification method or the above decision forest model training method.
The multi-service scenario identification method proposed by the present invention first acquires the data flow features of the packet data of the service scenario to be identified and inputs them into a pre-trained decision forest model. The decision forest model includes N decision trees that are used to identify the service scenario of the data flow features, where N is a natural number greater than 0, and the training samples of the decision forest model include data flow features of multiple service scenarios. The service scenario of the packet data is then obtained according to the identification results of the N decision trees. Because the training samples include data flow features of multiple service scenarios and therefore cover a wide range of scenarios, inputting the acquired data flow features into the trained decision forest model yields the service scenario of the packet data from the identification results of its decision trees; that is, multiple service scenarios can be identified. Corresponding WLAN parameter configurations and services are then provided according to the service scenario of the packet data, which effectively improves the user experience.
Brief Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding drawings. These illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures in the drawings are not drawn to scale.
FIG. 1 is a system architecture diagram provided according to an embodiment of the present invention;
FIG. 2 is a structural diagram of an access point device provided according to an embodiment of the present invention;
FIG. 3 is a flowchart of a multi-service scenario identification method provided according to an embodiment of the present invention;
FIG. 4 is a structural diagram of a decision forest model provided according to an embodiment of the present invention;
FIG. 5 is a flowchart of a multi-service scenario identification method provided according to another embodiment of the present invention;
FIG. 6 is a flowchart of a decision forest model training method provided according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a multi-service scenario identification device provided according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a decision forest model training device provided according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an electronic device provided according to an embodiment of the present invention.
Detailed Description
To make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the drawings. Those of ordinary skill in the art will understand that many technical details are given in the embodiments so that the reader can better understand the present invention. However, the technical solutions claimed by the present invention can be implemented even without these technical details and without the various changes and modifications based on the following embodiments. The division into embodiments below is for convenience of description and should not limit the specific implementation of the present invention; the embodiments may be combined with and refer to one another where they do not contradict each other.
One embodiment of the present invention relates to a multi-service scenario identification method applied to an access point device. The application scenario of the embodiments of the present invention may include, but is not limited to, the system architecture shown in FIG. 1, which includes an access point device and multiple user equipments.
Specifically, a WLAN has at least one access point device. The link between a user equipment and the access point device is not limited to wireless, and the access point device provides unlimited data access services for the user equipment.
One access point device can provide unlimited data access services for one or more user equipments at the same time. A user equipment may be, but is not limited to, a mobile phone, an Internet of Things terminal, or another access point device. Communication between the access point device and the user equipment may be based on, but is not limited to, the 802.11 family of protocols.
In a specific implementation, the structure of the access point device may be as shown in FIG. 2 and includes: a radio frequency module, a physical link layer, a media access control layer, and a multi-service scenario identification module.
The radio frequency module and the physical link layer are configured to demodulate the wireless signal of the user equipment and pass the demodulated signal to the media access control layer.
The media control layer is configured to acquire the packet data, of the service scenario to be identified, that the user equipment sends to the Internet.
The multi-service scenario identification module is configured to: acquire the data flow features of the packet data of the service scenario to be identified; input the data flow features into a pre-trained decision forest model, where the decision forest model includes N decision trees, the decision trees are used to identify the service scenario of the data flow features, N is a natural number greater than 0, and the training samples of the decision forest model include data flow features of multiple service scenarios; and obtain the service scenario of the packet data according to the identification results of the N decision trees.
The implementation flowchart of the multi-service scenario identification method of this embodiment is shown in FIG. 3 and includes:
Step 301: acquire the data flow features of the packet data of the service scenario to be identified.
Step 302: input the data flow features into the pre-trained decision forest model.
Step 303: obtain the service scenario of the packet data according to the identification results of the decision trees.
In this embodiment, the data flow features of the packet data of the service scenario to be identified are first acquired and input into a pre-trained decision forest model. The decision forest model includes N decision trees that are used to identify the service scenario of the data flow features, where N is a natural number greater than 0, and the training samples of the decision forest model include data flow features of multiple service scenarios. The service scenario of the packet data is then obtained according to the identification results of the N decision trees. Because the training samples include data flow features of multiple service scenarios and therefore cover a wide range of scenarios, inputting the acquired data flow features into the trained decision forest model yields the service scenario of the packet data from the identification results of its decision trees; that is, multiple service scenarios can be identified. Corresponding WLAN parameter configurations and services are then provided according to the service scenario of the packet data, which effectively improves the user experience.
The implementation details of the multi-service scenario identification method of this embodiment are described below. They are provided only to aid understanding and are not required for implementing this solution.
In step 301, the access point device identifies and collects statistics on the packet data sent by the user equipment to obtain the data flow features of the packet data.
The data flow features include one or any combination of the following: source IP, source port, destination IP, destination port, protocol type, maximum packet size, minimum packet size, average packet size, variance of packet size, maximum packet exchange time, minimum packet exchange time, and average packet exchange time.
Specifically, the source IP is the Internet Protocol address of the sender of the data flow; the source port is the port address of the sender of the data flow; the destination IP is the Internet Protocol address of the receiver of the data flow; the destination port is the port address of the receiver of the data flow; the protocol type is the Internet protocol type, for example Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol (UDP), or Internet Control Message Protocol (ICMP); the maximum and minimum packet sizes are obtained directly from the media control layer in the access point device; the average packet size is calculated from the maximum and minimum packet sizes; the variance of the packet size is calculated from the maximum, minimum, and average packet sizes; the maximum and minimum packet exchange times are recorded by a timer in the access point device; and the average packet exchange time is calculated from the maximum and minimum packet exchange times.
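As a rough illustration of the feature set listed above, the following Python sketch collects the twelve features for one flow. It rests on several assumptions that are not taken from the text: the names `FlowFeatures` and `summarize_flow` are hypothetical, the mean and variance are computed over all observed packets rather than derived from the maximum and minimum alone, and "packet exchange time" is interpreted as the gap between consecutive packet arrival times.

```python
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass
class FlowFeatures:
    """Hypothetical container for the twelve data flow features."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    max_size: int
    min_size: int
    avg_size: float
    var_size: float
    max_gap: float
    min_gap: float
    avg_gap: float


def summarize_flow(src_ip, src_port, dst_ip, dst_port, protocol, sizes, timestamps):
    """Build flow features from packet sizes (bytes) and ascending arrival times (s)."""
    if len(sizes) < 2 or len(sizes) != len(timestamps):
        raise ValueError("need at least two packets with matching timestamps")
    # Assumption: exchange time is the interval between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return FlowFeatures(
        src_ip, src_port, dst_ip, dst_port, protocol,
        max(sizes), min(sizes), mean(sizes), pvariance(sizes),
        max(gaps), min(gaps), mean(gaps),
    )


features = summarize_flow("192.0.2.10", 50000, "198.51.100.7", 443, "TCP",
                          [100, 300, 200], [0.0, 0.5, 2.0])
print(features)
```

The addresses and port numbers in the usage lines are placeholder values chosen for illustration only.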
In step 302, the access point device inputs the acquired data flow features into the pre-trained decision forest model.
The decision forest model is a data structure obtained by training with the gradient boosting decision tree method. Its structure is shown in FIG. 4 and includes N decision trees, where N is a natural number greater than 0.
Specifically, each decision tree consists of nodes and leaf nodes. Each node contains a logical test on one data flow feature; when the test is true, the corresponding subtree is entered and the node logic of that subtree is evaluated, until a leaf node is reached. The number of leaf nodes of each decision tree equals the number of service scenario classes.
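The traversal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented structure: the `Node` and `Leaf` classes, the `evaluate` function, the "feature <= threshold" form of the logical test, and all feature names, thresholds, and weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Leaf node: one service scenario class and its weight."""
    scenario: str
    weight: float


@dataclass
class Node:
    """Internal node: a logical test on a single data flow feature."""
    feature: str
    threshold: float
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def evaluate(tree, features):
    """Walk from the root to a leaf and return (scenario, weight)."""
    node = tree
    while isinstance(node, Node):
        # Assumed test form: feature value <= threshold.
        passed = features[node.feature] <= node.threshold
        node = node.if_true if passed else node.if_false
    return node.scenario, node.weight


# Toy tree with three leaves, one per (hypothetical) scenario class.
tree = Node("avg_size", 200.0,
            Leaf("voice", 0.7),
            Node("dst_port_norm", 0.01,
                 Leaf("web browsing", 0.6),
                 Leaf("video", 0.9)))

print(evaluate(tree, {"avg_size": 120.0, "dst_port_norm": 0.00443}))
```

Each tree in the forest would return one such (scenario, weight) pair for the same input features.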
By inputting the data flow features into the above decision forest model with N decision trees, the identification results of the N decision trees are obtained. The identification result of a decision tree includes the identified service scenario and the weight of the identified service scenario.
In one example, when the data flow features include the source IP and/or destination IP, the acquired source IP and/or destination IP are processed after the data flow features of the packet data are acquired and before they are input into the pre-trained decision forest model, in order to improve the performance of service scenario identification.
Specifically, the source IP and/or destination IP address is converted to binary data, the binary data is converted to an unsigned integer, and decimal precision normalization is then applied to that unsigned integer to obtain the normalized source IP and/or destination IP for input into the decision forest model.
The normalized source IP and/or destination IP is calculated as:
[Formula image PCTCN2022118249-appb-000001; not reproduced in this text]
Here the symbol shown in image PCTCN2022118249-appb-000002 is the normalized source IP and/or destination IP, IP_int is the integer-valued source IP and/or destination IP, and K_IP is the normalization coefficient of the source IP and/or destination IP. In one example, K_IP is 9.
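Because the formula itself is only available as an image, the following Python sketch assumes the common decimal scaling form, in which the unsigned integer value of the address is divided by 10 raised to the coefficient K_IP. That form, the function name `normalize_ip`, and the restriction to IPv4 are assumptions made for illustration; only the conversion to an unsigned integer and the example value K_IP = 9 come from the text.

```python
import ipaddress


def normalize_ip(address: str, k_ip: int = 9) -> float:
    """Convert an IPv4 address to an unsigned integer, then scale it.

    Assumed form: IP_int / 10**K_IP (decimal scaling). With K_IP = 9 the
    largest IPv4 value, 4294967295, maps to roughly 4.29.
    """
    ip_int = int(ipaddress.IPv4Address(address))  # 32-bit unsigned integer
    return ip_int / 10 ** k_ip


print(normalize_ip("192.168.1.1"))
```

Under this assumption, every IPv4 address lands in a small numeric range, which keeps the feature on a scale comparable to the other inputs.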
In one example, when the data flow features include the source port and/or destination port, decimal precision normalization is applied to the source port and/or destination port after the data flow features of the packet data are acquired and before they are input into the pre-trained decision forest model, yielding the normalized source port and/or destination port for input into the decision forest model, in order to improve the performance of service scenario identification.
The normalized source port and/or destination port is calculated as:
[Formula image PCTCN2022118249-appb-000003; not reproduced in this text]
Here the symbol shown in image PCTCN2022118249-appb-000004 is the normalized source port and/or destination port, port is the source port and/or destination port, and K_port is the normalization coefficient of the source port and/or destination port. In one example, K_port is 5.
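The port formula is likewise only available as an image. The sketch below assumes the same decimal scaling form, port / 10**K_port; the function name `normalize_port` and the range check are hypothetical additions, while the example value K_port = 5 comes from the text.

```python
def normalize_port(port: int, k_port: int = 5) -> float:
    """Scale a 16-bit port number by an assumed factor of 10**K_port.

    With K_port = 5, the largest port, 65535, maps to 0.65535, so all
    normalized ports fall below 1.
    """
    if not 0 <= port <= 65535:
        raise ValueError("port must be in the range 0..65535")
    return port / 10 ** k_port


print(normalize_port(443))
```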
In one example, when the data flow features include the protocol type, the currently obtained protocol type is mapped to the corresponding integer according to a preset mapping between protocol types and integers after the data flow features of the packet data are acquired and before they are input into the pre-trained decision forest model, and that integer is used as the protocol type input into the decision forest model, in order to improve the performance of service scenario identification.
The mapping function used is as follows:
[Formula image PCTCN2022118249-appb-000005; not reproduced in this text]
Pro_int is the protocol type after mapping to an integer, and Protocol is the protocol type name in the data flow features.
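The actual mapping table is contained in the formula image and is not reproduced here, so the Python sketch below uses purely hypothetical integer codes to show the shape of such a lookup. The names `PROTOCOL_TO_INT` and `map_protocol`, the specific codes, and the fallback value for unknown protocols are all assumptions.

```python
# Hypothetical preset mapping; the patent's actual integer codes are not
# recoverable from the text.
PROTOCOL_TO_INT = {
    "TCP": 0,
    "UDP": 1,
    "ICMP": 2,
}

UNKNOWN_PROTOCOL = -1  # assumed fallback for protocols outside the table


def map_protocol(protocol: str) -> int:
    """Map a protocol type name to its preset integer code."""
    return PROTOCOL_TO_INT.get(protocol.upper(), UNKNOWN_PROTOCOL)


print(map_protocol("udp"))
```

A fixed lookup of this kind turns a categorical feature into a number that a decision tree node can compare against a threshold.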
It should be noted that the above ways of obtaining the normalized source IP and/or destination IP and the normalized source port and/or destination port are only examples; the specific normalization method is not unique.
In step 303, the access point device obtains the service scenario of the packet data according to the identification results of the N decision trees.
Specifically, each leaf node of each decision tree in the decision forest model contains a value that represents the weight of the service scenario class corresponding to that leaf node for the current input packet data. Therefore, according to the identification results of the N decision trees, the weights of identified service scenarios of the same class are accumulated to obtain a weight accumulation value for each identified service scenario, and the service scenario with the highest weight accumulation value is taken as the service scenario of the packet data.
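The accumulation step can be sketched in Python as follows. The function name `aggregate`, the representation of each tree result as a (scenario, weight) pair, and the sample weights are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate(tree_results):
    """Sum the weights per scenario and pick the scenario with the largest sum.

    tree_results: iterable of (scenario, weight) pairs, one per decision tree.
    Returns (best_scenario, totals_by_scenario).
    """
    totals = defaultdict(float)
    for scenario, weight in tree_results:
        totals[scenario] += weight
    if not totals:
        raise ValueError("at least one decision tree result is required")
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Three trees: two vote for "video", one for "voice".
best, totals = aggregate([("video", 0.4), ("voice", 0.5), ("video", 0.3)])
print(best, totals)
```

In this toy input no single tree gives "video" the largest weight, yet its accumulated weight exceeds that of "voice", which is the behavior the accumulation rule is meant to capture.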
It is worth noting that the multi-service scenario identification method of the embodiment of the present invention identifies multiple service scenarios based on a decision forest model and achieves a relatively high identification accuracy.
In one example, the following test was carried out to demonstrate the identification accuracy of the multi-service scenario identification method of the embodiment of the present invention:
Network traffic was captured with the network packet analysis software Wireshark, and the packet data was analyzed and labeled. The service scenarios of the packet data were divided into the following 15 categories: network access, system, download, web browsing, voice, mail, streaming media, social media, chat, remote writing, music, cloud storage, software upgrade, video, and other. Each category has a different number of data flow samples, between 2000 and 6000, and the probability of a scenario category is defined as:
Figure PCTCN2022118249-appb-000006
Table 1 shows the detection probability of the multi-service scenario identification method of the embodiment of the present invention for each service scenario category:
Table 1
Class name | Detection probability (%) | Number of training samples | Number of test samples
All applications | 87.14 | 280555 | 70052
Network access | 99.18 | 23852 | 6148
System | 99.11 | 24129 | 5871
Download | 97.64 | 23940 | 6060
Web page | 96.94 | 23975 | 6025
Other | 95.64 | 24057 | 5943
Voice | 87.02 | 24011 | 5989
Mail | 86.80 | 18168 | 4540
Streaming media | 84.42 | 24042 | 5958
Social media | 83.63 | 23951 | 6059
Chat | 81.11 | 23947 | 6053
Music | 74.96 | 2850 | 695
Cloud storage | 71.83 | 24167 | 5833
Software upgrade | 67.28 | 13211 | 3289
Video | 61.17 | 6255 | 1589
As can be seen from Table 1, the identification accuracy of the multi-service scenario identification method of the embodiment of the present invention reaches 87.14%. The detection probability exceeds 95% for 6 categories and exceeds 80% for 11 categories, which ensures a high overall detection probability.
It should be noted that the above examples in this embodiment are given only to facilitate understanding and do not limit the technical solutions of the embodiments of the present invention.
Another embodiment of the present invention relates to a method for identifying multiple service scenarios. Implementation details of the method of this embodiment are described below; the following content is provided only to facilitate understanding and is not required for implementing this solution. FIG. 5 is a flowchart of the method for identifying multiple service scenarios of this embodiment, which specifically includes:
Step 501, update the data flow characteristics within the most recent preset duration into the training samples of the decision forest model.
Specifically, after a certain period of time, the access point device requests the user equipment to report the services it used within the most recent preset duration, takes the data flow characteristics acquired within that duration as the data flow characteristics of those services, and updates them into the training samples of the decision forest model.
The data flow characteristics within the most recent preset duration are labeled with service scenarios, and their data volume is the same as that of the training samples before the update, which avoids excessive reliance on the data flow characteristics within the most recent preset duration.
In one example, the data flow characteristics can be labeled with service scenarios as follows: with the permission of the user equipment, the access point device uploads the original packet data reported by the user equipment to a server, and the server labels the service scenarios by deep packet inspection; alternatively, the service scenarios are labeled manually.
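The sample update of step 501 can be sketched as below. The sample representation, the function name, and the policy of deferring the update when too few recent samples are available are all assumptions; the text only states that the recent data volume equals the volume of the training samples before the update.

```python
def update_training_samples(old_samples, recent_samples):
    """Merge recent labeled flows into the training set at equal volume.

    Each sample is a (features, scenario_label) pair; recent_samples is
    assumed to be ordered from oldest to newest.
    """
    n = len(old_samples)
    if len(recent_samples) < n:
        # Assumed policy: not enough recent data yet, keep the old set as is.
        return list(old_samples)
    # Keep only the newest n recent samples so that both parts contribute
    # the same data volume to the updated training set.
    newest = recent_samples[len(recent_samples) - n:]
    return list(old_samples) + list(newest)


old = [([0.1, 0.2], "chat"), ([0.3, 0.4], "video")]
recent = [([0.5, 0.6], "mail"), ([0.7, 0.8], "music"), ([0.9, 1.0], "voice")]
updated = update_training_samples(old, recent)
```

Capping the recent portion at the size of the existing set is one way to keep the updated model from being dominated by the latest traffic.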
Step 502, update the decision forest model according to the updated training samples.
According to the updated training samples, the access point device uses the data flow characteristics within the most recent preset duration to add a new leaf node under a node of an original decision tree of the decision forest model, thereby updating the decision tree and, in turn, the decision forest model.
Step 503, acquire the data flow characteristics of the packet data of the service scenario to be identified.
Step 503 is substantially the same as step 301 and is not repeated here.
Step 504, input the data flow characteristics into the updated decision forest model.
Step 505, obtain the service scenario of the packet data according to the identification results of the decision trees.
In this embodiment, by updating the data flow characteristics within the most recent preset duration into the training samples of the decision forest model and updating the decision forest model according to the updated training samples, the decision forest model can be optimized, which further improves the identification performance for multiple service scenarios.
Another embodiment of the present invention relates to a method for training a decision forest model, applied to an access point device. Implementation details of the training method of this embodiment are described below; the following content is provided only to facilitate understanding and is not required for implementing this solution. FIG. 6 is a flowchart of the method for training a decision forest model of this embodiment, which specifically includes:
Step 601, acquire training samples.
Specifically, the access point device acquires data flow characteristics from user equipment as the training samples of the initial decision forest model. There may be one or more user equipments, so that the training samples include the data flow characteristics of multiple service scenarios.
The data flow characteristics include one or any combination of the following: source IP, source port, destination IP, destination port, protocol type, maximum packet size, minimum packet size, average packet size, variance of packet size, maximum packet exchange time, minimum packet exchange time, and average packet exchange time.
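The statistical features in this list can be computed from a captured flow roughly as follows. This is a minimal sketch: the input format, the dictionary keys, the use of population variance, and the reading of "packet exchange time" as the interval between consecutive packets are all assumptions.

```python
from statistics import mean, pvariance


def flow_statistics(packets):
    """Compute size and timing statistics for one data flow.

    packets: list of (timestamp_seconds, size_bytes) tuples ordered by time.
    """
    if len(packets) < 2:
        raise ValueError("need at least two packets to compute intervals")
    sizes = [size for _, size in packets]
    times = [ts for ts, _ in packets]
    # Interval between consecutive packets (assumed meaning of exchange time).
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "max_size": max(sizes),
        "min_size": min(sizes),
        "mean_size": mean(sizes),
        "size_variance": pvariance(sizes),  # population variance (assumed)
        "max_interval": max(gaps),
        "min_interval": min(gaps),
        "mean_interval": mean(gaps),
    }


stats = flow_statistics([(0.0, 100), (0.5, 300), (1.5, 200)])
```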
In a specific implementation, the source IP and/or destination IP is a source IP and/or destination IP that has been converted to binary data, converted to an unsigned integer, and normalized with decimal precision; the source port and/or destination port is a source port and/or destination port that has been normalized with decimal precision; and the protocol type is an integer obtained by mapping according to a preset mapping between protocol types and integers, so as to improve the identification performance of the decision forest model.
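The IP and port preprocessing can be illustrated with the sketch below. The scaling constants, the number of decimal places, and the function names are assumptions; as noted elsewhere in the description, the specific normalization method is not unique, and this sketch handles IPv4 only.

```python
import ipaddress

IPV4_MAX = 2**32 - 1  # assumed scaling constant for IPv4 addresses
PORT_MAX = 65535      # largest valid port number
PRECISION = 6         # assumed number of decimal places to keep


def normalize_ip(ip: str) -> float:
    """Dotted IPv4 string -> binary data -> unsigned integer -> [0, 1]."""
    packed = ipaddress.IPv4Address(ip).packed          # 4 bytes of binary data
    as_uint = int.from_bytes(packed, byteorder="big")  # unsigned integer
    return round(as_uint / IPV4_MAX, PRECISION)


def normalize_port(port: int) -> float:
    """Scale a port number into [0, 1] with fixed decimal precision."""
    if not 0 <= port <= PORT_MAX:
        raise ValueError("port out of range")
    return round(port / PORT_MAX, PRECISION)
```

Scaling both fields into the same numeric range keeps a 32-bit address from dwarfing the other features purely because of its magnitude.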
In one example, the following test was carried out to show that training the model on data flow characteristics that have undergone the above data processing improves the identification performance of the decision forest model:
The test procedure is the same as that of the first embodiment and is not repeated here. The difference is that, before the initial decision forest model is trained, the above data processing is not applied to the acquired data flow characteristics. The detection probability for each service scenario category after multi-service scenario identification is shown in Table 2:
Table 2
Category | Detection probability (%)
Network access | 99.21
Other | 95.57
Download | 95.56
System | 95.11
Voice | 85.14
Web page | 84.21
All services | 83.83
Mail | 80.73
Chat | 79.88
Streaming media | 79.80
Music | 79.09
Social media | 76.39
Cloud storage | 71.77
Software upgrade | 69.23
Video | 58.42
As can be seen from Table 2, the overall identification probability drops from 87.14% in Table 1 to 83.83%. The number of service categories with a detection probability above 95% decreases from 6 to 4, and the number with a detection probability above 80% decreases from 11 to 8. For individual services, such as social media, the detection probability drops by as much as 7.24 percentage points.
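The per-category comparison between the two tables can be reproduced with a few lines of arithmetic. The values below are copied from Table 1 and Table 2; the dictionary keys and the function name are illustrative, and the two "all" rows are given a common key.

```python
# Detection probability (%) with data processing (Table 1).
TABLE_1 = {
    "all": 87.14, "network access": 99.18, "system": 99.11,
    "download": 97.64, "web page": 96.94, "other": 95.64, "voice": 87.02,
    "mail": 86.80, "streaming media": 84.42, "social media": 83.63,
    "chat": 81.11, "music": 74.96, "cloud storage": 71.83,
    "software upgrade": 67.28, "video": 61.17,
}
# Detection probability (%) without data processing (Table 2).
TABLE_2 = {
    "all": 83.83, "network access": 99.21, "system": 95.11,
    "download": 95.56, "web page": 84.21, "other": 95.57, "voice": 85.14,
    "mail": 80.73, "streaming media": 79.80, "social media": 76.39,
    "chat": 79.88, "music": 79.09, "cloud storage": 71.77,
    "software upgrade": 69.23, "video": 58.42,
}


def detection_drop(with_processing, without_processing):
    """Percentage-point change per category (positive means a drop)."""
    return {name: round(with_processing[name] - without_processing[name], 2)
            for name in with_processing}


drops = detection_drop(TABLE_1, TABLE_2)
```

A positive entry in `drops` marks a category that scored higher with preprocessing; a negative entry would mark a category that scored higher without it.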
Therefore, when training the decision forest model, using processed data flow characteristics as the training samples can effectively improve service identification performance.
Step 602, train the initial decision forest model according to the training samples to obtain the trained decision forest model.
Specifically, the access point device trains the initial decision forest model on the acquired training samples using the gradient boosting decision tree method, so as to obtain the trained decision forest model. The decision forest model includes N decision trees, the decision trees are used to identify the service scenario of data flow characteristics, and N is a natural number greater than 0.
The specific training procedure is as follows. First, x_i is defined as the i-th training sample, y_i as the service scenario category corresponding to the i-th training sample, and
Figure PCTCN2022118249-appb-000007
as the model's prediction for the i-th training sample.
Here, T is the number of decision trees and f_t is the function of the t-th decision tree, t = 1, ..., T; the aim is to determine the training parameters of the decision trees at which the identification accuracy of the decision forest model is highest.
θ_t is defined as the parameters of f_t, that is, the training parameters of the decision tree at which the identification accuracy of the decision forest model is highest. The optimal training parameters
Figure PCTCN2022118249-appb-000008
can be obtained by the following formula:
Figure PCTCN2022118249-appb-000009
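The formula images are not reproduced in this text. As a hedged reading only, a formulation commonly used for gradient boosted trees and consistent with the symbols defined above (x_i, y_i, T, f_t, θ_t) is sketched below. The loss function l and the optional regularization term Ω are assumptions and may differ from the formulas in the figures.

```latex
\hat{y}_i = \sum_{t=1}^{T} f_t(x_i;\,\theta_t),
\qquad
\{\theta_t^{*}\}_{t=1}^{T}
  = \arg\min_{\theta_1,\dots,\theta_T}
    \left[\, \sum_{i} l\bigl(y_i,\hat{y}_i\bigr)
           + \sum_{t=1}^{T} \Omega(f_t) \right]
```

Under this reading, the prediction is the sum of the outputs of the T trees, and the optimal parameters are those that minimize the total loss over the training samples.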
In one example, T iterations are performed on the training samples. In each iteration, on the basis of the training parameters used in the previous iteration, a new tree is added that decreases the objective function
Figure PCTCN2022118249-appb-000010
and
Figure PCTCN2022118249-appb-000011
is defined as the prediction of the model updated in the t-th iteration for the i-th training sample. In the t-th iteration,
Figure PCTCN2022118249-appb-000012
is obtained by the following calculation:
Figure PCTCN2022118249-appb-000013
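The iterative, additive training described above can be illustrated with a deliberately simplified sketch. It is not the patent's model: it uses a single numeric feature, depth-one regression trees (stumps), squared-error loss, and a learning rate, all of which are assumptions chosen to keep the example short. What it does show is the core idea that each iteration adds one tree fitted to reduce the remaining error of the previous iteration's prediction.

```python
def fit_stump(xs, residuals):
    """Fit a depth-one tree: return (threshold, left_value, right_value)."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    if best is None:  # all feature values identical: fall back to a constant
        avg = sum(residuals) / len(residuals)
        return xs[0], avg, avg
    return best[1:]


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Additive training: prediction(t) = prediction(t-1) + rate * f_t(x)."""
    preds = [0.0] * len(xs)
    trees = []
    for _ in range(rounds):
        # For squared-error loss, the residual is the negative gradient.
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lv, rv = fit_stump(xs, residuals)
        trees.append((thr, lv, rv))
        preds = [p + learning_rate * (lv if x <= thr else rv)
                 for x, p in zip(xs, preds)]
    return trees


def predict(trees, x, learning_rate=0.5):
    """Sum the scaled outputs of all trees for one input value."""
    return sum(learning_rate * (lv if x <= thr else rv) for thr, lv, rv in trees)


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
trees = boost(xs, ys)
```

A multi-class scenario classifier would typically train one such additive score per category and compare the scores, which corresponds to the per-category weight accumulation used at identification time.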
In this embodiment, by obtaining the optimal training parameters, that is, the training parameters of the decision trees (for example, the number of decision trees), when training the decision forest model, the identification accuracy of the decision forest model can be improved.
The division of the above methods into steps is only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, they all fall within the protection scope of this patent. Adding insignificant modifications to, or introducing insignificant designs into, an algorithm or process without changing the core design of the algorithm or process also falls within the protection scope of this patent.
Another embodiment of the present invention relates to an apparatus for identifying multiple service scenarios. Details of the apparatus of this embodiment are described below; the following content is provided only to facilitate understanding and is not required for implementing this example. FIG. 7 is a schematic diagram of the apparatus for identifying multiple service scenarios of this embodiment, which includes a feature acquisition module 701, an input module 702, and a scenario acquisition module 703.
Specifically, the feature acquisition module 701 is configured to acquire the data flow characteristics of the packet data of the service scenario to be identified.
The input module 702 is configured to input the data flow characteristics into a pre-trained decision forest model, where the decision forest model includes N decision trees, the decision trees are configured to identify the service scenario of data flow characteristics, N is a natural number greater than 0, and the training samples of the decision forest model include the data flow characteristics of multiple service scenarios.
In one example, the input module 702 is further configured to, when the data flow characteristics include a source IP and/or a destination IP: convert the source IP and/or destination IP address into binary data; convert the binary data into an unsigned integer; perform decimal precision normalization on the unsigned integer to obtain the normalized source IP and/or destination IP for input into the decision forest model; and then input the normalized source IP and/or destination IP into the pre-trained decision forest model.
In one example, the input module 702 is further configured to, when the data flow characteristics include a source port and/or a destination port, perform decimal precision normalization on the source port and/or destination port to obtain the normalized source port and/or destination port for input into the decision forest model, and then input the normalized source port and/or destination port into the pre-trained decision forest model.
In one example, the input module 702 is further configured to, when the data flow characteristics include a protocol type, map the currently acquired protocol type to its corresponding integer according to a preset mapping between protocol types and integers, and then input the corresponding integer into the pre-trained decision forest model.
The scenario acquisition module 703 is configured to obtain the service scenario of the packet data according to the identification results of the N decision trees.
In one example, the scenario acquisition module 703 is further configured to accumulate, according to the identification results of the N decision trees, the weights of identified service scenarios of the same category to obtain an accumulated weight for each identified service scenario, and to take the service scenario with the highest accumulated weight as the service scenario of the packet data.
Another embodiment of the present invention relates to an apparatus for training a decision forest model. Details of the training apparatus of this embodiment are described below; the following content is provided only to facilitate understanding and is not required for implementing this example. FIG. 8 is a schematic diagram of the apparatus for training a decision forest model of this embodiment, which includes a sample acquisition module 801 and a training module 802.
Specifically, the sample acquisition module 801 is configured to acquire training samples, where the training samples include the data flow characteristics of multiple service scenarios.
The training module 802 is configured to train the initial decision forest model according to the training samples to obtain the trained decision forest model, where the decision forest model includes N decision trees, the decision trees are used to identify the service scenario of data flow characteristics, and N is a natural number greater than 0.
It is readily apparent that this embodiment is an apparatus embodiment corresponding to the above embodiment of the method for identifying multiple service scenarios, and this embodiment can be implemented in cooperation with the above method embodiment. The relevant technical details and technical effects mentioned in the above embodiment remain valid in this embodiment and are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the above embodiment.
It is worth noting that the modules involved in the above two embodiments are all logical modules. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the embodiments of the present invention, units that are not closely related to solving the technical problem addressed by the embodiments of the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
Another embodiment of the present invention relates to an electronic device which, as shown in FIG. 9, includes at least one processor 901 and a memory 902 communicatively connected to the at least one processor 901. The memory 902 stores instructions executable by the at least one processor 901, and the instructions are executed by the at least one processor 901 to enable the at least one processor 901 to perform the method for identifying multiple service scenarios and the method for training a decision forest model of the above embodiments.
The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, and connects the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further here. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or multiple elements, such as multiple receivers and transmitters, and provides a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium through an antenna; the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor when performing operations.
Another embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above method embodiments.
That is, those skilled in the art can understand that all or some of the steps of the methods of the above embodiments can be completed by a program instructing related hardware. The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that the above implementations are specific examples for implementing the embodiments of the present invention, and that in practical applications various changes in form and detail may be made to them without departing from the spirit and scope of the embodiments of the present invention.

Claims (14)

  1. A method for identifying multiple service scenarios, comprising:
    acquiring data flow characteristics of packet data of a service scenario to be identified;
    inputting the data flow characteristics into a pre-trained decision forest model, wherein the decision forest model comprises N decision trees, the decision trees are used to identify the service scenario of data flow characteristics, N is a natural number greater than 0, and training samples of the decision forest model comprise data flow characteristics of multiple service scenarios; and
    obtaining the service scenario of the packet data according to identification results of the N decision trees.
  2. The method for identifying multiple service scenarios according to claim 1, wherein the identification result of a decision tree comprises an identified service scenario and a weight of the identified service scenario;
    the obtaining the service scenario of the packet data according to the identification results of the N decision trees comprises:
    accumulating, according to the identification results of the N decision trees, the weights of identified service scenarios of the same category to obtain an accumulated weight for each identified service scenario; and
    taking the service scenario with the highest accumulated weight as the service scenario of the packet data.
  3. The method for identifying multiple service scenarios according to claim 1, wherein the data flow characteristics comprise one or any combination of the following: source IP, source port, destination IP, destination port, protocol type, maximum packet size, minimum packet size, average packet size, variance of packet size, maximum packet exchange time, minimum packet exchange time, and average packet exchange time.
  4. The method for identifying multiple service scenarios according to claim 3, wherein, when the data flow characteristics comprise a source IP and/or a destination IP, after the acquiring data flow characteristics of packet data of a service scenario to be identified and before the inputting the data flow characteristics into a pre-trained decision forest model, the method further comprises:
    converting the source IP and/or destination IP address into binary data;
    converting the binary data into an unsigned integer;
    performing decimal precision normalization on the binary data converted into an unsigned integer to obtain a normalized source IP and/or destination IP for input into the decision forest model;
    when the data flow characteristics comprise a source port and/or a destination port, after the acquiring data flow characteristics of packet data of a service scenario to be identified and before the inputting the data flow characteristics into a pre-trained decision forest model, the method further comprises:
    performing decimal precision normalization on the source port and/or destination port to obtain a normalized source port and/or destination port for input into the decision forest model.
  5. The method for identifying multiple service scenarios according to claim 3, wherein, when the data flow characteristics comprise a protocol type, after the acquiring data flow characteristics of packet data of a service scenario to be identified and before the inputting the data flow characteristics into a pre-trained decision forest model, the method further comprises:
    mapping the currently acquired protocol type to a corresponding integer according to a preset mapping between protocol types and integers, and using the corresponding integer as the protocol type input into the decision forest model.
  6. 根据权利要求1至5中任一项所述的多业务场景的识别方法,其中,在所述根据所述 N个决策树的识别结果,获取所述报文数据的业务场景后,还包括:The method for identifying multiple business scenarios according to any one of claims 1 to 5, wherein, after obtaining the business scenarios of the message data according to the identification results of the N decision trees, further comprising:
    将最近预设时长内的数据流特征,更新至所述训练样本,其中,所述最近预设时长内的数据流特征标定有业务场景,所述最近预设时长内的数据流特征的数据量,与更新前的所述训练样本的数据量相同;Updating the data flow characteristics within the latest preset time length to the training samples, wherein the data flow characteristics within the latest preset time length are marked with business scenarios, and the data volume of the data flow characteristics within the latest preset time length , which is the same as the data volume of the training sample before updating;
    根据更新后的所述训练样本,更新所述决策森林模型。The decision forest model is updated according to the updated training samples.
  7. 根据权利要求1至5中任一项所述的多业务场景的识别方法,其中,所述决策森林模型采用梯度提升决策树方法训练得到。The method for identifying multiple business scenarios according to any one of claims 1 to 5, wherein the decision forest model is trained by using a gradient boosting decision tree method.
  8. 一种决策森林模型的训练方法,包括:A training method for a decision forest model, comprising:
    获取训练样本,所述训练样本包括多个业务场景的数据流特征;Acquiring training samples, the training samples include data flow characteristics of multiple business scenarios;
    根据所述训练样本对初始的决策森林模型进行训练,得到训练好的决策森林模型;Training the initial decision forest model according to the training samples to obtain the trained decision forest model;
    其中,所述决策森林模型包括N个决策树,所述决策树用于识别数据流特征的业务场景;所述N为大于0的自然数。Wherein, the decision forest model includes N decision trees, and the decision trees are used to identify business scenarios of data flow characteristics; the N is a natural number greater than 0.
  9. 根据权利要求8所述的决策森林模型的训练方法,其中,所述数据流特征包括以下之一或其任意组合:源IP、源端口、目的IP、目标端口、协议类型、最大报文大小、最小报文大小、平均报文大小、报文大小的方差、最大报文交换时间、最小报文交换时间、平均报文交换时间。The training method of the decision forest model according to claim 8, wherein the data flow characteristics include one of the following or any combination thereof: source IP, source port, destination IP, destination port, protocol type, maximum packet size, Minimum packet size, average packet size, variance of packet size, maximum packet exchange time, minimum packet exchange time, average packet exchange time.
  10. The training method for a decision forest model according to claim 9, wherein the source IP and/or destination IP is a source IP and/or destination IP that has been converted to binary data and then to an unsigned integer, and normalized to decimal precision;
    the source port and/or destination port is a source port and/or destination port that has been normalized to decimal precision; and
    the protocol type is an integer obtained by mapping according to a preset mapping relationship between protocol types and integers.
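One way to read the preprocessing in claim 10 is sketched below. It assumes IPv4 addresses, interprets "normalized to decimal precision" as scaling into [0, 1] and rounding to a fixed number of decimal places, and uses a made-up protocol table; none of these specifics are stated in the claim.

```python
import ipaddress

# Hypothetical preset mapping between protocol types and integers.
PROTOCOL_TO_INT = {"TCP": 1, "UDP": 2, "ICMP": 3}


def normalize_ip(ip, digits=6):
    """IPv4 string -> packed bytes -> unsigned 32-bit int -> value in [0, 1]."""
    packed = ipaddress.IPv4Address(ip).packed      # 4 bytes of binary data
    as_uint = int.from_bytes(packed, "big")        # unsigned integer
    return round(as_uint / 0xFFFFFFFF, digits)


def normalize_port(port, digits=6):
    """Scale a 16-bit port number into [0, 1]."""
    return round(port / 65535, digits)


def encode_protocol(name):
    """Look up the integer code for a protocol name (case-insensitive)."""
    return PROTOCOL_TO_INT[name.upper()]


features = [
    normalize_ip("192.168.1.10"),
    normalize_port(443),
    encode_protocol("tcp"),
]
```

Scaling addresses and ports into the same range keeps any single raw field from dominating by magnitude, although tree-based models are generally insensitive to monotonic rescaling.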
  11. A multi-service scenario identification apparatus, comprising:
    a feature acquisition module, configured to acquire data flow characteristics of packet data whose service scenario is to be identified;
    an input module, configured to input the data flow characteristics into a pre-trained decision forest model, wherein the decision forest model comprises N decision trees, the decision trees are used to identify the service scenario of data flow characteristics, N is a natural number greater than 0, and the training samples of the decision forest model comprise data flow characteristics of multiple service scenarios; and
    a scenario acquisition module, configured to acquire the service scenario of the packet data according to the identification results of the N decision trees.
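The claim does not fix how the results of the N decision trees are combined. A common choice for classification forests is majority voting, sketched below as one possible reading; the tree callables and the feature key `avg_size` are hypothetical stand-ins for trained trees.

```python
from collections import Counter


def identify_scenario(trees, features):
    """Return the service scenario predicted by most of the trees.

    trees: iterable of callables mapping a feature dict to a label.
    Ties are resolved in favor of the label that was voted first.
    """
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Three toy "trees" standing in for a trained forest (N = 3).
trees = [
    lambda f: "video" if f["avg_size"] > 500 else "voice",
    lambda f: "video",
    lambda f: "voice",
]
large = identify_scenario(trees, {"avg_size": 900})
small = identify_scenario(trees, {"avg_size": 100})
```

With a gradient boosting model, the per-tree outputs would instead be summed into class scores, so voting is only appropriate for independently trained trees.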
  12. A training apparatus for a decision forest model, comprising:
    a sample acquisition module, configured to acquire training samples, the training samples comprising data flow characteristics of multiple service scenarios; and
    a training module, configured to train an initial decision forest model according to the training samples to obtain a trained decision forest model;
    wherein the decision forest model comprises N decision trees, the decision trees are used to identify the service scenario of data flow characteristics, and N is a natural number greater than 0.
  13. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the multi-service scenario identification method according to any one of claims 1 to 7, or the training method for a decision forest model according to any one of claims 8 to 10.
  14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the multi-service scenario identification method according to any one of claims 1 to 7, or the training method for a decision forest model according to any one of claims 8 to 10.
PCT/CN2022/118249 2021-12-03 2022-09-09 Multi-service scenario identification method and decision forest model training method WO2023098222A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111467644.8A CN116304650A (en) 2021-12-03 2021-12-03 Multi-service scene identification method and decision forest model training method
CN202111467644.8 2021-12-03

Publications (1)

Publication Number Publication Date
WO2023098222A1 (en) 2023-06-08

Family

ID=86611491

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/118249 WO2023098222A1 (en) 2021-12-03 2022-09-09 Multi-service scenario identification method and decision forest model training method

Country Status (2)

Country Link
CN (1) CN116304650A (en)
WO (1) WO2023098222A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109951444A (en) * 2019-01-29 2019-06-28 中国科学院信息工程研究所 A kind of encryption Anonymizing networks method for recognizing flux
CN110460488A (en) * 2019-07-01 2019-11-15 华为技术有限公司 Business stream recognition method and device, model generating method and device
US20190394527A1 (en) * 2018-06-22 2019-12-26 Samsung Electronics Co., Ltd. Machine learning based packet service classification methods for experience-centric cellular scheduling
CN111245667A (en) * 2018-11-28 2020-06-05 中国移动通信集团浙江有限公司 Network service identification method and device
CN112532466A (en) * 2019-09-17 2021-03-19 华为技术有限公司 Flow identification method and device and storage medium

Also Published As

Publication number Publication date
CN116304650A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN111865815B (en) Flow classification method and system based on federal learning
CN107181724B (en) Identification method and system of cooperative flow and server using method
WO2021000874A1 (en) Service flow identification method and apparatus, and model generation method and apparatus
WO2019184640A1 (en) Indicator determination method and related device thereto
WO2021169308A1 (en) Data stream type identification model updating method and related device
CN109672980B (en) Method, device and storage medium for determining wireless local area network hotspot corresponding to interest point
CN113765857A (en) Message forwarding method, device, equipment and storage medium
CN111953552B (en) Data flow classification method and message forwarding equipment
CN113452676B (en) Detector distribution method and Internet of things detection system
CN108462707B (en) Mobile application identification method based on deep learning sequence analysis
CN111355671B (en) Network traffic classification method, medium and terminal equipment based on self-attention mechanism
WO2023098222A1 (en) Multi-service scenario identification method and decision forest model training method
CN113127693B (en) Traffic data packet statistics method, device, equipment and storage medium
CN112822208A (en) Internet of things equipment identification method and system based on block chain
CN110224932B (en) Method and system for rapidly forwarding data
CN112468324A (en) Graph convolution neural network-based encrypted traffic classification method and device
WO2015192572A1 (en) Method, apparatus and system for configuring quality of service (qos) parameters
CN115866582A (en) Equipment identification method, device, equipment and storage medium
CN112839051B (en) Encryption flow real-time classification method and device based on convolutional neural network
WO2022012429A1 (en) Method for implementing terminal verification, apparatus, system, device, and storage medium
CN113660174B (en) Service type determining method and related equipment
CN111711946B (en) IoT (internet of things) equipment identification method and identification system under encrypted wireless network
WO2022179352A1 (en) Acquisition cycle determining method, apparatus and system, device, and storage medium
CN110943973B (en) Data stream classification method and device, model training method and device and storage medium
CN116192997B (en) Event detection method and system based on network flow

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22900030

Country of ref document: EP

Kind code of ref document: A1