CN115811430A - Data stream identification method, device, equipment and storage medium

Info

Publication number: CN115811430A
Application number: CN202211527660.6A
Authority: CN
Inventors: 蒋荣; 孙乐; 郑威
Original assignee: Nanjing Zhongfu Information Technology Co Ltd
Current assignee: Nanjing Zhongfu Information Technology Co Ltd
Priority date: 2022-11-30
Filing date: 2022-11-30
Publication date: 2023-03-17

Abstract

The application provides a data stream identification method, a device, equipment and a storage medium, and relates to the technical field of computer network security. The method comprises the following steps: determining a data stream where each data packet is located according to received first packet header information of each data packet to obtain at least one data stream to be identified, wherein each data stream to be identified comprises at least one data packet, and the first packet header information of each data packet in the same data stream to be identified is consistent; obtaining characteristic information of the data stream to be identified according to the second packet header information of each data packet in the data stream to be identified; and inputting the characteristic information of the data stream to be identified into a target flow identification model obtained by pre-training, and predicting whether the data stream to be identified is the onion routing data stream. By applying the embodiment of the application, the accuracy of identifying the Tor data stream can be improved.

Description

Data stream identification method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer network security technologies, and in particular, to a data stream identification method, apparatus, device, and storage medium.

Background

Tor (onion routing), which operates in the darknet, is currently the most widely used anonymous communication tool by virtue of its low latency and high security, but it is largely used for illegal purposes. Therefore, Tor data flows need to be accurately identified during network communication so that they can be managed and controlled in a timely manner.

Currently, Tor data flows are primarily identified based on specific port numbers and/or protocol characteristics.

However, Tor communication is not necessarily limited to one specific port, nor to a network transmission protocol with particular characteristic information, so the accuracy of Tor data stream identification in the prior art is poor.

Disclosure of Invention

In view of the defects in the prior art, the present application aims to provide a data stream identification method, apparatus, device and storage medium that can improve the accuracy of identifying Tor data streams.

In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:

In a first aspect, an embodiment of the present application provides a data stream identification method, where the method includes:

determining a data stream in which each data packet is located according to received first packet header information of each data packet to obtain at least one data stream to be identified, wherein each data stream to be identified comprises at least one data packet, the first packet header information of each data packet in the same data stream to be identified is consistent, and the first packet header information comprises: user IP, network IP, user port, network port and protocol identification;

obtaining feature information of the data stream to be identified according to second header information of each data packet in the data stream to be identified, wherein the second header information comprises: data length and arrival time, and the feature information of the data stream to be identified comprises: run information, uplink load information, downlink load information and time interval information of the data stream to be identified;

and inputting the characteristic information of the data stream to be identified into a target flow identification model obtained by pre-training, and predicting whether the data stream to be identified is an onion routing data stream.

Optionally, the obtaining, according to the second header information of each data packet in the data stream to be identified, the feature information of the data stream to be identified includes:

obtaining the payload length and the arrival time of each data packet according to the second header information of each data packet in the data stream to be identified;

and extracting the characteristic information of the data stream to be identified according to the effective load length and the arrival time of each data packet and the extraction strategy corresponding to each type of characteristic information.

Optionally, the extracting the feature information of the data stream to be identified according to the payload length and the arrival time of each data packet and the extraction policy corresponding to each type of feature information includes:

screening at least one first data packet associated with a run from all data packets of the data stream to be identified, and determining the run information of the data stream to be identified according to the payload length of each first data packet;

screening out at least one second data packet and at least one third data packet which are respectively associated with an uplink message and a downlink message from all data packets of the data stream to be identified, and determining uplink load information and downlink load information of the data stream to be identified according to the effective load length of each second data packet and the effective load length of each third data packet;

and determining the time interval information of the data stream to be identified according to the arrival time of all data packets of the data stream to be identified.

Optionally, the obtaining, according to the second header information of each data packet in the data stream to be identified, the feature information of the data stream to be identified includes:

carrying out protocol identification on the load information of each data packet in the data stream to be identified to obtain a protocol identification result, wherein the protocol identification result comprises: recognizable and unrecognizable;

and if the protocol identification result is unrecognizable, performing randomness detection on the data stream to be identified to obtain a random value, and obtaining the feature information of the data stream to be identified according to the random value and the second header information of each data packet.

Optionally, the obtaining, according to the random value and the second header information of each data packet, the feature information of the data stream to be identified includes:

and if the random value is greater than or equal to a preset threshold value, obtaining the characteristic information of the data stream to be identified according to the random value and the second packet header information of each data packet.
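The randomness check described above can be sketched as follows. The text does not say how the random value is computed, so this minimal sketch swaps in Shannon entropy over payload bytes as one plausible measure; the function names and the threshold of 7.0 bits per byte are hypothetical and not taken from the application.

```python
import math
from collections import Counter

# Hypothetical threshold in bits per byte; the application only says "preset threshold".
PRESET_THRESHOLD = 7.0


def byte_entropy(payload: bytes) -> float:
    """Shannon entropy of the payload in bits per byte (0.0 to 8.0).

    Assumption: entropy stands in for the unspecified randomness detection.
    """
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def passes_randomness_check(payload: bytes, threshold: float = PRESET_THRESHOLD) -> bool:
    """True when the random value is greater than or equal to the preset threshold."""
    return byte_entropy(payload) >= threshold
```

Under this assumption, encrypted payloads, whose bytes look close to uniform, would tend to pass the check, while plain text or highly repetitive payloads would not.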

Optionally, the method further comprises:

and if the data stream to be identified is an onion routing data stream, sending reminding information to a preset port, wherein the reminding information comprises a user IP (Internet protocol) of the data stream to be identified.

Optionally, before the feature information of the data stream to be identified is input into the target traffic identification model obtained through pre-training and it is predicted whether the data stream to be identified is an onion routing data stream, the method further includes:

obtaining a plurality of sample data packets respectively included in each sample data stream from a plurality of preset application terminals;

obtaining characteristic information of each sample data stream according to second header information of each sample data packet in each sample data stream;

determining a label of each sample data stream according to the identifier of each preset application end;

constructing a training sample according to the characteristic information and the label of each sample data stream;

and inputting the training sample into an initial flow identification model for training to obtain the target flow identification model.

In a second aspect, an embodiment of the present application further provides a data stream identification apparatus, where the apparatus includes:

a first determining module, configured to determine, according to first packet header information of each received data packet, a data stream in which each data packet is located, to obtain at least one data stream to be identified, where each data stream to be identified includes at least one data packet, and the first packet header information of each data packet in the same data stream is consistent, where the first packet header information includes: user IP, network IP, user port, network port and protocol identification;

a second determining module, configured to obtain feature information of the data stream to be identified according to second header information of each data packet in the data stream to be identified, where the second header information includes: the data length and the arrival time, and the feature information of the data stream to be identified includes: run information, uplink load information, downlink load information and time interval information of the data stream to be identified;

and the identification module is used for inputting the characteristic information of the data stream to be identified into a target flow identification model obtained by pre-training and identifying whether the data stream to be identified is an onion routing data stream.

Optionally, the second determining module is specifically configured to obtain a payload length and an arrival time of each data packet according to second header information of each data packet in the data stream to be identified; and extracting the characteristic information of the data stream to be identified according to the effective load length and the arrival time of each data packet and the extraction strategy corresponding to each type of characteristic information.

Optionally, the second determining module is further specifically configured to screen at least one first data packet associated with a run from all data packets of the data stream to be identified, and determine the run information of the data stream to be identified according to a payload length of each first data packet; screening out at least one second data packet and at least one third data packet which are respectively associated with an uplink message and a downlink message from all data packets of the data stream to be identified, and determining uplink load information and downlink load information of the data stream to be identified according to the effective load length of each second data packet and the effective load length of each third data packet; and determining the time interval information of the data stream to be identified according to the arrival time of all data packets of the data stream to be identified.

Optionally, the second determining module is further specifically configured to perform protocol identification on the load information of each data packet in the data stream to be identified, so as to obtain a protocol identification result, where the protocol identification result includes: recognizable and unrecognizable; and if the protocol identification result is unidentifiable, performing randomness detection on the data stream to be identified to obtain a random value, and obtaining the characteristic information of the data stream to be identified according to the random value and the second packet header information of each data packet.

Optionally, the second determining module is further specifically configured to, if the random value is greater than or equal to a preset threshold, obtain the feature information of the data stream to be identified according to the random value and the second header information of each data packet.

Optionally, the apparatus further comprises: a sending module;

and the sending module is used for sending reminding information to a preset port if the data stream to be identified is an onion routing data stream, wherein the reminding information comprises a user IP (Internet protocol) of the data stream to be identified.

Optionally, the apparatus further comprises:

the acquisition module is used for acquiring a plurality of sample data packets respectively included in each sample data stream from a plurality of preset application terminals;

the second determining module is further configured to obtain feature information of each sample data stream according to second header information of each sample data packet in each sample data stream;

the second determining module is further configured to determine a label of each sample data stream according to the identifier of each preset application end;

the construction module is used for constructing training samples according to the characteristic information and the labels of the sample data streams;

and the training module is used for inputting the training sample into an initial flow identification model for training to obtain the target flow identification model.

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, when the electronic device runs, the processor communicates with the storage medium through the bus, and the processor executes the machine-readable instructions to execute the steps of the data stream identification method of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the data flow identification method in the first aspect are performed.

The beneficial effect of this application is:

the embodiment of the application provides a data stream identification method, a device, equipment and a storage medium, wherein the method comprises the following steps: determining a data stream where each data packet is located according to received first packet header information of each data packet to obtain at least one data stream to be identified, wherein each data stream to be identified comprises at least one data packet, and the first packet header information of each data packet in the same data stream to be identified is consistent; obtaining characteristic information of the data stream to be identified according to the second packet header information of each data packet in the data stream to be identified; and inputting the characteristic information of the data stream to be identified into a target flow identification model obtained by pre-training, and predicting whether the data stream to be identified is an onion routing data stream.

By adopting the data stream identification method provided by the embodiment of the application, the characteristic information for representing the type of the data stream to be identified can be obtained according to the attributes of the data packet included in the data stream to be identified, such as the data length and the arrival time carried in the packet header information, and the characteristic information can accurately represent the real information of the data stream to be identified. Based on the above, the feature information of the data stream to be identified is input into the target traffic identification model, and the feature information of the data stream to be identified is analyzed and processed by using the target traffic identification model, so that whether the data stream to be identified is an onion routing data stream, namely a Tor data stream, can be accurately predicted. It can be seen that this can improve the accuracy of identifying the Tor data stream.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

Fig. 1 is a flow framework diagram of a model training phase according to an embodiment of the present application;

Fig. 2 is a flow framework diagram of a model application phase according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of a data stream identification method according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of another data stream identification method according to an embodiment of the present application;

Fig. 5 is a schematic flowchart of another data stream identification method according to an embodiment of the present application;

Fig. 6 is a schematic flowchart of another data stream identification method according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a data stream identification apparatus according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Before explaining the embodiments of the present application in detail, an application scenario of the present application is described first. The application scenario may specifically be a scenario of identifying data stream categories in a network communication environment. The present application divides data streams into Tor data streams and non-Tor data streams. A non-Tor data stream may be understood as a data stream having a specific application protocol, for example, the application protocol corresponding to Tencent Video or the application protocol corresponding to WeChat. Conversely, a data stream whose protocol type is a non-specific application protocol may be a Tor data stream, where the non-specific application protocol may include, for example, HTTPS (Hypertext Transfer Protocol Secure), TCP (Transmission Control Protocol), and the like.

It can be understood that the Tor network was originally designed to protect privacy and prevent users from being tracked on the internet, but it is now largely used for illegal purposes, so how to accurately identify Tor data streams is a problem that urgently needs to be solved.

To address the above problem, the present application identifies data streams based on deep learning, so that Tor data streams can be identified and then controlled in a timely manner.

The following examples of the present application generally include two phases, a model training phase and a model application phase. Fig. 1 is a flow framework diagram of a model training phase according to an embodiment of the present disclosure, and as shown in fig. 1, a basic flow of the model training phase may include sample data acquisition, sample class labeling, feature information extraction, and model training. For example, the electronic device may collect a preset amount of sample data according to actual needs; labeling sample types for each sample data according to the source of each sample data, wherein a Tor data stream type can be represented by a label 0, and a non-Tor data stream type can be represented by a label 1; extracting characteristic information of sample data; and inputting the characteristic information and the label corresponding to each sample data into the initial flow identification model for training. It should be noted that, the specific contents of the model training phase can be described with reference to the following related example section, and will not be further described here.
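The labeling and sample-construction steps of the training phase can be sketched as follows. This is a minimal illustration only: the label values 0 (Tor) and 1 (non-Tor) follow the convention stated above, while the function names, the source identifiers, and the representation of a sample as a (features, label) pair are assumptions made for the example.

```python
TOR_LABEL = 0      # label for a Tor data stream, as stated in the text
NON_TOR_LABEL = 1  # label for a non-Tor data stream, as stated in the text


def label_for_source(source_id, tor_sources):
    """Derive the label from the identifier of the application end that produced the stream.

    Assumption: `tor_sources` is a set of identifiers known to generate Tor traffic.
    """
    return TOR_LABEL if source_id in tor_sources else NON_TOR_LABEL


def build_training_samples(sample_streams, tor_sources):
    """Pair each sample stream's feature vector with its label.

    `sample_streams` is an iterable of (source_id, feature_vector) tuples.
    """
    return [
        (features, label_for_source(source_id, tor_sources))
        for source_id, features in sample_streams
    ]
```

The resulting list of (features, label) pairs would then be fed to whatever initial traffic identification model is chosen; the model itself is not specified here.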

Fig. 2 is a flow framework diagram of a model application phase according to an embodiment of the present disclosure, and as shown in fig. 2, a basic flow of the model application phase may include extracting feature information of a data stream to be identified, performing online prediction, and reporting a result. For example, the electronic device may automatically classify each data packet according to packet header information of each data packet to obtain a plurality of data streams, each data stream may be referred to as a data stream to be identified, extract feature information of the data stream to be identified according to a data length of the data stream to be identified, input the feature information of the data stream to be identified into a target traffic identification model trained in the manner of fig. 1 to perform online prediction, and when a prediction result is that the data stream to be identified is a Tor data stream, may send a prompt message to a preset port, that is, report a result. It should be noted that specific contents of the model application phase can be described with reference to the following related example section, and will not be further described here.

Therefore, the method and the device can effectively solve the problem that the Tor data stream cannot be accurately identified by the traditional protocol characteristics and the problem that the Tor data stream cannot be accurately identified by the traditional specific port, namely the accuracy of identifying the Tor data stream can be improved.

The data stream identification method mentioned in the present application is first illustrated with reference to fig. 2 as follows. Fig. 3 is a schematic flowchart of a data stream identification method according to an embodiment of the present application. As shown in fig. 3, the method may include:

S301, determining the data stream where each data packet is located according to the received first packet header information of each data packet, and obtaining at least one data stream to be identified.

Each data stream to be identified comprises at least one data packet, the first packet header information of each data packet in the same data stream to be identified is consistent, and the first packet header information comprises: user IP, network IP, user port, network port, and protocol identification. The user may be understood as a source, i.e. the end that sends the data packet, and the network may be understood as a destination, i.e. the end that receives the data packet.

For example, the above-mentioned electronic device receives data packets in real time. After receiving a data packet, the electronic device may extract the header portion of the data packet, acquire the first header information from the header portion, and determine the data stream to which the data packet belongs according to the user IP, network IP, user port, network port and protocol identifier in the first header information and a pre-established correspondence between first header information and data streams. The electronic device may process each received data packet in this way. Because each data packet belongs to one data stream, the electronic device obtains at least one data stream.

For example, if the number of data packets in a certain data stream reaches a preset number (for example, 20), the data stream may be regarded as a data stream to be identified, and the following steps may be performed.
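The grouping in S301 can be sketched as follows. This is a minimal illustration, assuming each received packet has already been parsed into a record; the `Packet` type, its field names, and the function names are hypothetical, while the five key fields and the example count of 20 come from the text.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # First header information (the flow key)
    user_ip: str
    network_ip: str
    user_port: int
    network_port: int
    protocol: int
    # Second header information
    data_length: int      # bytes
    arrival_time: float   # seconds

PRESET_PACKET_COUNT = 20  # example value given in the text


def group_into_flows(packets):
    """Assign each packet to the flow whose first header information matches."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt.user_ip, pkt.network_ip, pkt.user_port,
               pkt.network_port, pkt.protocol)
        flows[key].append(pkt)
    return flows


def flows_to_identify(flows, preset_count=PRESET_PACKET_COUNT):
    """Keep only flows that have reached the preset number of packets."""
    return {key: pkts for key, pkts in flows.items() if len(pkts) >= preset_count}
```

A dictionary keyed by the five fields plays the role of the pre-established correspondence between first header information and data streams.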

S302, obtaining the characteristic information of the data stream to be identified according to the second header information of each data packet in the data stream to be identified.

The second header information includes the data length and the arrival time, and the characteristic information of the data stream to be identified includes the run information, the uplink load information, the downlink load information and the time interval information of the data stream to be identified.

Taking one data stream to be identified as an example, the header portion of each data packet included in the data stream to be identified is extracted, and the second header information, including the data length and the arrival time, is acquired from the header portion of each data packet. After the data length and the arrival time corresponding to each data packet are obtained, they can be processed according to various feature calculation strategies, such as the average and the standard deviation, to obtain features that characterize the category of the data stream to be identified, such as run information, uplink load information, downlink load information and time interval information. Other features may of course also be included, which is not limited in the present application.

That is to say, the class of the data stream to be identified can be represented more accurately by using the feature information of the data stream to be identified, and then the class of the data stream to be identified can be obtained after the feature information is identified by using the target traffic identification model.

S303, inputting the characteristic information of the data stream to be identified into a target flow identification model obtained through pre-training, and identifying whether the data stream to be identified is an onion routing data stream.

For example, the run information, the uplink load information, the downlink load information and the time interval information included in the feature information of the data stream to be identified may be arranged in a preset order to obtain input data, and the input data is input into the target traffic identification model. After analyzing the input data, the target traffic identification model may predict whether the data stream to be identified is an onion routing data stream. The prediction result output by the target traffic identification model is either that the data stream to be identified is an onion routing data stream, that is, a Tor data stream, or that the data stream to be identified is a non-Tor data stream.
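Arranging the features in a preset order can be sketched as follows. This is an illustrative sketch only: the feature names and their order are hypothetical, the model is left abstract, and the label mapping assumes the 0 = Tor, 1 = non-Tor convention described for the training phase.

```python
# Hypothetical preset order; the application does not fix the names or the order.
FEATURE_ORDER = [
    "run_total_payload",
    "uplink_mean", "uplink_max", "uplink_min",
    "downlink_mean", "downlink_max", "downlink_min",
    "interval_mean", "interval_max", "interval_min",
]


def to_input_vector(features, order=FEATURE_ORDER):
    """Flatten a feature dictionary into a list following the preset order."""
    return [float(features[name]) for name in order]


def interpret_prediction(label):
    """Map a predicted label to a stream type (assumes 0 = Tor, 1 = non-Tor)."""
    return "Tor" if label == 0 else "non-Tor"
```

Keeping the same order at training time and at prediction time matters, because the model only sees positions in the vector, not feature names.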

In summary, in the data stream identification method provided by the present application, the feature information used for characterizing the type of the data stream to be identified can be obtained according to the attributes of the data packets included in the data stream to be identified, such as the data length carried in the packet header information, and the feature information can accurately represent the real information of the data stream to be identified. Based on the above, the feature information of the data stream to be identified is input into the target traffic identification model, and the feature information of the data stream to be identified is analyzed and processed by using the target traffic identification model, so that whether the data stream to be identified is an onion routing data stream, namely a Tor data stream, can be accurately predicted. It can be seen that this can improve the accuracy of identifying the Tor data stream.

Fig. 4 is a schematic flowchart of another data stream identification method according to an embodiment of the present application. As shown in fig. 4, optionally, the obtaining the feature information of the data stream to be identified according to the second header information of each data packet in the data stream to be identified includes:

S401, obtaining the data length and the arrival time of each data packet according to the second packet header information of each data packet in the data stream to be identified.

S402, extracting the characteristic information of the data stream to be identified according to the data length and the arrival time of each data packet and the extraction strategy corresponding to each type of characteristic information.

As can be seen from the above description, each data packet in the data stream to be identified includes a header portion, and the second header information may be obtained from the header portion. The second header information includes the arrival time and the data length, where the data length may be composed of a payload length, a check value length, and the like.

For example, the payload length may be extracted from the data length, after the payload length of each data packet in the data stream to be identified is obtained, the data packets corresponding to various data length features may be obtained according to the payload length of each data packet in the data stream to be identified and the extraction policy corresponding to various data length feature information, and then the feature information of each data length feature may be obtained by calculation according to the payload length of the data packet corresponding to various data length features.

For another example, after the arrival time of each data packet in the data stream to be identified is obtained, the data packets corresponding to the various time features may be obtained according to the arrival time of each data packet and the extraction strategies corresponding to the various types of time feature information, and the feature information of each time feature may then be calculated from the arrival times of the corresponding data packets.

Further, the characteristic information of various data length characteristics and the characteristic information of various time characteristics can be combined into the characteristic information of the data stream to be identified.

Optionally, the extracting the feature information of the data stream to be identified according to the payload length of each data packet and the extraction policy corresponding to each type of feature information includes: screening at least one first data packet associated with the run from all data packets of the data stream to be identified, and determining the run information of the data stream to be identified according to the payload length of each first data packet; screening out at least one second data packet and at least one third data packet which are respectively associated with the uplink message and the downlink message from all data packets of the data stream to be identified, and determining uplink load information and downlink load information of the data stream to be identified according to the effective load length of each second data packet and the effective load length of each third data packet; and determining the time interval information of the data stream to be identified according to the arrival time of all data packets of the data stream to be identified.

For example, the data packets in the data stream to be identified may be divided, according to their direction (for example, from the A end to the B end, or from the B end to the A end), into data packets belonging to a first run and data packets belonging to a second run, where the first run indicates the direction from the A end to the B end and the second run indicates the direction from the B end to the A end. Taking the first run as an example, the payload length of each data packet belonging to the first run (each first data packet) is obtained, the payload lengths of the first data packets are added, and the resulting total payload length is taken as the run information of the data stream to be identified.
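The run computation described above can be sketched as follows. This is only a minimal illustration: the `Packet` type, its field names and the example addresses are assumptions made for the sketch and do not appear in the application.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    src: str          # sender address (hypothetical field name)
    dst: str          # receiver address (hypothetical field name)
    payload_len: int  # payload length in bytes


def run_information(packets, a_end, b_end):
    """Return the total payload length of the first run (A -> B)
    and of the second run (B -> A)."""
    first_run = sum(p.payload_len for p in packets
                    if p.src == a_end and p.dst == b_end)
    second_run = sum(p.payload_len for p in packets
                     if p.src == b_end and p.dst == a_end)
    return first_run, second_run


packets = [
    Packet("10.0.0.1", "203.0.113.5", 100),
    Packet("203.0.113.5", "10.0.0.1", 543),
    Packet("10.0.0.1", "203.0.113.5", 50),
]
print(run_information(packets, "10.0.0.1", "203.0.113.5"))  # (150, 543)
```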

In another example, at least one second data packet and at least one third data packet, associated with the uplink message and the downlink message respectively, are obtained from the first 20 data packets of the data stream to be identified. The average, maximum and minimum uplink load lengths are obtained according to the payload length of each second data packet, that is, the uplink load information comprises the average length, the maximum length and the minimum length of the uplink load. Likewise, the average, maximum and minimum downlink load lengths are obtained according to the payload length of each third data packet, that is, the downlink load information comprises the average length, the maximum length and the minimum length of the downlink load.
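As a hedged sketch of the uplink and downlink statistics, the function below takes the first 20 packets of a stream and computes the average, maximum and minimum payload length per direction. The `(direction, payload_len)` tuple layout, the `"up"`/`"down"` markers and the zero default for an empty direction are assumptions of the sketch.

```python
from statistics import mean


def load_information(packets, limit=20):
    """packets: list of (direction, payload_len) tuples, where direction
    is "up" or "down" (hypothetical markers). Only the first `limit`
    packets of the stream are considered."""
    head = packets[:limit]
    result = {}
    for direction in ("up", "down"):
        lengths = [length for d, length in head if d == direction]
        if lengths:
            result[direction] = {
                "mean": mean(lengths),
                "max": max(lengths),
                "min": min(lengths),
            }
        else:
            # No packet in this direction among the first `limit` packets.
            result[direction] = {"mean": 0, "max": 0, "min": 0}
    return result


stats = load_information([("up", 100), ("down", 500), ("up", 300)])
```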

Illustratively, the message arrival time is extracted from the header information of the first 20 data packets of the data stream to be identified, so as to obtain the information of the average time interval, the maximum time interval, the minimum time interval, and the like included in the time interval information.
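The time interval information can be illustrated in the same way. The sketch assumes that arrival times are given in seconds and in arrival order, and that the interval is the gap between consecutive packets; the application does not fix these details.

```python
from statistics import mean


def time_interval_information(arrival_times, limit=20):
    """Average, maximum and minimum gap between consecutive arrival times
    of the first `limit` packets (times assumed to be in arrival order)."""
    times = arrival_times[:limit]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        # Fewer than two packets: no interval can be measured.
        return {"mean": 0.0, "max": 0.0, "min": 0.0}
    return {"mean": mean(gaps), "max": max(gaps), "min": min(gaps)}


intervals = time_interval_information([0.0, 0.5, 2.0, 3.0])
```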

Fig. 5 is a schematic flowchart of another data stream identification method according to an embodiment of the present application. As shown in fig. 5, optionally, the obtaining the feature information of the data stream to be identified according to the second header information of each data packet in the data stream to be identified includes:

S501, performing protocol identification on the load information of each data packet in the data stream to be identified to obtain a protocol identification result.

The protocol identification result is either recognizable or unrecognizable. Recognizable means that the protocol identification result includes a specific application protocol, such as the application protocol corresponding to Tencent Video or the application protocol corresponding to WeChat. Unrecognizable means that the protocol identification result contains no protocol type, or that the data stream to be identified cannot be attributed to a specific application protocol, that is, the protocol of the data stream to be identified may be a non-specific protocol such as HTTPS or TCP.

For example, a deep packet inspection (DPI) module is preconfigured in the electronic device. Each data packet in the data stream to be identified is input into the DPI module, which extracts the data portion of each data packet, performs protocol identification on the load information in that data portion, and outputs a protocol identification result.

If the protocol identification result is recognizable, the data stream to be identified is filtered out directly, that is, it is forwarded without subsequent feature extraction; if the protocol identification result is unrecognizable, subsequent feature extraction needs to be performed on the data stream to be identified.

S502, if the protocol identification result is unrecognizable, performing randomness detection on the data stream to be identified to obtain a random value, and obtaining the feature information of the data stream to be identified according to the random value and the second header information of each data packet.

Illustratively, when the protocol identification result is unrecognizable, that is, the protocol of the data stream to be identified is not a specific application protocol, randomness detection is performed on the data stream to be identified. It should be understood that the encryption algorithm used by a Tor data stream makes the data in the corresponding data packets appear random, and to a greater degree than the data of a non-Tor data stream. Therefore, after randomness detection is performed on the data stream to be identified, the feature information of the data stream to be identified is obtained according to the resulting random value and the second header information of each data packet in the data stream to be identified.

Optionally, the obtaining, according to the random value and the second header information of each data packet, the feature information of the data stream to be identified includes: and if the random value is greater than or equal to the preset threshold value, obtaining the characteristic information of the data stream to be identified according to the random value and the second packet header information of each data packet.

It should be understood that the larger the random value, the more disordered the data in the data packets of the data stream to be identified; conversely, the smaller the random value, the less disordered that data.

For example, the random value corresponding to the data stream to be identified is compared with a preset threshold. If the comparison result indicates that the random value is greater than or equal to the preset threshold, the probability that the data stream to be identified is a Tor data stream is relatively high, and the data stream to be identified needs to be identified by the target traffic identification model according to its feature information. If the comparison result indicates that the random value is smaller than the preset threshold, the data stream to be identified is regarded as a non-Tor data stream and is filtered out directly, that is, it is forwarded based on the network IP.
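The application does not state which randomness test produces the random value. As one possible illustration only, the sketch below assumes Shannon entropy over the payload bytes (0 to 8 bits per byte) as the random value; the threshold of 7.0 is a hypothetical placeholder, not a value taken from the application.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def needs_model_check(payloads, threshold=7.0):
    """Return (passes_threshold, random_value) for a stream's payloads.
    `threshold` is a hypothetical preset threshold."""
    random_value = byte_entropy(b"".join(payloads))
    return random_value >= threshold, random_value


uniform_stream = [bytes(range(256))]   # every byte value appears once
repetitive_stream = [b"aaaa", b"aaaa"]  # a single repeated byte value
```

Under this assumption, a stream whose bytes are uniformly distributed reaches the maximum of 8 bits per byte and is passed on to the model, while a highly repetitive stream scores close to 0 and is forwarded directly.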

The data streams to be identified can thus be filtered a first time by protocol identification and a second time by randomness detection, and only the data streams that are more likely to be Tor data streams are finally identified by the target traffic identification model, which can improve efficiency and greatly reduce the false identification rate.
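The staged filtering can be summarised by the sketch below. All five callables are hypothetical stand-ins supplied by the caller (for the DPI module, the randomness detector, the feature extractor and the trained model); only the order of the stages follows the description.

```python
def classify_stream(stream, identify_protocol, random_value_of,
                    extract_features, model_predicts_tor, threshold):
    """Apply protocol filtering, then randomness filtering, then the model."""
    # First filter: a recognised application protocol is forwarded directly.
    if identify_protocol(stream) is not None:
        return "filtered_by_protocol"
    # Second filter: a low random value is treated as non-Tor.
    random_value = random_value_of(stream)
    if random_value < threshold:
        return "filtered_by_randomness"
    # Only the remaining streams reach the traffic identification model.
    features = extract_features(stream, random_value)
    return "tor" if model_predicts_tor(features) else "not_tor"


result = classify_stream(
    stream="example-stream",
    identify_protocol=lambda s: None,          # unrecognizable protocol
    random_value_of=lambda s: 7.9,             # high randomness
    extract_features=lambda s, r: [r],         # toy feature vector
    model_predicts_tor=lambda f: f[0] > 7.5,   # toy decision rule
    threshold=7.0,
)
print(result)  # tor
```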

Optionally, the method further comprises: if the data stream to be identified is an onion routing data stream, sending reminding information to a preset port.

The reminding information comprises the user IP of the data stream to be identified, so that staff can trace the source of the Tor data stream according to that user IP.

Fig. 6 is a schematic flowchart of a further data stream identification method according to an embodiment of the present application. As shown in fig. 6, optionally, before the feature information of the data stream to be identified is input into the target traffic identification model obtained through pre-training to identify whether the data stream to be identified is an onion routing data stream, the method further includes:

S601, obtaining a plurality of sample data packets respectively included in each sample data stream from a plurality of preset application ends.

S602, obtaining feature information of each sample data stream according to the second header information of each sample data packet in each sample data stream.

Packets may be captured from a plurality of preset application ends according to actual requirements to obtain a plurality of sample data packets output by each preset application end. For example, the sample data packets output by each preset application end may be obtained according to a preset number of data packets; assuming that the preset number is 20, only 20 sample data packets need to be captured from each preset application end to obtain the sample data stream corresponding to that preset application end.

Taking one sample data stream as an example, the payload length and the arrival time carried in the second header information of each sample data packet in the sample data stream are obtained, and the feature information of the sample data stream can then be obtained from them, where the feature information may include run information, uplink load information, downlink load information, and time interval information of the sample data stream.

S603, determining the label of each sample data stream according to the identifier of each preset application end.

S604, constructing a training sample according to the feature information and the label of each sample data stream.

The preset application ends may include a flight end, a WeChat end, a Tor end, and the like. For example, the label of the sample data streams corresponding to the flight end and the WeChat end may be set to 1, and the label of the sample data stream corresponding to the Tor end may be set to 0, where label 0 represents a Tor data stream and label 1 represents a non-Tor data stream. The feature information and the label of each sample data stream are associated to obtain the training sample corresponding to that sample data stream, that is, each training sample comprises feature information and the label corresponding to the feature information.
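A minimal sketch of the labelling and sample construction follows. The label values (0 for Tor, 1 for non-Tor) are taken from the example above; the application-end identifiers, the feature values and the function names are hypothetical.

```python
TOR_LABEL = 0      # label 0 represents a Tor data stream
NON_TOR_LABEL = 1  # label 1 represents a non-Tor data stream


def label_for(app_id, tor_ids=frozenset({"tor"})):
    """Map the identifier of a preset application end to a label."""
    return TOR_LABEL if app_id in tor_ids else NON_TOR_LABEL


def build_training_samples(sample_streams):
    """sample_streams: iterable of (app_id, feature_vector) pairs.
    Returns a list of (feature_vector, label) training samples."""
    return [(list(features), label_for(app_id))
            for app_id, features in sample_streams]


samples = build_training_samples([
    ("wechat", [150, 200.0, 543.0, 0.12]),  # illustrative feature values
    ("tor", [1086, 543.0, 543.0, 0.03]),
])
```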

S605, inputting the training samples into the initial traffic identification model for training to obtain the target traffic identification model.

The training samples corresponding to the sample data streams may be input into the initial traffic identification model one by one. Taking one training sample as an example, the feature information in the training sample is used as the input of the initial traffic identification model and the label in the training sample is used as the expected output; the initial traffic identification model is trained in this way, and the target traffic identification model is obtained when a training stop condition is satisfied.
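The application does not specify the type of the traffic identification model at this point. Purely as a stand-in, the sketch below trains a small logistic regression by stochastic gradient descent, with a fixed number of epochs playing the role of the training stop condition. The model type, hyperparameters and toy feature values are all assumptions, and the features are assumed to be scaled to roughly the range 0 to 1.

```python
import math


def train_logistic(samples, epochs=500, lr=0.1):
    """samples: list of (feature_vector, label) with label 0 (Tor) or 1 (non-Tor).
    Returns (weights, bias) after a fixed number of epochs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = predicted - label  # gradient of the log loss w.r.t. z
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(model, features):
    """Return 1 (non-Tor) if the score is non-negative, else 0 (Tor)."""
    weights, bias = model
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if z >= 0 else 0


training_samples = [
    ([0.9, 0.8], 0), ([0.8, 0.9], 0),  # toy Tor samples
    ([0.1, 0.2], 1), ([0.2, 0.1], 1),  # toy non-Tor samples
]
model = train_logistic(training_samples)
```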

Fig. 7 is a schematic structural diagram of a data stream identification device according to an embodiment of the present application. As shown in fig. 7, the apparatus includes:

a first determining module 701, configured to determine, according to first packet header information of each received data packet, a data stream where each data packet is located, to obtain at least one data stream to be identified, where each data stream to be identified includes at least one data packet, and the first packet header information of each data packet in the same data stream is consistent, where the first packet header information includes: user IP, network IP, user port, network port and protocol identification;

a second determining module 702, configured to obtain, according to second header information of each data packet in the data stream to be identified, feature information of the data stream to be identified, where the second header information includes: the data length and the arrival time, and the feature information of the data stream to be identified includes: run information, uplink load information, downlink load information and time interval information of the data stream to be identified;

an identifying module 703, configured to input the feature information of the data stream to be identified into a target traffic identification model obtained through pre-training, and identify whether the data stream to be identified is an onion routing data stream.
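The grouping performed by the first determining module can be sketched as follows: data packets whose first packet header information (user IP, network IP, user port, network port and protocol identification) is identical are placed in the same data stream. The dictionary keys and example values are hypothetical names chosen for the sketch.

```python
from collections import defaultdict


def group_into_streams(packets):
    """Group packets by the five fields of the first packet header information."""
    streams = defaultdict(list)
    for packet in packets:
        key = (packet["user_ip"], packet["network_ip"],
               packet["user_port"], packet["network_port"],
               packet["protocol"])
        streams[key].append(packet)
    return dict(streams)


example_packets = [
    {"user_ip": "10.0.0.1", "network_ip": "203.0.113.5",
     "user_port": 50000, "network_port": 443, "protocol": 6, "data_length": 600},
    {"user_ip": "10.0.0.1", "network_ip": "203.0.113.5",
     "user_port": 50000, "network_port": 443, "protocol": 6, "data_length": 583},
    {"user_ip": "10.0.0.2", "network_ip": "198.51.100.7",
     "user_port": 50001, "network_port": 80, "protocol": 6, "data_length": 120},
]
streams = group_into_streams(example_packets)
print(len(streams))  # 2
```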

Optionally, the second determining module 702 is specifically configured to obtain the payload length and the arrival time of each data packet according to the second header information of each data packet in the data stream to be identified; and extracting the characteristic information of the data stream to be identified according to the effective load length and the arrival time of each data packet and the extraction strategy corresponding to each type of characteristic information.

Optionally, the second determining module 702 is further specifically configured to screen at least one first data packet associated with a run from all data packets of the data stream to be identified, and determine the run information of the data stream to be identified according to the payload length of each first data packet; screening out at least one second data packet and at least one third data packet which are respectively associated with the uplink message and the downlink message from all data packets of the data stream to be identified, and determining uplink load information and downlink load information of the data stream to be identified according to the effective load length of each second data packet and the effective load length of each third data packet; and determining the time interval information of the data stream to be identified according to the arrival time of all data packets of the data stream to be identified.

Optionally, the second determining module 702 is further specifically configured to perform protocol identification on the load information of each data packet in the data stream to be identified, so as to obtain a protocol identification result, where the protocol identification result includes: recognizable and unrecognizable; and if the protocol identification result is that the data flow cannot be identified, performing randomness detection on the data flow to be identified to obtain a random value, and obtaining the characteristic information of the data flow to be identified according to the random value and the second header information of each data packet.

Optionally, the second determining module 702 is further specifically configured to, if the random value is greater than or equal to the preset threshold, obtain the feature information of the data stream to be identified according to the random value and the second header information of each data packet.

Optionally, the apparatus further comprises: a sending module;

the sending module is configured to send reminding information to a preset port if the data stream to be identified is an onion routing data stream, where the reminding information comprises the user IP of the data stream to be identified.

Optionally, the apparatus further comprises:

the second determining module 702 is further configured to obtain feature information of each sample data stream according to second header information of each sample data packet in each sample data stream;

a second determining module 702, configured to determine, according to the identifier of each preset application end, a label of each sample data stream;

and the training module is used for inputting the training samples into the initial flow identification model for training to obtain a target flow identification model.

The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.

The above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 8, the electronic device may include: a processor 801, a storage medium 802 and a bus 803, the storage medium 802 storing machine readable instructions executable by the processor 801, the processor 801 communicating with the storage medium 802 via the bus 803 when the electronic device is operating, the processor 801 executing the machine readable instructions to perform the steps of the above method embodiments. The specific implementation and technical effects are similar, and are not described herein again.

Optionally, the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the steps of the above method embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to perform some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Claims

1. A data stream identification method, the method comprising:

obtaining feature information of the data stream to be identified according to second header information of each data packet in the data stream to be identified, where the second header information includes: the data length and the arrival time, and the feature information of the data stream to be identified comprises: run information, uplink load information, downlink load information and time interval information of the data stream to be identified;

2. The method according to claim 1, wherein obtaining the characteristic information of the data stream to be identified according to the second header information of each data packet in the data stream to be identified comprises:

3. The method according to claim 2, wherein said extracting the feature information of the data flow to be identified according to the payload length and the arrival time of each data packet and the extraction policy corresponding to each type of feature information comprises:

screening out at least one first data packet associated with a run from all data packets of the data stream to be identified, and determining run information of the data stream to be identified according to the effective load length of each first data packet;

4. The method of claim 1, wherein the obtaining the characteristic information of the data stream to be identified according to the second header information of each data packet in the data stream to be identified comprises:

5. The method according to claim 4, wherein obtaining the characteristic information of the data stream to be identified according to the random value and the second header information of each of the data packets comprises:

6. The method of claim 1, further comprising:

7. The method of claim 1, wherein before inputting the characteristic information of the data stream to be identified into a pre-trained target traffic identification model and identifying whether the data stream to be identified is an onion routing data stream, the method further comprises:

obtaining a plurality of sample data packets respectively included in each sample data stream from a plurality of preset application terminals;

obtaining characteristic information of each sample data stream according to second header information of each sample data packet in each sample data stream;

8. An apparatus for identifying data streams, the apparatus comprising:

a first determining module, configured to determine, according to first packet header information of each received data packet, a data stream where each data packet is located, to obtain at least one data stream to be identified, where each data stream to be identified includes at least one data packet, and first packet header information of each data packet in the same data stream is consistent, where the first packet header information includes: user IP, network IP, user port, network port and protocol identification;

9. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the data stream identification method according to any one of claims 1 to 7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the data flow identification method according to any one of claims 1 to 7.