CN116319107B

CN116319107B - Data traffic identification model training method and device

Info

Publication number: CN116319107B
Application number: CN202310579054.7A
Authority: CN
Inventors: 饶思哲
Original assignee: Xinhuasan Artificial Intelligence Technology Co ltd
Current assignee: Xinhuasan Artificial Intelligence Technology Co ltd
Priority date: 2023-05-19
Filing date: 2023-05-19
Publication date: 2023-08-18
Anticipated expiration: 2043-05-19
Also published as: CN116319107A

Abstract

The embodiment of the invention provides a data traffic identification model training method and device, and relates to the technical field of networks. The method comprises the following steps: acquiring sample feature information of sample data traffic; generating a sample feature vector of the sample data traffic based on the sample feature information, wherein the number of channels of the sample feature vector is the same as the number of items of the sample feature information; expanding each channel in the sample feature vector to a target number of channels to obtain a channel-adjusted sample feature vector; and training a data traffic recognition model based on the channel-adjusted sample feature vector, wherein the data traffic recognition model is used for recognizing whether data traffic is malicious traffic. By applying the scheme provided by the embodiment of the invention, the identification accuracy of malicious traffic can be improved.

Description

Data traffic identification model training method and device

Technical Field

The embodiment of the invention relates to the technical field of networks, in particular to a data traffic identification model training method and device.

Background

Malicious traffic in a network occupies bandwidth and affects the transmission of normal traffic, so malicious traffic needs to be limited to keep the transmission of normal traffic stable. In the related art, DFI (Deep/Dynamic Flow Inspection) may be performed on data traffic to identify it as malicious or normal. However, when DFI is used in the related art, the acquired data traffic features are often padded with 0, which distorts the features and in turn affects the malicious traffic identification result derived from them; alternatively, data traffic with missing features is filtered out directly, which leaves part of the data traffic unidentified. That is, the related technical schemes can make the malicious traffic identification result inaccurate.

Disclosure of Invention

The embodiment of the invention aims to provide a data traffic identification model training method and device so as to improve the identification accuracy of malicious traffic. The specific technical scheme is as follows:

In a first aspect, an embodiment of the present invention provides a data traffic recognition model training method, where the method includes:

acquiring sample characteristic information of sample data flow;

generating a sample feature vector of the sample data flow based on the sample feature information, wherein the number of channels of the sample feature vector is the same as the number of items of the sample feature information;

expanding each channel in the sample feature vector to a target number of channels to obtain a channel-adjusted sample feature vector, wherein the target number is the quotient of a preset channel number and the channel number of the sample feature vector, the preset channel number is a common multiple of a run of consecutive natural numbers, and the maximum of the consecutive natural numbers is the channel number of the sample feature vector when no feature is missing;

and training a data flow identification model based on the sample feature vector after the channel adjustment, wherein the data flow identification model is used for identifying whether the data flow is malicious or not.

In one embodiment of the present invention, the expanding each channel in the sample feature vector to a target number of channels to obtain a channel-adjusted sample feature vector includes:

expanding each channel in the sample feature vector to a target number of channels;

carrying out preset feature vector processing on the feature vector after the channel expansion to obtain a processing result;

performing pyramid pooling on the processing result to obtain a one-dimensional vector, which serves as the channel-adjusted sample feature vector.

In one embodiment of the present invention, the expanding each channel in the sample feature vector to a target number of channels includes:

for each channel in the sample feature vector, expanding the channel to the preset number of channels, and compressing the channels obtained by the expansion into the target number of channels.

In one embodiment of the present invention, the sample characteristic information is at least one of TCP information, TLS information, DNS information, and HTTP information of the sample data flow.

In one embodiment of the invention, the data traffic recognition model is trained each time using one of the channel-adjusted sample feature vectors.

In a second aspect, an embodiment of the present invention provides a training device for a data traffic recognition model, where the device includes:

the acquisition module is used for acquiring sample characteristic information of the sample data flow;

a generating module, configured to generate a sample feature vector of the sample data flow based on the sample feature information, where the number of channels of the sample feature vector is the same as the number of items of the sample feature information;

the expansion module is used for expanding each channel in the sample feature vector to a target number of channels to obtain a channel-adjusted sample feature vector, wherein the target number is the quotient of a preset channel number and the channel number of the sample feature vector, the preset channel number is a common multiple of a run of consecutive natural numbers, and the maximum of the consecutive natural numbers is the channel number of the sample feature vector when no feature is missing;

the training module is used for training a data flow identification model based on the sample feature vector after the channel adjustment, and the data flow identification model is used for identifying whether the data flow is malicious or not.

In one embodiment of the present invention, the expansion module is specifically configured to:

and aiming at each channel in the sample feature vector, expanding the channel to a preset channel number, compressing the preset channel number obtained by expansion into a target channel number, and obtaining the sample feature vector with the channel adjusted.

In a third aspect, an embodiment of the present invention provides a training electronic device for a data traffic recognition model, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

A memory for storing a computer program;

a processor for implementing the method steps of any one of the first aspects when executing a program stored on a memory.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method steps of any of the first aspects.

The embodiment of the invention has the beneficial effects that:

The embodiment of the invention provides a data traffic recognition model training method, which comprises: acquiring sample feature information of sample data traffic; generating a sample feature vector of the sample data traffic based on the sample feature information, wherein the number of channels of the sample feature vector is the same as the number of items of the sample feature information; expanding each channel in the sample feature vector to a target number of channels to obtain a channel-adjusted sample feature vector, wherein the target number is the quotient of a preset channel number and the channel number of the sample feature vector, the preset channel number is a common multiple of a run of consecutive natural numbers, and the maximum of the consecutive natural numbers is the channel number of the sample feature vector when no feature is missing; and training a data traffic recognition model based on the channel-adjusted sample feature vector, wherein the data traffic recognition model is used for recognizing whether data traffic is malicious traffic.

As can be seen from the above, in the solution provided by the embodiment of the present invention, since the target number is the quotient of the preset channel number and the channel number of the sample feature vector, after each channel in the sample feature vector is expanded to the target number of channels, the total channel number of the channel-adjusted sample feature vector equals the preset channel number. Moreover, since the preset channel number is a common multiple of consecutive natural numbers whose maximum is the channel number of the sample feature vector when no feature is missing, the total channel number of the channel-adjusted sample feature vector is always the preset channel number, no matter how many items of feature information the sample data traffic has. That is, sample data traffic with missing feature information can also be processed into a channel-adjusted sample feature vector whose total channel number is the preset channel number, and the subsequent operations can proceed. In this process, the features of the sample data traffic are not padded with 0, and sample data traffic with missing features is not filtered out; the sample data traffic is identified entirely on the basis of the feature information actually acquired, so the accuracy with which the trained data traffic recognition model identifies malicious traffic can be improved.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other embodiments may be obtained according to these drawings to those skilled in the art.

FIG. 1 is a flow chart of a training method of a data traffic recognition model in the related art;

FIG. 2 is a flow chart of a first data traffic recognition model training method according to an embodiment of the present invention;

FIG. 3 is a flow chart of a second method for training a data traffic recognition model according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a maximum pooling method in the related art;

FIG. 5 is a flow chart of a third data traffic recognition model training method according to an embodiment of the present invention;

FIG. 6 is a flowchart of a fourth data traffic recognition model training method according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a training device for a data traffic recognition model according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a training electronic device for a data traffic recognition model according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by the person skilled in the art based on the present invention are included in the scope of protection of the present invention.

In the related art, when DFI is used to train the data traffic recognition model, the following four items of sample feature information of the sample data traffic need to be obtained: TCP (Transmission Control Protocol) information, TLS (Transport Layer Security) information, DNS (Domain Name System) information, and HTTP (Hypertext Transfer Protocol) information.

However, the sample feature information actually acquired for sample data traffic is often incomplete; for example, the DNS information or HTTP information of the sample data traffic is not acquired. When sample feature information is missing, the related art either pads the acquired sample feature information with 0, which distorts the sample feature information and in turn affects the malicious traffic identification result obtained from it, or directly filters out the sample data traffic with missing sample feature information without identifying it, so that part of the sample data traffic cannot be identified and the malicious traffic identification result is inaccurate.

Referring to FIG. 1, a flow chart of a data traffic recognition model training method in the related art is shown. In FIG. 1, black samples are samples of malicious traffic and white samples are samples of normal traffic. The sample data traffic contains feature information, specifically TCP information, TLS information and DNS/HTTP information. Feature vector standardization is performed on one part of the data in the sample set formed by the black samples and the white samples, that is, the samples are standardized into feature vectors of uniform dimension; the model is trained on the standardized samples, and the trained models form a model library. Verification feature data is extracted from the other part of the data, and the trained models in the model library are verified with it to determine whether their accuracy reaches the standard. A data traffic recognition model is generated based on the model library and is used to distinguish normal traffic from malicious traffic.

The data traffic is real traffic in the network, including TLS traffic and DNS traffic. A flow probe is used to obtain the feature information of the data traffic, including TCP information, TLS information and DNS/HTTP information; feature data is extracted from the obtained information and input into the data traffic recognition model, which outputs whether the data traffic is normal traffic or malicious traffic.

However, the data traffic recognition model training method in the related art requires the feature information of the sample data traffic to contain complete TCP information, TLS information and DNS/HTTP information, and sample data traffic with missing feature information is ignored and not processed. That is, the related art scheme makes the malicious traffic identification result inaccurate.

In order to solve the above problems, the embodiments of the present invention provide a method and an apparatus for training a data traffic recognition model, which are specifically described below.

Firstly, a training method of a data traffic recognition model provided by the embodiment of the invention is explained.

Referring to FIG. 2, a flowchart of a first data traffic recognition model training method according to an embodiment of the present invention is shown. The method may be applied to any device with computing capability, such as a cloud platform or a server, and includes the following steps S201 to S204.

Step S201: sample characteristic information of sample data flow is obtained.

Specifically, the sample feature information may be collected by a network device and sent to the device with computing capability, or it may be prestored by the execution body of this embodiment.

In one embodiment of the present invention, the sample feature information is at least one of TCP information, TLS information, DNS information and HTTP information of the sample data traffic. In theory every item of the sample feature information should be acquired, but in actual processing some items may not be obtainable, in which case the sample feature information is incomplete.

According to the network working principle, the source IP (Internet Protocol ) of the TLS information of the sample data flow is consistent with the source IP of the TCP information, the destination IP of the TLS information is consistent with the destination IP of the TCP information, the destination IP of the DNS information is consistent with the destination IP of the TLS information, and the source IP of the HTTP information is consistent with the source IP of the TLS information.

One way of extracting the sample feature information is to extract the TCP information, TLS information, DNS information and HTTP information of the sample data traffic according to its five-tuple (source IP, destination IP, source port, destination port, and transport protocol).
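As a rough illustration of grouping by five-tuple, the following sketch collects packet records into flows keyed by their five-tuple. The record layout and the names `FiveTuple` and `group_by_five_tuple` are assumptions made for illustration only; the embodiment does not specify a data format.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """Hypothetical flow key; the field names are illustrative."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def group_by_five_tuple(packets):
    """Group packet records (dicts) into flows keyed by their five-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        key = FiveTuple(pkt["src_ip"], pkt["dst_ip"],
                        pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        flows[key].append(pkt)
    return dict(flows)


# Placeholder packet records, not real captured traffic.
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 50000,
     "dst_port": 443, "protocol": "TCP", "length": 120},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 50000,
     "dst_port": 443, "protocol": "TCP", "length": 80},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 50001,
     "dst_port": 53, "protocol": "UDP", "length": 60},
]
flows = group_by_five_tuple(packets)
```

Per-flow statistics such as packet lengths and packet counts could then be computed from each group.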

The TCP information mainly comprises information such as packet length, packet number, interval time between each data packet, byte distribution and the like of the data packets in the sample data flow.

The TLS information mainly includes handshake information, such as cipher suite information and TLS extension.

The DNS information mainly includes suffix information of the website to be accessed, a TTL (Time To Live) value, and the like. For example, the suffix information may be aaa.com, bbb.com, or ccc.com.

The HTTP information mainly includes information such as the user agent and the server.

In the obtained sample feature information, numerical information such as packet length and packet count can be used directly, each value serving as a one-dimensional vector; string data, such as the suffix information in the DNS information, can be expanded in one-hot form into a multidimensional vector.

For example, given 3 possible suffixes aaa.com, bbb.com and ccc.com, if a website is test.bbb.com, its suffix information may be represented after one-hot processing by the 3-dimensional vector [0, 1, 0]; if a website is test.aaa.com, its suffix information may be represented after one-hot processing by the 3-dimensional vector [1, 0, 0].
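The one-hot expansion of suffix information can be sketched as follows. This is a minimal illustration; the helper names `one_hot` and `suffix_of` are hypothetical, and mapping an unknown suffix to an all-zero vector is an assumption rather than something the embodiment states.

```python
def one_hot(value, vocabulary):
    """Return a one-hot list for `value` over the ordered `vocabulary`.

    A value outside the vocabulary (including None) yields all zeros.
    """
    return [1 if value == item else 0 for item in vocabulary]


def suffix_of(hostname, suffixes):
    """Return the first known suffix that `hostname` ends with, else None."""
    for suffix in suffixes:
        if hostname == suffix or hostname.endswith("." + suffix):
            return suffix
    return None


SUFFIXES = ["aaa.com", "bbb.com", "ccc.com"]

print(one_hot(suffix_of("test.bbb.com", SUFFIXES), SUFFIXES))  # [0, 1, 0]
print(one_hot(suffix_of("test.aaa.com", SUFFIXES), SUFFIXES))  # [1, 0, 0]
```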

Step S202: a sample feature vector of the sample data traffic is generated based on the sample feature information.

Wherein the number of channels of the sample feature vector is the same as the number of items of the sample feature information.

In one embodiment of the present invention, the sample feature information is at least one of TCP information, TLS information, DNS information, and HTTP information of the sample data traffic. Taking the case where the sample feature information comprises all four items as an example, the sample feature vector is generated by combining the TCP information, TLS information, DNS information, and HTTP information of the sample data traffic into one feature vector group, and this group is the sample feature vector. When no sample feature information is missing, the channel number of the sample feature vector is 4; if one item is missing, the channel number is 3; if two items are missing, the channel number is 2. In theory, DNS information and HTTP information are more likely to be missing, while TCP information and TLS information are less likely to be missing, so the channel number of the sample feature vector is usually 2, 3 or 4.

Specifically, each dimension in the sample feature vector is a channel, and each channel corresponds to one item of information in the sample feature information, for example: if the sample characteristic information is TCP information, TLS information, DNS information and HTTP information of the sample data flow, the number of channels of the sample characteristic vector is 4, the dimension is 4, and each channel corresponds to the TCP information, the TLS information, the DNS information and the HTTP information in the sample characteristic information respectively; if the sample feature information is TCP information, TLS information, and DNS information of the sample data flow, the number of channels of the sample feature vector is 3, the dimension is 3, and each channel corresponds to the TCP information, TLS information, and DNS information in the sample feature information, and so on.

In one embodiment of the present invention, the obtained items of sample feature information may differ in dimension. Before the sample feature vector is generated, their dimensions may be unified; specifically, every item may be brought to the dimension of the highest-dimensional item, and this may be done by padding with 0.

For example, suppose the TCP information has 372 dimensions, the TLS information 469 dimensions, the DNS information 50 dimensions, and the HTTP information 407 dimensions. The 372-dimensional TCP information, the 50-dimensional DNS information and the 407-dimensional HTTP information may each be padded with 0 so that all of them have 469 dimensions. The resulting sample feature vector then has 4 channels of 469 dimensions each, which may be written as [4, 469].
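The dimension unification in this example can be sketched as follows. The function name `build_sample_feature_vector`, the dictionary layout and the placeholder values are assumptions for illustration; only the dimensions 372, 469, 50 and 407 come from the example above.

```python
def build_sample_feature_vector(items):
    """Zero-pad each available item to the longest one and stack them.

    `items` maps an item name (e.g. "tcp") to its list of feature values;
    a missing item is simply absent from the mapping.
    Returns a list of channels, i.e. a [channels, width] nested list.
    """
    width = max(len(values) for values in items.values())
    return [list(values) + [0.0] * (width - len(values))
            for values in items.values()]


# Placeholder values; only the lengths follow the example in the text.
items = {
    "tcp": [1.0] * 372,
    "tls": [1.0] * 469,
    "dns": [1.0] * 50,
    "http": [1.0] * 407,
}
vector = build_sample_feature_vector(items)
shape = (len(vector), len(vector[0]))  # (4, 469)
```

If the HTTP item were absent from `items`, the same call would return a vector with 3 channels, matching the rule that the channel number equals the number of items acquired.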

Step S203: each channel in the sample feature vector is expanded to a target number of channels to obtain a channel-adjusted sample feature vector.

The target number is the quotient of the preset channel number and the channel number of the sample feature vector. The preset channel number is a common multiple of a run of consecutive natural numbers, and the maximum of the consecutive natural numbers is the channel number of the sample feature vector when no feature is missing.

In one embodiment of the present invention, the sample feature information is at least one of TCP information, TLS information, DNS information and HTTP information of the sample data traffic. When no feature is missing, the sample feature information includes all four items, and the channel number of the sample feature vector is 4. The maximum of the consecutive natural numbers is therefore 4, and the preset channel number is a common multiple of 1, 2, 3 and 4, such as 12, 24 or 48.

Taking a preset channel number of 12 as an example: if the channel number of the sample feature vector is 4, the target number is 3, and each channel in the sample feature vector is expanded to 3 channels, giving 12 channels in total in the channel-adjusted sample feature vector; if the channel number is 3, the target number is 4, and each channel is expanded to 4 channels, again giving 12 channels; if the channel number is 2, the target number is 6, and each channel is expanded to 6 channels, again giving 12 channels.
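The arithmetic in this example can be checked with a short sketch. The function names `preset_channel_count` and `target_number` are hypothetical; the computation follows the definitions given above (a common multiple of 1 through the maximum channel number, and the quotient of the preset channel number and the actual channel number).

```python
from functools import reduce
from math import gcd


def preset_channel_count(max_channels, multiple=1):
    """A common multiple of 1..max_channels (the least one when multiple=1)."""
    lcm = reduce(lambda a, b: a * b // gcd(a, b), range(1, max_channels + 1), 1)
    return lcm * multiple


def target_number(preset_channels, sample_channels):
    """Quotient of the preset channel number and the sample's channel number."""
    if preset_channels % sample_channels != 0:
        raise ValueError("preset channel number must be divisible by the channel number")
    return preset_channels // sample_channels


preset = preset_channel_count(4)  # least common multiple of 1, 2, 3, 4
for channels in (4, 3, 2):
    # Every case expands to the same total: channels * target == preset.
    print(channels, target_number(preset, channels))
```

Because the preset channel number is divisible by every possible channel number, the quotient is always an integer, which is why a common multiple is required.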

For example, the sample feature vector may be convolved so that each channel in the sample feature vector is expanded to the target number of channels, yielding the channel-adjusted sample feature vector. The convolution may be implemented using conv1d (a one-dimensional convolution layer).
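The embodiment names conv1d but does not fix the kernel size or the grouping. The following plain-Python sketch assumes a kernel size of 1, no bias, and an independent mapping of each input channel to its own target number of output channels (a depthwise-style expansion). The function name `expand_channels` and the weight values are placeholders; in training the weights would be learned.

```python
def expand_channels(sample, weights):
    """Expand every channel of `sample` into len(weights[c]) output channels.

    sample  : list of C channels, each a list of W values
    weights : list of C lists; weights[c][k] scales channel c to produce
              its k-th output channel (kernel-size-1 convolution, no bias)
    Returns a list of C * target channels, each of width W.
    """
    expanded = []
    for channel, channel_weights in zip(sample, weights):
        for w in channel_weights:
            expanded.append([w * x for x in channel])
    return expanded


PRESET_CHANNELS = 12  # assumed: least common multiple of 1, 2, 3, 4

sample = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 channels, width 2
target = PRESET_CHANNELS // len(sample)         # target number for 3 channels
weights = [[0.5 * (k + 1) for k in range(target)] for _ in sample]
adjusted = expand_channels(sample, weights)     # 12 channels, width 2
```

A sample with 2 or 4 channels would go through the same function with a different `target`, and the output would again have 12 channels.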

It can be seen that, in the above manner, the channel-adjusted sample feature vector has the same channel number, namely the preset channel number, no matter how many channels the sample feature vector has. That is, a channel-adjusted sample feature vector with a uniform channel number can be generated whether or not feature information is missing. In addition, the more items of feature information are missing, the larger the target number is; that is, the channel of each item that is present is expanded into more channels, occupies a larger share of the total channels of the channel-adjusted sample feature vector, and has a greater influence on the malicious traffic identification result. However, the channel of every present item is expanded to the same target number of channels, so all channels are expanded in equal proportion, and no channel ends up with a different share of the channel-adjusted sample feature vector than another.

In addition, the larger the preset channel number, the larger the quotient of the preset channel number and the channel number of the sample feature vector, that is, the larger the target number and the more channels the channel-adjusted sample feature vector has. This increases the amount of data handled when the channel-adjusted sample feature vector is processed later, so the subsequent processing consumes more computing resources. To save computing resources, the preset channel number may be set to the least common multiple of the consecutive natural numbers.

Specifically, the content described in step S203 may be performed in the data traffic identification model.

Step S204: a data traffic recognition model is trained based on the channel-adjusted sample feature vector.

The data traffic identification model is used for identifying whether the data traffic is malicious traffic or not.

Specifically, after the channel-adjusted sample feature vector is obtained, it is used to continue training the data traffic recognition model, which finally yields a recognition result indicating whether the sample data traffic is malicious traffic.

The subsequent training process may include: splicing the channels of the channel-adjusted sample feature vector to obtain a spliced feature vector, applying activation to the spliced feature vector, pooling the activated feature vector, convolving the pooled feature vector, applying activation to the convolved feature vector, passing the resulting feature vector through a fully connected layer, and performing binary classification on the fully connected result. Specifically, a classification function such as the sigmoid function (a mathematical function with an S-shaped curve) may be used to obtain a predicted value for the sample data traffic, where the predicted value indicates the probability that the sample data traffic is malicious. If the predicted value is greater than a preset threshold, the sample data traffic is identified as malicious traffic; otherwise it is identified as normal traffic. For example, the preset threshold may be 0.5 or 0.6: with a threshold of 0.5, sample data traffic whose predicted value is greater than 0.5 is identified as malicious; with a threshold of 0.6, sample data traffic whose predicted value is greater than 0.6 is identified as malicious.
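The final classification step can be sketched as follows, assuming the fully connected layer outputs a single raw score (logit). The function names and the example score of 0.3 are illustrative only.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def classify(logit, threshold=0.5):
    """Return (probability, label); the label is 'malicious' only when the
    probability is strictly greater than the threshold."""
    probability = sigmoid(logit)
    label = "malicious" if probability > threshold else "normal"
    return probability, label


# The same score can fall on different sides of different thresholds.
prob_a, label_a = classify(0.3, threshold=0.5)
prob_b, label_b = classify(0.3, threshold=0.6)
```

Here a score of 0.3 gives a predicted value of about 0.574, which is identified as malicious under a threshold of 0.5 and as normal under a threshold of 0.6.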

Specifically, after the recognition result of whether the sample data traffic is malicious traffic is obtained, the sample loss of the data traffic recognition model is calculated based on the recognition result, and the parameters of the data traffic recognition model are adjusted based on the sample loss. The process then returns to the step of expanding each channel in the sample feature vector to a target number of channels to obtain the channel-adjusted sample feature vector, and the data traffic recognition model is trained again on the channel-adjusted sample feature vector, until a preset training termination condition is reached and a trained data traffic recognition model is obtained.

Specifically, the above-described sample loss is calculated using a loss function. In one embodiment of the invention, BCEWithLogitsLoss (Binary Cross Entropy With Logits Loss) is used as the loss function. In addition, parameters of the data traffic recognition model may be adjusted based on the sample loss. In one embodiment of the present invention, Stochastic Gradient Descent (SGD) may be used: the gradient of the data traffic recognition model is calculated from the sample loss and used to update the weights, thereby adjusting the parameters of the data traffic recognition model.
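The loss and update rule can be illustrated with a minimal sketch. It assumes a single linear output unit standing in for the whole model, which is a simplification and not the network described in the embodiment; the function names and the learning rate are placeholders. The loss is written in the numerically stable form commonly used for binary cross entropy on logits.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross entropy computed directly from a logit (stable form)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def sgd_step(weights, bias, features, label, learning_rate=0.1):
    """One stochastic gradient descent update on a single sample."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    error = sigmoid(logit) - label  # derivative of the loss w.r.t. the logit
    new_weights = [w - learning_rate * error * x
                   for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


# One update on a single malicious sample (label 1.0) lowers its loss.
features, label = [1.0, 2.0], 1.0
weights, bias = [0.0, 0.0], 0.0
loss_before = bce_with_logits(0.0, label)
weights, bias = sgd_step(weights, bias, features, label)
logit_after = sum(w * x for w, x in zip(weights, features)) + bias
loss_after = bce_with_logits(logit_after, label)
```

Updating on one sample at a time matches the embodiment in which a single channel-adjusted sample feature vector is used for each training step.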

The parameters of the data traffic recognition model are adjusted repeatedly as described above until a preset training termination condition is reached. Specifically, the preset training termination condition may be a preset number of training iterations or a preset number of sample data flows. Once the preset training termination condition is reached, the trained data traffic recognition model is considered to have been obtained.

In one embodiment of the invention, only one sample feature vector is input into the data traffic recognition model at a time.

The channel number of an input sample feature vector equals the number of items of feature information of the sample data traffic. Because feature information may be missing, the number of items may differ between sample data flows, and so the channel number may differ between sample feature vectors. Since the data traffic recognition model expands each channel into the target number of channels during processing, and the target number changes with the channel number of the sample feature vector, the model may process different sample feature vectors in different ways. To accommodate this, only one sample feature vector is input into the data traffic recognition model at a time during training.

Specifically, the content described in step S204 may be performed in the data traffic identification model.

Existing data traffic recognition model training methods place a requirement on the dimension of the feature information of the acquired sample data flow; for example, they can only process sample feature information whose dimension equals a preset value. If the dimension of the acquired sample feature information does not equal the preset value, some parameter settings of the existing training method need to be adjusted, which complicates the identification process of the sample data flow. To solve this problem, an embodiment of the present invention provides the embodiment shown in fig. 3 on the basis of the embodiment shown in fig. 2.

Fig. 3 is a flow chart of a second data traffic recognition model training method according to an embodiment of the present invention. Compared with the embodiment shown in fig. 2, the foregoing step S203 may be implemented by the following steps S203A to S203C.

Step S203A: expand each channel in the sample feature vector to a target number of channels.

Specifically, the manner of expanding each channel in the sample feature vector to the target number of channels is the same as that described in step S203, and will not be described herein.

Step S203B: perform preset feature vector processing on the channel-expanded feature vector to obtain a processing result.

Specifically, the channel-expanded feature vector refers to the feature vector obtained in step S203A. The preset feature vector processing may include, for example: splicing the channels of the channel-expanded feature vector to obtain a spliced feature vector, performing activation processing on the spliced feature vector, performing pooling processing on the activated feature vector, performing convolution processing on the pooled feature vector, and performing activation processing on the convolved feature vector. The activation processing may be implemented using ReLU (Rectified Linear Unit), the convolution processing may be implemented using conv1d, and the pooling processing may be max pooling.

Specifically, the manner of max pooling the feature vector is shown in fig. 4. As can be seen from fig. 4, max pooling takes the maximum value within each local receptive field; in fig. 4, each local receptive field contains the two values 0.5 and 1, so the result of pooling each local receptive field is 1.
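As a small illustration of the max pooling just described, the sketch below pools a one-dimensional list of values. The function name `max_pool_1d` and the four-value input are assumptions made for illustration; only the pair of values 0.5 and 1 per window comes from the description of fig. 4.

```python
def max_pool_1d(values, kernel_size, stride=None):
    """Return the maximum of each window of `kernel_size` consecutive values."""
    stride = stride or kernel_size  # non-overlapping windows by default
    return [
        max(values[i:i + kernel_size])
        for i in range(0, len(values) - kernel_size + 1, stride)
    ]

# Two windows, each holding the values 0.5 and 1.0.
pooled = max_pool_1d([0.5, 1.0, 0.5, 1.0], kernel_size=2)
```

Each window of two values is reduced to its larger element, so the output is half as long as the input.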

In one embodiment of the present invention, the preset number of channels is 12, so the channel-expanded feature vector has 12 channels. If the dimension of each channel is 467, the channel-expanded feature vector may be represented as [12, 467]. When pooling the feature vector, the feature vector may first be transposed, that is, from [12, 467] to [467, 12]; pooling is then performed over the 12 channels, and the pooling result is represented as [467, 2], which achieves channel compression; finally, the vector is transposed again, and the feature vector obtained after pooling is represented as [2, 467].
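The transpose, pool, transpose sequence can be sketched in plain Python on nested lists. The helper names and the pooling window of 6 (12 channels divided by 2 output channels) are assumptions; the text states only the shapes [12, 467], [467, 12], [467, 2], and [2, 467].

```python
def transpose(matrix):
    """Swap the two axes of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]

def max_pool_last_axis(matrix, kernel_size):
    """Max-pool each row with non-overlapping windows of `kernel_size`."""
    return [
        [max(row[i:i + kernel_size])
         for i in range(0, len(row) - kernel_size + 1, kernel_size)]
        for row in matrix
    ]

def channel_max_pool(features, out_channels):
    """Compress the channel axis of a [channels, length] matrix."""
    channels = len(features)
    if channels % out_channels:
        raise ValueError("channel count must be divisible by out_channels")
    kernel_size = channels // out_channels      # 12 // 2 == 6 in this example
    by_position = transpose(features)           # [length, channels]
    pooled = max_pool_last_axis(by_position, kernel_size)  # [length, out_channels]
    return transpose(pooled)                    # [out_channels, length]

# Synthetic [12, 467] input; the values are placeholders, not real traffic features.
features = [[float(c * 1000 + i) for i in range(467)] for c in range(12)]
pooled = channel_max_pool(features, out_channels=2)
shape = (len(pooled), len(pooled[0]))
```

Pooling along the last axis after a transpose is what lets an ordinary one-dimensional pooling routine act on the channel axis.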

Step S203C: perform pyramid pooling on the processing result to obtain a one-dimensional vector, which is used as the channel-adjusted sample feature vector.

Pyramid pooling the processing result means pooling it layer by layer in the form of a layered pyramid. As the layer number increases, the pooling granularity is scaled step by step, and the pooling kernel size and stride are updated accordingly. The pooling result of each layer is flattened into one dimension, and the flattened results are concatenated to obtain the one-dimensional vector.

Specifically, the dimension of the kernel used in each layer of the pyramid pooling is matched to the dimension of the feature vector being processed. The kernel used in the first layer of pooling has the same dimension as the feature vector being processed, and the result obtained after the first layer of pooling has dimension 1. The kernel used in the second layer of pooling has half the dimension of the feature vector being processed, and the result obtained after the second layer of pooling has dimension 4. The kernel used in the third layer of pooling has one quarter of the dimension of the feature vector being processed, and the result obtained after the third layer of pooling has dimension 16. By analogy, the kernel used in each layer has half the dimension of the kernel used in the previous layer, and the result obtained after each layer has 4 times the dimension of the result obtained after the previous layer. The results obtained after each layer of pooling are concatenated to obtain the one-dimensional vector finally output by pyramid pooling.

If the number of pyramid layers is 3, pyramid pooling the processing result yields a one-dimensional vector of dimension 21 = 1 + 4 + 16, where 1 is the dimension of the first-layer pooling result in the pyramid pooling, 4 is the dimension of the second-layer pooling result, and 16 is the dimension of the third-layer pooling result.

If the number of pyramid layers is 5, pyramid pooling the processing result yields a one-dimensional vector of dimension 341 = 1 + 4 + 16 + 64 + 256, where 1 is the dimension of the first-layer pooling result in the pyramid pooling, 4 is the dimension of the second-layer pooling result, 16 is the dimension of the third-layer pooling result, 64 is the dimension of the fourth-layer pooling result, and 256 is the dimension of the fifth-layer pooling result.
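The two totals above follow a geometric series. Assuming, as the pattern 1, 4, 16, 64, 256 indicates, that layer i contributes 4^(i-1) values, the output dimension for n pyramid layers can be written as follows, where D(n) is a symbol introduced here only for illustration:

```latex
D(n) \;=\; \sum_{i=1}^{n} 4^{\,i-1} \;=\; \frac{4^{n}-1}{3},
\qquad D(3) = \frac{64-1}{3} = 21,
\qquad D(5) = \frac{1024-1}{3} = 341 .
```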

It can be seen that pyramid pooling places no restriction on the dimension of the data being processed. For example, whether the processing result is a vector of the form [32, 232], [32, 400], or [12, 400], as long as the number of pyramid layers is set to 5, the resulting one-dimensional vector is represented as [1, 341]. That is, once the number of pyramid layers is fixed, the dimension of the one-dimensional vector output by pyramid pooling is fixed, and this one-dimensional vector is used as the channel-adjusted sample feature vector in the subsequent steps.
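The fixed output length can be checked with a minimal sketch. It assumes the processing result is treated as a two-dimensional grid and that layer k (counting from 0) splits each axis into 2^k bins with max pooling; the bin boundaries use an adaptive floor/ceiling rule chosen for illustration, which may differ from the kernel and stride values used in the embodiments. All function names are hypothetical.

```python
def adaptive_max_pool_2d(grid, bins):
    """Split both axes into `bins` ranges and take the maximum of each cell block."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(bins):
        r0 = (r * rows) // bins                    # floor of the range start
        r1 = ((r + 1) * rows + bins - 1) // bins   # ceiling of the range end
        for c in range(bins):
            c0 = (c * cols) // bins
            c1 = ((c + 1) * cols + bins - 1) // bins
            out.append(max(grid[i][j] for i in range(r0, r1) for j in range(c0, c1)))
    return out

def pyramid_pool(grid, levels):
    """Concatenate the flattened pooling results of every pyramid layer."""
    flat = []
    for level in range(levels):
        flat.extend(adaptive_max_pool_2d(grid, 2 ** level))  # 4 ** level values
    return flat

def make_grid(rows, cols):
    """Synthetic placeholder data; the values carry no meaning."""
    return [[float((r * cols + c) % 97) for c in range(cols)] for r in range(rows)]

lengths = [len(pyramid_pool(make_grid(r, c), 5))
           for r, c in [(32, 232), (32, 400), (12, 400)]]
```

With this rule the ranges overlap slightly when an axis is shorter than the number of bins (for example 12 rows split into 16 bins), which is why the output length does not depend on the input shape.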

In an embodiment of the present invention, full-connection processing may be performed on the channel-adjusted sample feature vector, and binary classification may then be performed on the result of the full-connection processing. For the manner of performing the binary classification, refer to the content described in the foregoing step S203; details are not repeated here.

From the above, in the solution provided by the embodiment of the present invention, the dimension of the feature information of the acquired sample data flow does not need to be restricted. Once the number of pyramid layers is set, pyramid pooling yields a channel-adjusted sample feature vector of fixed dimension, so that the subsequent identification process of the sample data flow can continue. Therefore, the scheme provided by the embodiment of the invention is compatible with sample feature information of different dimensions, and configuration parameters of the data traffic recognition model training method do not need to be modified when the dimension of the sample feature information changes.

Fig. 5 is a flow chart of a third data traffic recognition model training method according to an embodiment of the present invention. Compared with the embodiment shown in fig. 2, the foregoing step S203 may be implemented by the following step S203D.

Step S203D: for each channel in the sample feature vector, expand the channel to a preset number of channels, and compress the preset number of channels obtained by the expansion into a target number of channels, to obtain the channel-adjusted sample feature vector.

As can be seen from the description in fig. 2, in the case where the sample characteristic information is at least one of TCP information, TLS information, DNS information, and HTTP information of the sample data traffic, the number of the above-mentioned preset channels may be 12, 24, 48, or the like. Taking the preset number of channels as 12 as an example, if the number of channels of the sample feature vector is 4, expanding each channel in the sample feature vector into 12 channels, and compressing the 12 channels obtained by expanding each channel into 3 channels.
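The arithmetic behind these numbers can be sketched as follows, assuming four kinds of feature information (TCP, TLS, DNS, HTTP) so that a sample feature vector has between 1 and 4 channels. The function names are hypothetical.

```python
from functools import reduce
from math import gcd

def lcm_up_to(max_channels):
    """Least common multiple of the consecutive natural numbers 1..max_channels."""
    return reduce(lambda a, b: a * b // gcd(a, b), range(1, max_channels + 1), 1)

def target_number(preset_channels, channels):
    """Quotient of the preset channel count and the sample's channel count."""
    if channels < 1 or preset_channels % channels:
        raise ValueError("preset channel count must be a multiple of the channel count")
    return preset_channels // channels

base_preset = lcm_up_to(4)  # smallest valid preset channel count for up to 4 channels
targets = {c: target_number(base_preset, c) for c in range(1, 5)}
totals = {c: c * targets[c] for c in targets}  # channels after adjustment
```

Because the preset channel count is divisible by every possible channel count, the product of the channel count and the target number is the same for every sample, whichever items of feature information are missing.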

In one embodiment of the present invention, each channel in the sample feature vector may be expanded into a preset number of channels by a one-dimensional convolution method, and then compressed by a channel pooling method.
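A minimal sketch of this expand-then-compress step is given below. It assumes a preset channel count of 12, a convolution without padding, one set of kernels shared by all input channels, and channel pooling by taking the maximum over groups of adjacent channels; the kernel values are fixed placeholders standing in for learned weights, and all names are hypothetical.

```python
PRESET_CHANNELS = 12
# Placeholder kernels of width 3; in a trained model these would be learned weights.
KERNELS = [[0.1 * (n + 1), 0.0, -0.05 * (n + 1)] for n in range(PRESET_CHANNELS)]

def conv1d_expand(channel, kernels):
    """Apply every kernel to one channel (no padding), giving len(kernels) channels."""
    width = len(kernels[0])
    return [
        [sum(w * channel[i + j] for j, w in enumerate(kernel))
         for i in range(len(channel) - width + 1)]
        for kernel in kernels
    ]

def channel_pool(channels, target):
    """Compress channels to `target` by max over groups of adjacent channels."""
    group = len(channels) // target
    return [
        [max(column) for column in zip(*channels[g * group:(g + 1) * group])]
        for g in range(target)
    ]

def adjust_channels(sample, kernels=KERNELS):
    """Expand each channel to the preset count, compress it, and splice the results."""
    preset = len(kernels)
    if preset % len(sample):
        raise ValueError("preset channel count must be a multiple of the channel count")
    target = preset // len(sample)
    adjusted = []
    for channel in sample:
        adjusted.extend(channel_pool(conv1d_expand(channel, kernels), target))
    return adjusted

def make_sample(channels, length=10):
    """Synthetic sample feature vector with the given number of channels."""
    return [[float(c + i) for i in range(length)] for c in range(channels)]

channel_counts = [len(adjust_channels(make_sample(c))) for c in (1, 2, 3, 4)]
```

Whether the sample has 1, 2, 3, or 4 channels, the spliced result in this sketch always has 12 channels, without padding the missing features with zeros.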

As can be seen from the above, by applying the scheme provided by the embodiment of the present invention, each channel in the sample feature vector can be expanded to a target number of channels, so that the channel-adjusted sample feature vector finally obtained has the preset number of channels. That is, whether or not any sample feature information is missing, the sample feature vector generated from the sample feature information can be expanded into a channel-adjusted sample feature vector with the same number of channels, on which the subsequent operations are performed.

Fig. 6 is a flowchart of a fourth data traffic recognition model training method according to an embodiment of the present invention. As can be seen from fig. 6, each channel in the sample feature vector is first expanded to a preset number of channels by a one-dimensional convolution layer; channel pooling is then performed on the preset number of channels obtained from each channel, compressing them to a target number of channels; and the preprocessed feature vector is then obtained by channel splicing. The preprocessed feature vector then passes through, in sequence, a rectified linear function such as ReLU, max pooling, a one-dimensional convolution layer, a rectified linear function such as ReLU, spatial pyramid pooling, a fully connected layer, and an S-shaped function such as sigmoid to obtain a predicted value, and whether the sample data flow is malicious traffic is identified based on the predicted value.

In one embodiment of the present invention, the trained data traffic recognition model is used to identify real data traffic to be identified in network communication, and whether the data traffic to be identified is malicious traffic is identified in the same manner as for the sample data traffic. Specifically, whether the data traffic to be identified is malicious traffic may be identified through the following steps A to D.

Step A: acquire the feature information to be identified of the data traffic to be identified.

Step B: generate the feature vector to be identified of the data traffic to be identified based on the feature information to be identified.

The number of channels of the feature vector to be identified is the same as the number of items of the feature information to be identified.

Step C: expand each channel in the feature vector to be identified to a target number of channels to obtain the channel-adjusted feature vector.

The target number is the quotient of the preset channel number and the channel number of the feature vector to be identified. The preset channel number is a common multiple of a run of consecutive natural numbers, the largest of which is the channel number of the feature vector to be identified when no feature is missing.

Step D: process the channel-adjusted feature vector using the trained data traffic recognition model to obtain an identification result indicating whether the data traffic to be identified is malicious traffic.

In addition, the above step C may be implemented by the following steps C1 to C3.

Step C1: expand each channel in the feature vector to be identified to a target number of channels.

Step C2: perform preset feature vector processing on the channel-expanded feature vector to obtain a processing result.

Step C3: perform pyramid pooling on the processing result to obtain a one-dimensional vector, which is used as the channel-adjusted feature vector.

Specifically, the specific implementation manner of the steps a to D and the steps C1 to C3 may refer to the content described in the foregoing embodiments, and will not be repeated herein.

Specifically, the data traffic to be identified may be the data traffic processed by a network device within a time period, and the duration of the time period may be set according to actual needs, for example to 3 minutes, 5 minutes, or 10 minutes. In particular, since the amount of DNS information in the data traffic to be identified is generally small, in one embodiment of the present invention the DNS information of historical data traffic may be retained, for example the DNS information of the data traffic of the last day or of the last three days. Specifically, the correspondence between the destination IP of each historical data flow and the DNS information of that data flow is recorded. The correspondence may be represented as a DNS mapping table with two columns, one holding the destination IP of the data flow and the other holding the DNS information of the data flow. The corresponding DNS information can then be looked up in the DNS mapping table using only the destination IP of the data traffic to be identified, and the DNS information found is used as part of the DNS information of the data traffic to be identified, so as to enlarge the amount of DNS information available for the data traffic to be identified.
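The DNS mapping table lookup can be sketched with a dictionary keyed by destination IP. The record layout (`dst_ip`, `dns`), the function name `enrich_dns`, and the sample addresses and domain names are hypothetical and chosen only for illustration.

```python
def enrich_dns(flow, dns_table):
    """Append historical DNS information for the flow's destination IP, if any."""
    merged = list(flow.get("dns", []))
    for record in dns_table.get(flow["dst_ip"], []):
        if record not in merged:  # avoid duplicating information already present
            merged.append(record)
    return {**flow, "dns": merged}

# Two-column mapping table: destination IP -> DNS information seen in past traffic.
dns_table = {
    "203.0.113.7": ["example.com", "cdn.example.com"],
    "198.51.100.2": ["example.org"],
}

flow = {"dst_ip": "203.0.113.7", "dns": ["example.com"]}
enriched = enrich_dns(flow, dns_table)
unknown = enrich_dns({"dst_ip": "192.0.2.9", "dns": []}, dns_table)
```

A flow whose destination IP is absent from the table is returned with its DNS information unchanged, so the lookup never removes information.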

Corresponding to the data flow identification model training method, the embodiment of the invention also provides a data flow identification model training device.

Referring to fig. 7, a schematic structural diagram of a training device for a data traffic recognition model according to an embodiment of the present invention is provided, where the device includes:

An obtaining module 701, configured to obtain sample characteristic information of a sample data flow.

A generating module 702, configured to generate, based on the sample feature information, a sample feature vector of the sample data flow, where the number of channels of the sample feature vector is the same as the number of items of the sample feature information.

An expansion module 703, configured to expand each channel in the sample feature vector to a target number of channels to obtain a channel-adjusted sample feature vector, where the target number is the quotient of a preset channel number and the channel number of the sample feature vector, and the preset channel number is a common multiple of a run of consecutive natural numbers, the largest of which is the channel number of the sample feature vector when no feature is missing.

A training module 704, configured to train a data traffic recognition model based on the channel-adjusted sample feature vector, where the data traffic recognition model is used to identify whether data traffic is malicious traffic.

In one embodiment of the present invention, the expansion module 703 is specifically configured to:

In one embodiment of the present invention, the sample characteristic information is at least one of TCP information, TLS information, DNS information, and HTTP information of the sample data traffic.

In one embodiment of the present invention, the data traffic recognition model is trained using one of the above-described channel-adjusted sample feature vectors at a time.

Referring to fig. 8, a schematic structural diagram of an electronic device for training a data traffic recognition model according to an embodiment of the present invention is shown. The electronic device includes a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 communicate with each other through the communication bus 804;

a memory 803 for storing a computer program;

the processor 801 is configured to implement any one of the foregoing data traffic identification model training methods when executing the program stored in the memory 803.

The communication bus of the electronic device mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the communication bus is represented by only one bold line in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the electronic device and other devices.

The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present invention, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor, implements the steps of any of the above-described data traffic recognition model training methods.

When the computer program stored in the computer readable storage medium provided by the embodiment of the invention is executed to train the data traffic recognition model, because the target number is the quotient of the preset channel number and the channel number of the sample feature vector, after each channel in the sample feature vector is expanded to the target number of channels, the total channel number of the resulting channel-adjusted sample feature vector is the preset channel number. Moreover, because the preset channel number is a common multiple of a run of consecutive natural numbers whose maximum value is the channel number of the sample feature vector when no feature is missing, the total channel number of the channel-adjusted sample feature vector is always the preset channel number, no matter how many items of feature information the sample data flow has. That is, a sample data flow with missing feature information can also be processed to obtain a channel-adjusted sample feature vector whose total channel number is the preset channel number, on which the subsequent operations are performed. In this process, the features of the sample data flow are not zero-padded, and sample data flows with missing features are not filtered out; each sample data flow is identified based on all of the feature information acquired for it, which can improve the accuracy with which the trained data traffic recognition model identifies malicious traffic.

In yet another embodiment of the present invention, a computer program product containing instructions that, when run on a computer, cause the computer to perform the data traffic recognition model training method of any of the above embodiments is also provided.

When the computer program product provided by the embodiment of the invention is executed to train the data traffic recognition model, because the target number is the quotient of the preset channel number and the channel number of the sample feature vector, after each channel in the sample feature vector is expanded to the target number of channels, the total channel number of the resulting channel-adjusted sample feature vector is the preset channel number. Moreover, because the preset channel number is a common multiple of a run of consecutive natural numbers whose maximum value is the channel number of the sample feature vector when no feature is missing, the total channel number of the channel-adjusted sample feature vector is always the preset channel number, no matter how many items of feature information the sample data flow has. That is, a sample data flow with missing feature information can also be processed to obtain a channel-adjusted sample feature vector whose total channel number is the preset channel number, on which the subsequent operations are performed. In this process, the features of the sample data flow are not zero-padded, and sample data flows with missing features are not filtered out; each sample data flow is identified based on all of the feature information acquired for it, which can improve the accuracy with which the trained data traffic recognition model identifies malicious traffic.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, electronic device, computer readable storage medium, and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant points, refer to the corresponding description of the method embodiments.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A method for training a data traffic recognition model, the method comprising:

acquiring sample characteristic information of sample data flow;

expanding each channel in the sample feature vector to a target number of channels to obtain a channel-adjusted sample feature vector, wherein the target number is a quotient of a preset channel number and the channel number of the sample feature vector, the preset channel number is a common multiple of a continuous natural number, and the maximum value of the continuous natural number is the channel number of the sample feature vector under the condition that no sample feature is missing;

2. The method of claim 1, wherein expanding each channel in the sample feature vector to a target number of channels to obtain a channel-adjusted sample feature vector comprises:

3. The method of claim 1, wherein said expanding each channel in the sample feature vector to a target number of channels comprises:

4. The method of claim 1, wherein the sample characteristic information is at least one of Transmission Control Protocol (TCP) information, Transport Layer Security (TLS) information, Domain Name System (DNS) information, and Hypertext Transfer Protocol (HTTP) information of the sample data traffic.

5. The method of any one of claims 1-4, wherein the data traffic recognition model is trained using one of the channel-adjusted sample feature vectors at a time.

6. A data traffic recognition model training apparatus, the apparatus comprising:

the expansion module is used for expanding each channel in the sample feature vector to a target number of channels to obtain a sample feature vector with channels adjusted, wherein the target number is a quotient of a preset channel number and the channel number of the sample feature vector, the preset channel number is a common multiple of a continuous natural number, and the maximum value of the continuous natural number is the channel number of the sample feature vector under the condition that no sample feature is missing;

7. The apparatus of claim 6, wherein the expansion module is specifically configured to:

8. The apparatus of claim 6, wherein the expansion module is specifically configured to:

9. The apparatus of claim 6, wherein the sample characteristic information is at least one of Transmission Control Protocol (TCP) information, Transport Layer Security (TLS) information, Domain Name System (DNS) information, and Hypertext Transfer Protocol (HTTP) information of the sample data traffic.

10. The apparatus of any of claims 6-9, wherein the data traffic recognition model is trained using one of the channel-adjusted sample feature vectors at a time.

11. An electronic device for training a data traffic recognition model, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any one of claims 1-5 when executing a program stored on a memory.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-5.