CN112637292A - Data processing method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112637292A
Authority
CN
China
Prior art keywords
data
packet
information
data packet
traffic data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011468191.6A
Other languages
Chinese (zh)
Other versions
CN112637292B (en)
Inventor
张英华
柴智
王斌
刘慧�
李萌
孟令栋
龚晓雪
杜永刚
于宝彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN202011468191.6A
Publication of CN112637292A
Application granted
Publication of CN112637292B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H04L63/0435 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload wherein the sending and receiving network entities apply symmetric encryption, i.e. same key used for encryption and decryption
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3263 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials involving certificates, e.g. public key certificate [PKC] or attribute certificate [AC]; Public key infrastructure [PKI] arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application provides a data processing method, a data processing device, an electronic device and a storage medium. The method comprises the following steps: obtaining a traffic data packet to be processed; preprocessing the traffic data packet to obtain configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the traffic data packet; and inputting the configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the traffic data packet into a preset data classification model, and outputting a classification result of the traffic data packet. In the method, an acquired HTTPS traffic data packet is preprocessed to obtain its unencrypted data information, and the unencrypted data information is input into the preset data classification model to identify and classify the HTTPS traffic data packet. This solves the technical problem that a DPI parser in the prior art cannot identify and classify traffic data of the HTTPS protocol.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of traffic identification technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
The identification and classification of traffic data is an effective means of network management. With the development of internet technology, however, network transmission protocols have become diverse. Among them, Hypertext Transfer Protocol Secure (HTTPS) is an encrypted protocol for ensuring the secure transmission of web page data; it adds a TLS/SSL security layer on top of the Hypertext Transfer Protocol (HTTP). At present, to ensure the security of data transmission, many websites adopt the HTTPS protocol for data transmission, and it is therefore of great significance to identify and classify HTTPS traffic data.
In the prior art, traffic data is identified and classified by a Deep Packet Inspection (DPI) parser in order to manage the traffic data and to comprehensively understand the operating condition of each network application.
However, when the HTTPS protocol is used for data transmission, the traffic data is encrypted, and current DPI parsers cannot identify and classify it.
Disclosure of Invention
The application provides a data processing method, a data processing device, an electronic device and a storage medium, which are used for solving the technical problem that a DPI (deep packet inspection) analyzer in the prior art cannot identify and classify flow data of an HTTPS (hypertext transfer protocol secure) protocol.
A first aspect of the present application provides a data processing method, including:
obtaining a flow data packet to be processed;
preprocessing the flow data packet to obtain configuration information, packet length and time sequence information, byte distribution information and non-encrypted data header information corresponding to the flow data packet;
and inputting configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the flow data packet into a preset data classification model, and outputting a classification result of the flow data packet.
Optionally, preprocessing the traffic data packet to obtain configuration information, packet length and time sequence information, byte distribution information, and unencrypted header information corresponding to the traffic data packet, includes:
adopting a class selector to perform class extraction processing on the traffic data packet to obtain traffic data of the traffic data packet under different classes;
and carrying out vector processing on the flow data under different types to obtain configuration information, packet length and time sequence information, byte distribution information and non-encrypted data header information corresponding to the flow data packets.
Optionally, for the traffic data of the categories CliSuites, CliExtensions, SerSuite, and SerExtension, vector processing is performed on the traffic data of each category by adopting a W2V algorithm to obtain a multidimensional vector corresponding to the traffic data of each category;
and integrating the multidimensional vectors to obtain multidimensional vectors after dimensionality integration, wherein the multidimensional vectors after dimensionality integration are used as the unencrypted data header information.
Optionally, obtaining packet length and time sequence information corresponding to the traffic data packet according to the traffic data under different categories, including:
and aiming at the traffic data under the packet length and time sequence categories, carrying out vector processing on the traffic data by adopting a Markov chain algorithm to obtain vectors corresponding to the packet length and the time sequence categories, wherein the vectors are used as packet length and time sequence information corresponding to the traffic data packets.
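The Markov chain step can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how packet lengths or arrival intervals are discretized, so the bin widths, the number of bins, the function name `markov_transition_vector`, and the sample values are all hypothetical. The sketch maps each observation to a bin, counts transitions between consecutive bins, normalizes each row, and flattens the resulting transition matrix into a vector.

```python
from typing import List, Sequence


def markov_transition_vector(values: Sequence[float], bin_width: float, n_bins: int) -> List[float]:
    """Flatten a row-normalized first-order Markov transition matrix.

    Assumes non-negative observations; bin_width and n_bins are
    illustrative choices, not values taken from the source.
    """
    # Map each observation to a discrete state; large values fall into the last bin.
    states = [min(int(v // bin_width), n_bins - 1) for v in values]

    # Count transitions between consecutive states.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for cur, nxt in zip(states, states[1:]):
        counts[cur][nxt] += 1

    # Normalize each row to transition probabilities (all-zero rows stay zero).
    vector: List[float] = []
    for row in counts:
        total = sum(row)
        vector.extend([c / total if total else 0.0 for c in row])
    return vector


packet_lengths = [120, 1460, 1460, 80, 1460]  # bytes, made-up sample
inter_arrival_ms = [0, 12, 3, 40, 5]          # milliseconds, made-up sample

length_vec = markov_transition_vector(packet_lengths, bin_width=500, n_bins=3)
time_vec = markov_transition_vector(inter_arrival_ms, bin_width=10, n_bins=3)

# Assumption: the two flattened matrices are concatenated into one feature vector.
length_and_time_vector = length_vec + time_vec
```

With three bins each, the two flattened matrices contribute nine values apiece; a real deployment would choose bin sizes to suit the traffic being observed.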
Optionally, obtaining byte distribution information corresponding to the traffic data packet according to the traffic data under different categories, including:
and counting the byte distribution frequency of the flow data under the byte distribution category, obtaining the occurrence frequency of the bytes, and taking the frequency as byte distribution information.
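The byte counting step amounts to a histogram over the 256 possible byte values. The following minimal sketch assumes the traffic data under the byte distribution category is available as a `bytes` object; the function name `byte_distribution` and the sample payload are illustrative only.

```python
from collections import Counter
from typing import List


def byte_distribution(payload: bytes) -> List[int]:
    """Return a 256-element list where index b holds the count of byte value b."""
    counts = Counter(payload)  # iterating over bytes yields integers 0..255
    return [counts.get(b, 0) for b in range(256)]


sample = bytes.fromhex("160301160303")  # made-up six-byte payload
dist = byte_distribution(sample)
```

The resulting fixed-length vector can be used directly as the byte distribution information, independent of the payload size.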
Optionally, the method further comprises:
constructing a data classification model and acquiring training samples; the training samples comprise a plurality of pieces of sample traffic data and the data classification obtained by labeling each piece of sample traffic data;
and training the data classification model by using the training sample, and taking the trained data classification model as the preset data classification model.
A second aspect of the present application provides a data processing apparatus comprising:
the acquisition module is used for acquiring a flow data packet to be processed;
the processing module is used for preprocessing the flow data packet to obtain configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the flow data packet;
and the identification module is used for inputting the configuration information, the packet length and time sequence information, the byte distribution information and the unencrypted data header information corresponding to the flow data packet into a preset data classification model and outputting the classification result of the flow data packet.
Optionally, the processing module is specifically configured to perform class extraction processing on the traffic data packet by using a class selector, so as to obtain traffic data of the traffic data packet in different classes; and carrying out vector processing on the flow data under different types to obtain configuration information, packet length and time sequence information, byte distribution information and non-encrypted data header information corresponding to the flow data packets.
Optionally, the processing module is specifically configured to perform vector processing, by using a W2V algorithm, on the traffic data of each of the categories CliSuites, CliExtensions, SerSuite, and SerExtension, and obtain a multidimensional vector corresponding to the traffic data of each category;
and integrating the multidimensional vectors to obtain multidimensional vectors after dimensionality integration, wherein the multidimensional vectors after dimensionality integration are used as the unencrypted data header information.
The processing module is specifically configured to perform vector processing on the traffic data in the packet length and time sequence categories by using a markov chain algorithm to obtain vectors corresponding to the packet length and the time sequence categories, where the vectors are used as packet length and time sequence information corresponding to the traffic data packets.
The processing module is specifically configured to perform statistics on byte distribution frequency of the traffic data under the byte distribution category, obtain byte occurrence times, and use the times as byte distribution information.
The device also comprises a model training module, wherein the model training module is used for constructing a data classification model and acquiring training samples; the training samples comprise a plurality of pieces of sample traffic data and the data classification obtained by labeling each piece of sample traffic data; and training the data classification model by using the training samples, and taking the trained data classification model as the preset data classification model.
A third aspect of the present application provides an electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the method as set forth in the first aspect above and in various possible designs of the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement a method as set forth in the first aspect and various possible designs of the first aspect.
According to the data processing method, the data processing device, the electronic equipment and the storage medium, the traffic data packet to be processed is obtained; preprocessing the flow data packet to obtain configuration information, packet length and time sequence information, byte distribution information and non-encrypted data header information corresponding to the flow data packet; and inputting configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the flow data packet into a preset data classification model, and outputting a classification result of the flow data packet. According to the data processing method provided by the scheme, the acquired HTTPS traffic data packet is preprocessed to obtain the unencrypted data information of the HTTPS traffic data packet, and the unencrypted data information is input into the preset data classification model to identify and classify the HTTPS traffic data packet.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art according to these drawings.
FIG. 1 is a block diagram of a data processing system according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart illustrating an exemplary process for establishing a data connection between a client and a server according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of another data processing method according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic flowchart illustrating an exemplary data processing method according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of another data processing method according to an embodiment of the present application;
FIG. 8 is a schematic flowchart of another data processing method according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 10 is a block diagram of an exemplary data classification model provided in an embodiment of the present application;
FIG. 11 is a schematic overall flowchart of an exemplary data processing method according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the following examples, "plurality" means two or more unless specifically limited otherwise.
Currently, in order to ensure the security of data transmission, many websites adopt the HTTPS protocol for data transmission. When data traffic needs to be identified and classified, the prior art uses a DPI (deep packet inspection) parser to identify and classify the traffic data so as to manage it and to comprehensively understand the operating condition of each network application. However, current DPI parsers are unable to identify and classify traffic data of the HTTPS protocol.
In view of the above problems, the data processing method, the data processing apparatus, the electronic device, and the storage medium provided in the embodiments of the present application perform preprocessing on an acquired traffic data packet to obtain configuration information, packet length and time series information, byte distribution information, and unencrypted header information corresponding to the traffic data packet, input the configuration information, the packet length and time series information, the byte distribution information, and the unencrypted header information corresponding to the traffic data packet into a preset data classification model, and output a classification result of the traffic data packet. According to the data processing method provided by the scheme, the acquired HTTPS traffic data packet is preprocessed to obtain the unencrypted data information of the HTTPS traffic data packet, and the unencrypted data information is input into the preset data classification model to identify and classify the HTTPS traffic data packet.
The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
First, a network structure on which the present application is based will be explained:
The data processing method provided by the embodiment of the application is suitable for identifying and classifying HTTPS traffic data. FIG. 1 is a schematic structural diagram of a data processing system on which an embodiment of the present application is based. The system may include a data transmission device for performing data transmission, and a data processing device for collecting and processing traffic data packets of the data transmission device. Specifically, the data processing device may preprocess a collected traffic data packet to obtain its unencrypted data information, identify and classify the traffic data packet according to the obtained unencrypted data information, and obtain its classification result.
Example one
The present embodiment provides a data processing method, which is suitable for identifying and classifying HTTPS traffic data packets. The execution subject of the embodiment is the data processing device corresponding to the method.
FIG. 2 is a schematic flowchart of the data processing method provided in this embodiment. The method includes:
Step 101, obtaining a traffic data packet to be processed.
Step 102, preprocessing the traffic data packet to obtain configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the traffic data packet.
It should be noted that, while a client and a server establish a data connection, they go through a handshake phase in which the client requests and verifies the public key of the server and the two sides negotiate a session key. Apart from a small amount of generated random-number key material, most of the communication data in this phase is plaintext data.
Step 103, inputting the configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the traffic data packet into a preset data classification model, and outputting a classification result of the traffic data packet.
The preset data classification model can be constructed based on a convolutional neural network and is then trained so as to obtain a preset data classification model with high accuracy.
Specifically, the configuration information, the packet length and time sequence information, the byte distribution information and the unencrypted header information corresponding to the traffic data packet are input to a preset data classification model, the traffic data packet is identified and classified based on the preset data classification model, and finally, a classification result corresponding to the traffic data packet is output.
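The hand-off from step 102 to step 103 can be sketched as follows. This is a structural illustration only: the class `PacketFeatures`, the function `classify`, and the `placeholder_model` are hypothetical names, the source does not state how the four groups of information are combined (plain concatenation is assumed here), and the placeholder stands in for the trained convolutional model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PacketFeatures:
    """The four groups of information produced by preprocessing (step 102)."""
    configuration: List[float]
    length_and_timing: List[float]
    byte_distribution: List[float]
    unencrypted_header: List[float]

    def as_model_input(self) -> List[float]:
        # Assumption: the four groups are simply concatenated into one flat vector.
        return (self.configuration + self.length_and_timing
                + self.byte_distribution + self.unencrypted_header)


def classify(features: PacketFeatures, model: Callable[[Sequence[float]], str]) -> str:
    """Step 103: feed the combined features to a classification model."""
    return model(features.as_model_input())


def placeholder_model(x: Sequence[float]) -> str:
    # Stand-in for a trained model; the threshold rule has no real meaning.
    return "class_a" if sum(x) > 1.0 else "class_b"


features = PacketFeatures([1.0], [0.5, 0.5], [0.0, 2.0], [0.1])
result = classify(features, placeholder_model)
```

Keeping the model behind a plain callable means the trained classifier can be swapped in without changing the preprocessing code.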
Preferably, the traffic data packet to be processed may be obtained by capturing the communication data exchanged between the client and the server while they establish a data connection. The communication data at this stage includes a large amount of plaintext data, and the captured plaintext data serves as the traffic data packet to be processed.
Fig. 3 is a schematic flow chart illustrating a process of establishing a data connection between an exemplary client and a server according to this embodiment.
Illustratively, the client sends a request to the server. The server must hold a digital certificate, which can be self-issued or applied for from an organization. The difference is that a self-issued certificate must be accepted by the client before access can continue, whereas a certificate applied for from a trusted company does not cause a prompt page to pop up. The certificate corresponds to a pair of keys, a public key and a private key. The server transmits the certificate, which carries the public key, to the client. The certificate also contains other information, such as the issuing organization of the certificate, the expiration time, the public key of the server, the signature of a third-party Certificate Authority (CA), and the domain name information of the server. The client parses the certificate. This work is done by the Transport Layer Security (TLS) layer of the client, which first verifies whether the public key is valid, for example by checking the issuing organization and the expiration time; if an abnormality is found, a warning box pops up to indicate that the certificate has a problem. If the certificate has no problem, the client generates a random value (the key) and encrypts the random value with the certificate. The client then transmits the encrypted information, which is the random value encrypted with the certificate, so that the server obtains the random value and the subsequent communication between the client and the server can be encrypted and decrypted with it. The server decrypts the received information with its private key to obtain the random value transmitted by the client, and then symmetrically encrypts the content with this value. The server transmits the encrypted information, which has been encrypted with the random value and can be restored at the client.
The client decrypts the information transmitted by the server by using the random value it generated, and thereby obtains the decrypted content.
Preferably, as an implementable manner, FIG. 4 is a schematic flowchart of another data processing method provided in this embodiment, in which preprocessing the traffic data packet to obtain configuration information, packet length and time sequence information, byte distribution information, and unencrypted data header information corresponding to the traffic data packet includes:
Step 1021, adopting a class selector to perform class extraction processing on the traffic data packet to obtain traffic data of the traffic data packet under different classes;
Step 1022, performing vector processing on the traffic data in different categories to obtain configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the traffic data packet.
The class selector may be a DPI parser.
Specifically, the DPI parser may be adopted to analyze and process the acquired traffic data packet to be processed according to the data category to extract plaintext data therein, process and classify the acquired plaintext data to obtain traffic data of the traffic data packet under each category, and store the traffic data under each category in a classified manner. And performing vector processing on the stored traffic data under each category to obtain configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the traffic data packet. And finally, inputting configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the flow data packet into a preset data classification model, identifying and classifying the flow data packet based on the preset data classification model, and obtaining a classification result.
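To illustrate what extracting plaintext data from the handshake can look like, the sketch below reads the cipher suite list and the extension types from a TLS ClientHello handshake message. It is not the DPI parser of this application: the function name `parse_client_hello` and the hand-built sample message are hypothetical, the field layout follows the TLS 1.2 ClientHello structure, and treating these fields as the source of the CliSuites and CliExtensions categories is an assumption.

```python
import struct
from typing import List, Tuple


def parse_client_hello(msg: bytes) -> Tuple[List[int], List[int]]:
    """Return (cipher suite IDs, extension type IDs) from a ClientHello.

    `msg` is the handshake message without the TLS record header.
    Minimal sketch: no bounds checking beyond what is needed for the demo.
    """
    if msg[0] != 0x01:
        raise ValueError("not a ClientHello")
    pos = 4                 # skip handshake type (1 byte) and length (3 bytes)
    pos += 2 + 32           # skip client_version and random
    pos += 1 + msg[pos]     # skip session_id length byte and session_id

    (cs_len,) = struct.unpack_from("!H", msg, pos)
    pos += 2
    suites = [struct.unpack_from("!H", msg, pos + i)[0] for i in range(0, cs_len, 2)]
    pos += cs_len

    pos += 1 + msg[pos]     # skip compression methods

    extensions: List[int] = []
    if pos < len(msg):      # the extensions block is optional
        (ext_total,) = struct.unpack_from("!H", msg, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", msg, pos)
            extensions.append(ext_type)
            pos += 4 + ext_len
    return suites, extensions


# Hand-built sample: TLS 1.2, zero random, empty session id,
# two cipher suites, null compression, two extensions.
body = (
    b"\x03\x03" + bytes(32) + b"\x00"
    + b"\x00\x04" + b"\xc0\x2f" + b"\x00\x9c"
    + b"\x01\x00"
    + b"\x00\x0b"
    + b"\x00\x0b\x00\x02\x01\x00"   # extension type 0x000b, 2 data bytes
    + b"\xff\x01\x00\x01\x00"       # extension type 0xff01, 1 data byte
)
message = b"\x01" + len(body).to_bytes(3, "big") + body
suites, extensions = parse_client_hello(message)
```

Because these fields are sent before encryption starts, they can be read without any key material, which is what makes them usable as classification features.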
Illustratively, the following table lists the main categories of traffic data in this embodiment:
[Table: main categories of traffic data; provided as an image (Figure BDA0002835281170000081) in the original publication, contents not recoverable here]
Optionally, the traffic data packet is processed to determine a classification result thereof, and the operation condition of each network application is determined according to the classification result.
Further, the operator may provide a targeted traffic service to the user according to the determined operation condition of each network application. The operator can also legally monitor and manage each network application according to the determined running condition of each network application, so that the user is effectively prevented from being tricked by illegal websites or software, and the security of the network environment can be guaranteed.
In the data processing method provided in this embodiment, the obtained traffic data packet is preprocessed to obtain the unencrypted data information of the traffic data packet, and the unencrypted data information is input into the preset data classification model, so that the traffic data packet is identified and classified.
Example two
When the vector processing is performed on the traffic data under each category, different vector processing modes can be adopted for the traffic data under different categories in order to improve the vectorization effect of the traffic data under each category.
FIG. 5 is a schematic flowchart of a data processing method provided in this embodiment. As an implementable manner, on the basis of the foregoing embodiment, optionally, obtaining the unencrypted data header information corresponding to the traffic data packet according to the traffic data in different categories includes:
Step 10221, for the traffic data of the categories CliSuites, CliExtensions, SerSuite and SerExtension, performing vector processing on the traffic data of each category by adopting a W2V algorithm to obtain a multidimensional vector corresponding to the traffic data of each category;
Step 10222, integrating the multidimensional vectors to obtain a dimension-integrated multidimensional vector, where the dimension-integrated multidimensional vector is used as the unencrypted data header information.
Optionally, to facilitate subsequent calculation and storage, dimension reduction processing is performed on the traffic data of each category. In the W2V processing, the binary-coded traffic data of the categories CliSuites, CliExtensions, SerSuite, and SerExtension is first converted into hexadecimal coding form for storage; the hexadecimal-coded traffic data of each category then forms a word library, each word library is converted into lower-dimensional vectors, and finally the lower-dimensional vectors corresponding to the traffic data of each category are stored.
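The conversion from binary-coded data to hexadecimal words can be sketched as follows. The word width of two bytes, the function name `to_hex_words`, and the sample bytes are assumptions made for illustration; the source does not specify how the hexadecimal data is split into words.

```python
from typing import List


def to_hex_words(raw: bytes, width: int = 2) -> List[str]:
    """Split raw bytes into fixed-width chunks and render each chunk as a hex word."""
    return [raw[i:i + width].hex() for i in range(0, len(raw), width)]


cli_suites_raw = bytes.fromhex("c02fc030009c")  # made-up sample: three 2-byte values
words = to_hex_words(cli_suites_raw)

# The word library (vocabulary) is the set of distinct words seen in this category.
vocabulary = sorted(set(words))
```

Each category would build its own vocabulary in this way before the words are mapped to lower-dimensional vectors.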
Specifically, based on the DPI parser, traffic data of CliSuites, CliExtensions, SerSuite, and SerExtension categories in the traffic data packet to be processed are extracted and classified and stored. Based on the W2V algorithm, the stored traffic data of CliSuites, CliExtensions, SerSuite and SerExtension are respectively subjected to vector processing to obtain the multidimensional vector corresponding to the traffic data of each category. And finally, integrating the multidimensional vectors to obtain multidimensional vectors after dimensionality integration, and taking the multidimensional vectors after dimensionality integration as unencrypted data header information.
As shown in fig. 6, a flowchart of an exemplary data processing method provided in this embodiment is schematically shown, where the flowchart shown in fig. 6 may be a specific implementation manner of the flowchart shown in fig. 5.
Exemplarily, the traffic data in the CliSuites category is first vectorized into a 30-dimensional vector; the traffic data in the CliExtensions category is vectorized into a 50-dimensional vector; the traffic data in the SerSuite category is vectorized into a 1-dimensional vector; and the traffic data in the SerExtension category is vectorized into a 30-dimensional vector, each based on the W2V algorithm. Finally, the vectors for all categories are integrated into a single 111-dimensional vector (30 + 50 + 1 + 30 = 111), which is used as the unencrypted data header information.
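The dimension integration in this example can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of this embodiment: `placeholder_embed` is a hypothetical stand-in that derives a deterministic pseudo-vector from a hash where a real system would use a trained W2V model, and the 2-byte hexadecimal word width is likewise assumed. Only the per-category dimensions (30, 50, 1, 30) and the integration into 111 dimensions come from the text.

```python
import hashlib

# Per-category vector sizes taken from the example in the text.
CATEGORY_DIMS = {
    "CliSuites": 30,
    "CliExtensions": 50,
    "SerSuite": 1,
    "SerExtension": 30,
}


def to_hex_tokens(raw: bytes, width: int = 2) -> list[str]:
    """Split raw bytes into fixed-width hexadecimal 'words' (width in bytes, assumed)."""
    return [raw[i:i + width].hex() for i in range(0, len(raw), width)]


def placeholder_embed(tokens: list[str], dim: int) -> list[float]:
    """Hypothetical stand-in for a trained W2V model.

    Produces a deterministic vector of length `dim` from SHA-256 hashes of
    the tokens; it carries no semantic meaning and only fixes the shape.
    """
    vector = []
    for k in range(dim):
        digest = hashlib.sha256(f"{k}:{' '.join(tokens)}".encode()).digest()
        vector.append(digest[0] / 255.0)
    return vector


def integrate(fields: dict[str, bytes]) -> list[float]:
    """Concatenate the per-category vectors in a fixed category order."""
    header_vector: list[float] = []
    for category, dim in CATEGORY_DIMS.items():
        tokens = to_hex_tokens(fields.get(category, b""))
        header_vector.extend(placeholder_embed(tokens, dim))
    return header_vector
```

Keeping the category order fixed means that each position of the integrated vector always corresponds to the same category, which a downstream classifier relies on.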
Fig. 7 is a schematic flow chart of another data processing method provided in this embodiment. As a practical way, on the basis of the foregoing embodiment, optionally, obtaining packet length and time series information corresponding to a traffic data packet according to traffic data in different categories includes:
step 10223, performing vector processing on the traffic data in the packet length and time sequence category by using a markov chain algorithm to obtain a vector corresponding to the packet length and time sequence category, wherein the vector is used as packet length and time sequence information corresponding to the traffic data packet.
Illustratively, an array a of length 10 is constructed based on the Markov chain algorithm, with elements a_i, where a_i indicates the number of packets in a stream whose packet length lies in [i × 150, (i + 1) × 150], for i = 0, 1, 2, ..., 9. Similarly, an array b of length 10 is constructed, with elements b_j, where b_j indicates the number of packets in a stream whose packet length lies in [j × 150, (j + 1) × 150], for j = 10, 11, 12, ..., 19. Further arrays are constructed by analogy. Finally, the at least one array of length 10 thus obtained is processed to obtain a vector or matrix corresponding to the traffic data in the packet length and time sequence category, and this vector or matrix is used as the packet length and time sequence information corresponding to the traffic data packet.
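The counting described in this example can be sketched as below. The sketch covers only the packet-length binning; the function name, the `num_arrays` parameter, and the half-open upper bound of each bin are assumptions made for illustration (the closed intervals in the text would place a boundary length in two bins).

```python
BIN_WIDTH = 150      # bytes per bin, from the text
ARRAY_LENGTH = 10    # bins per array, from the text


def packet_length_arrays(lengths: list[int], num_arrays: int = 2) -> list[list[int]]:
    """Count packets per 150-byte bin and group the bins into arrays of 10.

    Bin k covers [k * 150, (k + 1) * 150); the half-open upper bound is an
    assumption so that a boundary length falls into exactly one bin.
    Lengths outside the covered range are ignored in this sketch.
    """
    total_bins = num_arrays * ARRAY_LENGTH
    counts = [0] * total_bins
    for length in lengths:
        k = length // BIN_WIDTH
        if 0 <= k < total_bins:
            counts[k] += 1
    return [
        counts[n * ARRAY_LENGTH:(n + 1) * ARRAY_LENGTH]
        for n in range(num_arrays)
    ]
```

With `num_arrays=2`, the first returned list plays the role of array a (i = 0 to 9) and the second the role of array b (j = 10 to 19).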
Fig. 8 is a schematic flow chart of another data processing method provided in this embodiment. As an implementable manner, on the basis of the foregoing embodiments, optionally, obtaining byte distribution information corresponding to a traffic data packet according to traffic data in different categories includes:
step 10224, for the traffic data in byte distribution category, counting byte distribution frequency, and obtaining byte occurrence frequency, which is used as byte distribution information.
Illustratively, since payload in a data stream is stored in the form of bytes, and the range of data which can be represented by the bytes is between 0 and 255, according to the distribution condition of the data of the payload in the range of 0 to 255 in each stream, counting the byte distribution frequency of the data, obtaining the occurrence number of the bytes, and finally taking the number as byte distribution information.
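A minimal sketch of this byte counting is given below. The function names are illustrative; the normalized variant is an assumption added for cases where relative frequencies rather than raw occurrence counts are wanted.

```python
def byte_counts(payload: bytes) -> list[int]:
    """Return how many times each byte value 0..255 occurs in the payload."""
    counts = [0] * 256
    for value in payload:  # iterating over bytes yields ints in 0..255
        counts[value] += 1
    return counts


def byte_frequencies(payload: bytes) -> list[float]:
    """Return the relative frequency of each byte value (all zeros if empty)."""
    total = len(payload)
    if total == 0:
        return [0.0] * 256
    return [count / total for count in byte_counts(payload)]
```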
Based on the foregoing embodiment, the data processing method provided in this embodiment adopts different vector processing manners for the traffic data in different categories, for example, obtains unencrypted header information corresponding to the traffic data packet based on the W2V algorithm, obtains packet length and time series information corresponding to the traffic data packet based on the markov chain algorithm, and obtains byte distribution information based on statistics on byte distribution frequency, thereby effectively improving vectorization effect of the traffic data in each category.
Example three
To improve the accuracy of data processing, this embodiment trains the data classification model so that the classification result output by the preset data classification model is more accurate.
As shown in fig. 9, a schematic flow chart of the data processing method provided in this embodiment is shown. As a practical manner, on the basis of the foregoing embodiments, optionally, the method further includes:
step 301, constructing a data classification model, and collecting and obtaining training samples; the training sample comprises a plurality of sample flow data and data classification obtained by labeling each sample flow data;
step 302, training the data classification model by using the training samples, and taking the trained data classification model as a preset data classification model.
Specifically, a data classification model is constructed based on a convolutional neural network, and a training sample is obtained, wherein the training sample comprises a plurality of sample flow data. The data classification of the multiple sample flow data is known, each sample flow data is labeled according to the known data classification, and finally the training sample is used for training the data classification model. When the accuracy of the classification result output by the data classification model reaches a preset identification standard, determining the current configuration parameter as an optimal configuration parameter, namely, completing the training of the data classification model, and taking the trained (under the optimal configuration parameter) data classification model as a preset data classification model.
Fig. 10 is a schematic structural diagram of an exemplary data classification model provided in this embodiment. The data classification model comprises an input layer, convolutional layers, pooling layers, a fully connected layer, and an output layer.
For example, a feature vector corresponding to each obtained unencrypted data information is input at the input layer, where the feature vector has an m × 1-dimensional structure, and the feature vector is obtained by processing unencrypted header information (111 dimensions), packet length and time series information (100 dimensions), byte distribution information (100 dimensions), and public key length (1 dimension), and for example, the feature vector may be:
[1,1,0,0,0,1,0,0,0,…1,0,1,0,0,0,1,227,…,253,…,252,…,239,…233,…,256]
since most of the vectors are sparse vectors, the libsvm format is used for storage, and the data classification labeling is performed on the feature vectors, for example, the first column is a feature type (1 represents that the feature is a certain network application):
[15|1700:1,1:1,2:1,3:1,4:1,5:1,6:1,7:1,8:1,9:1,10:1,11:1,12:1,266:1,267:1,268:1,269:1,270:1,271:1,272:1,310:1,328:1,329:1,330:1,331:1,332:1]
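Storing a sparse feature vector in libsvm style can be sketched as follows. The sketch assumes the common libsvm text layout, `<label> <index>:<value> ...` with 1-based indices and space separators; the example above is displayed in a slightly different layout, so the exact on-disk format used by this embodiment is not confirmed here.

```python
def to_libsvm_line(label: int, features: list[float]) -> str:
    """Encode a dense vector as one libsvm line, omitting zero entries."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):  # libsvm indices start at 1
        if value != 0:
            parts.append(f"{index}:{value}")
    return " ".join(parts)


def from_libsvm_line(line: str, dim: int) -> tuple[int, list[float]]:
    """Decode one libsvm line back into (label, dense vector of length dim)."""
    label_text, *pairs = line.split()
    features = [0.0] * dim
    for pair in pairs:
        index_text, value_text = pair.split(":")
        features[int(index_text) - 1] = float(value_text)
    return int(label_text), features
```

Because only non-zero entries are written, a mostly-zero feature vector takes far less space than its dense form.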
After the feature vectors are labeled with their data classification, three convolutional layers (con_layer1, con_layer2, con_layer3) perform feature extraction. These layers convolve the original feature vector with convolution kernels to obtain a feature map.
The three convolutional layers mainly use convolution kernels of size 3 × 1 or 5 × 1 to convolve the input vector (the kernel size should be tuned according to actual results). After the convolution, an activation function is applied to the result as a nonlinear excitation; the activation function used here is ReLU.
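A one-dimensional convolution followed by ReLU can be sketched in plain Python as below. Stride 1, no padding, and an unflipped kernel (cross-correlation, as in most CNN libraries) are assumptions of this sketch; the text does not specify them.

```python
def conv1d_valid(signal: list[float], kernel: list[float]) -> list[float]:
    """Slide the kernel over the signal with stride 1 and no padding.

    The kernel is not flipped, so this is cross-correlation, which is what
    CNN libraries conventionally call convolution.
    """
    k = len(kernel)
    return [
        sum(signal[start + offset] * kernel[offset] for offset in range(k))
        for start in range(len(signal) - k + 1)
    ]


def relu(values: list[float]) -> list[float]:
    """Apply ReLU element-wise: negative values become 0."""
    return [max(0.0, v) for v in values]
```

With a 3 × 1 kernel and no padding, an input of length m yields a feature map of length m - 2.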
Three pooling layers (pool_layer1, pool_layer2, pool_layer3) are used to reduce the dimension of the feature maps. This is done with max-over-time pooling, with one pooling layer following each convolutional layer. The pooling layer uses a filter to extract several feature values, keeps only the largest as the pooled value, and discards the rest; that is, only the strongest feature is retained and weaker features are dropped. Its main function is to reduce the dimension of the feature map (a high-dimensional vector) produced by each convolutional layer, to facilitate subsequent calculation and storage. Assuming a 4 × 1 filter is used, the result of this step is an (m/4) × 1-dimensional feature vector.
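The 4 × 1 max pooling step can be sketched as follows. Non-overlapping windows and dropping a trailing partial window are assumptions of this sketch, chosen so that the output length is m/4 as stated in the text.

```python
def max_pool1d(values: list[float], size: int = 4) -> list[float]:
    """Non-overlapping max pooling: keep the largest value in each window.

    A trailing partial window is dropped, so the output has
    len(values) // size entries.
    """
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]
```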
The fully connected layer (fc_layer) includes dropout and a softmax classifier. It receives the matrix output by pool_layer3 and feeds it into the softmax classifier to obtain the classification result. Its main role is to connect all the features and send the output values to the classifier for classification and identification. The classification result is output as a vector whose dimension equals the number of classes and whose elements are the probabilities that the data belongs to the corresponding class.
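The softmax step that turns the fully connected layer's outputs into class probabilities can be sketched as below. The function names are illustrative, and subtracting the maximum before exponentiation is a standard numerical-stability measure rather than something stated in the text.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: list[float]) -> tuple[int, list[float]]:
    """Return the index of the most probable class and the full probability vector."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```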
Finally, the output layer outputs the classification result, which is verified using ten-fold cross-validation and a regularized logistic regression algorithm. When the classification result is determined to be sufficiently accurate, that is, when it reaches the preset identification standard, the current model configuration parameters are taken as the optimal configuration parameters and training of the model is complete. The trained data classification model is then used as the preset data classification model to identify and classify traffic data packets.
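The ten-fold cross-validation split and the comparison against a preset standard can be sketched as follows. The round-robin fold assignment, the use of mean accuracy as the criterion, and the `threshold` parameter are assumptions for illustration; the text does not specify how the preset identification standard is defined, and the regularized logistic regression check is not covered here.

```python
def k_fold_indices(num_samples: int, k: int = 10) -> list[tuple[list[int], list[int]]]:
    """Return (train_indices, validation_indices) pairs for k folds.

    Samples are assigned to folds round-robin; shuffle the data beforehand
    if it is ordered by class.
    """
    folds = [list(range(f, num_samples, k)) for f in range(k)]
    splits = []
    for f in range(k):
        validation = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        splits.append((train, validation))
    return splits


def meets_standard(fold_accuracies: list[float], threshold: float) -> bool:
    """Check whether the mean accuracy over all folds reaches the threshold (assumed criterion)."""
    return sum(fold_accuracies) / len(fold_accuracies) >= threshold
```

Each sample appears in exactly one validation fold, so every labeled sample is used once for verification and nine times for training.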
For example, as shown in fig. 11, which is a schematic overall flow chart of the exemplary data processing method provided in this embodiment, the flow chart shown in fig. 11 may be a specific implementation manner of the flow chart shown in fig. 2.
Illustratively, taking TLS data traffic as the traffic data packets, a TLS traffic sample library comprising a plurality of traffic data packets is first created. Each traffic data packet is processed by the DPI parser to obtain its unencrypted data, and category extraction is performed on the unencrypted data. This yields the uplink byte count, downlink byte count, uplink packet count, downlink packet count, and duration in the Flow Metadata; the length sequence and interval time sequence in the Sequence of Packets; the byte distribution probability in the Byte Distribution; and the TLS encryption suite, TLS extension, and public key length in the TLS header (TLS header information). These are input into the data classification model constructed from the convolutional neural network, and the classification result is obtained from the output of the model.
On the basis of the foregoing embodiments, the data processing method provided in this embodiment trains the constructed data classification model, and verifies the classification result output by the data classification model, so that the classification result output by the preset classification model is more accurate, and the reliability of the classification result is effectively improved.
Example four
The present embodiment provides a data processing apparatus for executing the data processing method provided in the above embodiment.
As shown in fig. 12, a schematic structural diagram of the data processing apparatus provided in this embodiment is shown. The data processing device 40 comprises an acquisition module 41, a processing module 42 and an identification module 43.
The acquisition module is configured to acquire a traffic data packet to be processed. The processing module is configured to preprocess the traffic data packet to obtain configuration information, packet length and time sequence information, byte distribution information, and unencrypted data header information corresponding to the traffic data packet. The identification module is configured to input the configuration information, the packet length and time sequence information, the byte distribution information, and the unencrypted data header information corresponding to the traffic data packet into a preset data classification model and output the classification result of the traffic data packet.
Optionally, the processing module is specifically configured to:
adopting a class selector to perform class extraction processing on the traffic data packet to obtain traffic data of the traffic data packet under different classes; and carrying out vector processing on the flow data under different types to obtain configuration information, packet length and time sequence information, byte distribution information and non-encrypted data header information corresponding to the flow data packets.
Optionally, the processing module is specifically configured to perform vector processing on each type of traffic data by using the W2V algorithm, for the traffic data of the CliSuites, CliExtensions, SerSuite, and SerExtension types, to obtain a multidimensional vector corresponding to each type of traffic data; and to integrate the multidimensional vectors to obtain a dimension-integrated multidimensional vector, which is used as the unencrypted data header information.
Optionally, the processing module is specifically configured to perform vector processing on the traffic data in the packet length and time sequence category by using a Markov chain algorithm to obtain a vector corresponding to the packet length and time sequence category, which is used as the packet length and time sequence information corresponding to the traffic data packet.
Optionally, the processing module is specifically configured to count the byte distribution frequency of the traffic data in the byte distribution category to obtain the number of occurrences of each byte, which is used as the byte distribution information.
Optionally, the apparatus further comprises a model training module configured to construct a data classification model and collect training samples, where the training samples comprise a plurality of sample traffic data and the data classification obtained by labeling each sample traffic data; and to train the data classification model using the training samples and take the trained data classification model as the preset data classification model.
With regard to the apparatus in this embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
The data processing apparatus provided in this embodiment may be configured to execute the data processing method provided in the foregoing embodiment, and the implementation manner and principle of the data processing apparatus are the same, and are not described in detail herein.
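The cooperation of the three modules can be sketched as a small Python class. This is a structural illustration only: the class name, field names, and callable signatures are hypothetical, and the stub functions in the usage example stand in for real acquisition, preprocessing, and classification logic.

```python
from dataclasses import dataclass
from typing import Callable

# Named feature groups produced by preprocessing (illustrative type alias).
Features = dict[str, list[float]]


@dataclass
class DataProcessingApparatus:
    acquire: Callable[[], bytes]             # acquisition module
    preprocess: Callable[[bytes], Features]  # processing module
    classify: Callable[[Features], str]      # identification module

    def run(self) -> str:
        """Acquire one traffic data packet, preprocess it, and classify it."""
        packet = self.acquire()
        features = self.preprocess(packet)
        return self.classify(features)
```

For example, wiring in stub callables:

```python
apparatus = DataProcessingApparatus(
    acquire=lambda: b"\x16\x03",
    preprocess=lambda packet: {"byte_distribution": [float(len(packet))]},
    classify=lambda features: "tls" if features["byte_distribution"][0] == 2.0 else "other",
)
result = apparatus.run()
```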
Example five
The present embodiment provides an electronic device for executing the method provided by the above embodiment.
As shown in fig. 13, is a schematic structural diagram of the electronic device provided in this embodiment. The electronic device 50 includes: at least one processor 51 and memory 52;
the memory stores computer-executable instructions; the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform a method as provided by any of the embodiments above.
The electronic device according to this embodiment may be configured to execute the data processing method provided in the foregoing embodiments, and the implementation manner and principle thereof are the same and will not be described again.
Example six
The present embodiment provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the processor executes the computer-executable instructions, the method provided in any one of the above embodiments is implemented.
According to the computer-readable storage medium of this embodiment, the computer-executable instructions of the data processing method provided in the foregoing embodiments can be stored, and the implementation manner is the same as the principle, and is not described again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A data processing method, comprising:
obtaining a flow data packet to be processed;
preprocessing the flow data packet to obtain configuration information, packet length and time sequence information, byte distribution information and non-encrypted data header information corresponding to the flow data packet;
and inputting configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the flow data packet into a preset data classification model, and outputting a classification result of the flow data packet.
2. The data processing method of claim 1,
preprocessing the flow data packet to obtain configuration information, packet length and time sequence information, byte distribution information and non-encrypted data header information corresponding to the flow data packet, including:
adopting a class selector to perform class extraction processing on the traffic data packet to obtain traffic data of the traffic data packet under different classes;
and carrying out vector processing on the flow data under different types to obtain configuration information, packet length and time sequence information, byte distribution information and non-encrypted data header information corresponding to the flow data packets.
3. The data processing method according to claim 2, wherein obtaining the unencrypted header information corresponding to the traffic data packet according to the traffic data in different categories comprises:
for the traffic data of CliSuites, CliExtensions, SerSuite and SerExtension types, vector processing is carried out on each type of traffic data by adopting a W2V algorithm to obtain a multidimensional vector corresponding to each type of traffic data;
and integrating the multidimensional vectors to obtain multidimensional vectors after dimensionality integration, wherein the multidimensional vectors after dimensionality integration are used as the unencrypted data header information.
4. The data processing method according to claim 2, wherein obtaining packet length and time series information corresponding to the traffic data packet according to the traffic data in different categories comprises:
and aiming at the traffic data under the packet length and time sequence categories, carrying out vector processing on the traffic data by adopting a Markov chain algorithm to obtain vectors corresponding to the packet length and the time sequence categories, wherein the vectors are used as packet length and time sequence information corresponding to the traffic data packets.
5. The data processing method according to claim 2, wherein obtaining byte distribution information corresponding to the traffic data packets according to the traffic data in different categories comprises:
and counting the byte distribution frequency of the flow data under the byte distribution category, obtaining the occurrence frequency of the bytes, and taking the frequency as byte distribution information.
6. The data processing method according to any one of claims 1 to 5, further comprising:
constructing a data classification model, and acquiring and obtaining a training sample; the training sample comprises a plurality of sample flow data and data classification obtained by labeling each sample flow data;
and training the data classification model by using the training sample, and taking the trained data classification model as the preset data classification model.
7. A data processing apparatus, comprising:
the acquisition module is used for acquiring a flow data packet to be processed;
the processing module is used for preprocessing the flow data packet to obtain configuration information, packet length and time sequence information, byte distribution information and unencrypted data header information corresponding to the flow data packet;
and the identification module is used for inputting the configuration information, the packet length and time sequence information, the byte distribution information and the unencrypted data header information corresponding to the flow data packet into a preset data classification model and outputting the classification result of the flow data packet.
8. The data processing apparatus according to claim 7, wherein the processing module is specifically configured to:
adopting a class selector to perform class extraction processing on the traffic data packet to obtain traffic data of the traffic data packet under different classes; and carrying out vector processing on the flow data under different types to obtain configuration information, packet length and time sequence information, byte distribution information and non-encrypted data header information corresponding to the flow data packets.
9. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of any of claims 1-6.
10. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1 to 6.
CN202011468191.6A 2020-12-14 2020-12-14 Data processing method and device, electronic equipment and storage medium Active CN112637292B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011468191.6A CN112637292B (en) 2020-12-14 2020-12-14 Data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112637292A true CN112637292A (en) 2021-04-09
CN112637292B CN112637292B (en) 2022-11-22

Family

ID=75312894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011468191.6A Active CN112637292B (en) 2020-12-14 2020-12-14 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112637292B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102739457A (en) * 2012-07-23 2012-10-17 武汉大学 Network flow recognition system and method based on DPI (Deep Packet Inspection) and SVM (Support Vector Machine) technology
US20170374016A1 (en) * 2016-06-23 2017-12-28 Cisco Technology, Inc. Utilizing service tagging for encrypted flow classification
CN108768986A (en) * 2018-05-17 2018-11-06 中国科学院信息工程研究所 A kind of encryption traffic classification method and server, computer readable storage medium
CN109450740A (en) * 2018-12-21 2019-03-08 青岛理工大学 SDN controller for carrying out traffic classification based on DPI and machine learning algorithm
CN110213227A (en) * 2019-04-24 2019-09-06 华为技术有限公司 A kind of network data flow detection method and device
CN110784465A (en) * 2019-10-25 2020-02-11 新华三信息安全技术有限公司 Data stream detection method and device and electronic equipment
CN111277587A (en) * 2020-01-19 2020-06-12 武汉思普崚技术有限公司 Malicious encrypted traffic detection method and system based on behavior analysis
CN111901300A (en) * 2020-06-24 2020-11-06 武汉绿色网络信息服务有限责任公司 Method and device for classifying network traffic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MATTI HIRVONEN ET AL: "Two-phased method for identifying SSH encrypted application flows", 《2011 7TH INTERNATIONAL WIRELESS COMMUNICATIONS AND MOBILE COMPUTING CONFERENCE》 *
李洋等: "基于深度报文检测和机器学习的加密流量识别方法", 《计算机产品与流通》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114221874A (en) * 2021-12-14 2022-03-22 平安壹钱包电子商务有限公司 Traffic analysis and scheduling method and device, computer equipment and readable storage medium
CN114221874B (en) * 2021-12-14 2023-11-14 平安壹钱包电子商务有限公司 Traffic analysis and scheduling method and device, computer equipment and readable storage medium
CN114254171A (en) * 2021-12-20 2022-03-29 湖北天融信网络安全技术有限公司 Data classification method, model training method, device, terminal and storage medium
CN114254171B (en) * 2021-12-20 2024-07-23 湖北天融信网络安全技术有限公司 Data classification method, model training method, device, terminal and storage medium
CN115001994A (en) * 2022-07-27 2022-09-02 北京天融信网络安全技术有限公司 Method, device, equipment and medium for classifying flow data packet


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant