CN114254171A - Data classification method, model training method, device, terminal and storage medium - Google Patents

Data classification method, model training method, device, terminal and storage medium

Info

Publication number
CN114254171A
Authority
CN
China
Prior art keywords
preset
data
byte
sample data
possible value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111566209.0A
Other languages
Chinese (zh)
Inventor
谢鹏程
李渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Hubei Topsec Network Security Technology Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Hubei Topsec Network Security Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd, Hubei Topsec Network Security Technology Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202111566209.0A priority Critical patent/CN114254171A/en
Publication of CN114254171A publication Critical patent/CN114254171A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/907Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/908Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/24Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/0618Block ciphers, i.e. encrypting groups of characters of a plain text message using fixed encryption transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides a data classification method, a model training method, a device, a terminal and a storage medium. For each preset possible value of a single byte, occurrence frequency information corresponding to that value is determined according to a combined selection result of the sample data to be detected; a target data feature vector is determined according to the occurrence frequency information corresponding to each preset possible value; and whether the sample data to be detected is plaintext data or ciphertext data is determined based on the target data feature vector and a preset plaintext-ciphertext data classification model. Identification and classification of encrypted data and plaintext data in network traffic are thereby realized, and because the target data feature vector is constructed based on occurrence frequency information, the construction is simple and the amount of computation is small.

Description

Data classification method, model training method, device, terminal and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data classification method, a model training method, an apparatus, a terminal, and a storage medium.
Background
With the rapid development of information technology, new applications continue to emerge and network traffic grows rapidly, which brings great pressure and challenges to network traffic analysis. To ensure effective and secure network management and to monitor network traffic effectively, network supervision agencies need to identify, classify and control various kinds of network traffic data. Encrypted data accounts for a considerable proportion of network traffic, so how to identify encrypted data and plaintext data in network traffic has become a technical problem that urgently needs to be solved.
Disclosure of Invention
An object of the embodiments of the present application is to provide a data classification method, a model training method, an apparatus, a terminal, and a storage medium, which are used to implement identification of encrypted data and plaintext data in network traffic.
The embodiment of the application provides a data classification method, which comprises the following steps:
obtaining sample data to be tested;
according to a preset byte selection rule, sequentially performing combination selection of two bytes from the sample data to be detected;
aiming at each preset possible value corresponding to a single byte, determining occurrence frequency information corresponding to the preset possible value according to a combined selection result of the sample data to be detected;
determining a target data characteristic vector corresponding to the sample data to be detected according to each occurrence frequency information;
and inputting the target data feature vector into a preset plaintext-ciphertext data classification model to obtain a classification result of the sample data to be detected.
In the implementation process, for each preset possible value of a single byte, the occurrence frequency information corresponding to that preset possible value is determined according to the combined selection result of the sample data to be detected; a target data feature vector is determined according to the occurrence frequency information corresponding to each preset possible value; and whether the sample data to be detected is plaintext data or ciphertext data is determined based on the target data feature vector and a preset plaintext-ciphertext data classification model. Identification and classification of encrypted data and plaintext data in network traffic are thus realized, and because the target data feature vector is constructed based on occurrence frequency information, the construction is simple and the amount of computation is small.
Further, the determining, for each preset possible value corresponding to a single byte, frequency of occurrence information corresponding to the preset possible value according to a combined selection result of the sample data to be tested includes:
aiming at each preset possible value corresponding to a single byte, determining a first occurrence frequency sum of all first preset byte combinations corresponding to the preset possible value and/or a second occurrence frequency sum of all second preset byte combinations corresponding to the preset possible value according to a combination selection result of the sample data to be detected; the first preset byte combination is a combination of which the preset possible value is located at a first bit in the byte combination, and the second preset byte combination is a combination of which the preset possible value is located at a second bit in the byte combination;
the determining the target data feature vector corresponding to the sample data to be detected according to each occurrence frequency information includes:
and determining a target data characteristic vector corresponding to the sample data to be detected according to the first occurrence frequency sum and/or the second occurrence frequency sum corresponding to each preset possible value.
In the implementation process, the target data feature vector is constructed according to the first occurrence frequency sum and/or the second occurrence frequency sum corresponding to the preset possible value, so that the computation amount is small, and the efficiency is high.
Further, the determining, for each preset possible value corresponding to a single byte, a first frequency sum of all first preset byte combinations corresponding to the preset possible value and/or a second frequency sum of all second preset byte combinations corresponding to the possible value according to a combination selection result of the sample data to be detected includes:
aiming at two corresponding values selected from the sample data to be tested in a combined mode at the kth time, calculating a corresponding byte transfer matrix Hk;HkFor transferring bytes to matrix Hk-1(m) ofk+1,nk+1) plus 1, H00 matrix, m, representing 256 × 256kIndicates the value, n, corresponding to the first byte of the two bytes selected by the k-th combinationkAnd the value corresponding to the second byte in the two bytes selected by the k-th combination is shown.
After completing the combined selection of the sample data to be detected according to the preset byte selection rule and calculating to obtain a final target byte transfer matrix, adding elements of the target byte transfer matrix according to rows to obtain the first occurrence frequency sum corresponding to each preset possible value, and adding elements of the target byte transfer matrix according to columns to obtain the second occurrence frequency sum corresponding to each preset possible value;
determining a target data feature vector corresponding to the sample data to be tested according to the first frequency sum and/or the second frequency sum corresponding to each preset possible value, including:
splicing the first frequency sum and the second frequency sum corresponding to each preset possible value to obtain an intermediate data feature vector;
and obtaining the target data characteristic vector according to the intermediate data characteristic vector.
In the implementation process, a 512-dimensional target data feature vector is obtained based on the 256 × 256 byte transfer matrix; the amount of computation is small, and the classification and identification efficiency can be improved.
Further, the determining, for each preset possible value corresponding to a single byte, frequency of occurrence information corresponding to the preset possible value according to a combined selection result of the sample data to be tested includes:
aiming at each preset possible value corresponding to a single byte, determining a first occurrence frequency of each first preset byte combination corresponding to the preset possible value and a second occurrence frequency of each second preset byte combination corresponding to the preset possible value according to a combination selection result of the sample data to be detected; the first preset byte combination is a combination of which the preset possible value is located at a first bit in the byte combination, and the second preset byte combination is a combination of which the preset possible value is located at a second bit in the byte combination;
the determining the target data feature vector corresponding to the sample data to be detected according to each occurrence frequency information includes:
and determining a target data characteristic vector corresponding to the sample data to be detected according to the first occurrence frequency and the second occurrence frequency corresponding to each preset possible value.
In the implementation process, the target data feature vector is constructed according to the first occurrence frequency and the second occurrence frequency corresponding to each preset possible value. In this case, the occurrence frequency of each byte combination also represents how often the previous byte value changes into the next byte value within the combination; constructing the target data feature vector based on this information to distinguish plaintext data from ciphertext data makes the classification result more accurate.
Further, the determining the target data feature vector corresponding to the sample data to be detected according to each occurrence frequency information includes:
and normalizing the frequency of occurrence information according to the combined selection times of the sample data to be detected, and determining a target data characteristic vector corresponding to the sample data to be detected based on the normalized information.
In the implementation process, the occurrence frequency information is normalized based on the combined selection times to obtain the corresponding frequency information, so that the identification accuracy can be improved.
Further, the sequentially performing combination selection of two bytes from the sample data to be detected according to a preset byte selection rule includes:
and sequentially taking two continuous bytes from the sample data to be detected until the sample data to be detected is traversed.
In the implementation process, two continuous bytes are sequentially selected from the sample data to be detected until the sample data to be detected is traversed, the combined selection of all the bytes of the sample data to be detected is completed, and the accuracy of the identification result can be ensured.
The embodiment of the application also provides a plaintext-ciphertext data classification model training method, which comprises the following steps:
acquiring training sample data with a plaintext label and training sample data with a ciphertext label;
sequentially performing combination selection of two bytes from the training sample data according to a preset byte selection rule aiming at each training sample data, and determining occurrence frequency information corresponding to each preset possible value in the training sample data according to a combination selection result of the training sample data aiming at each preset possible value corresponding to a single byte;
determining a data characteristic vector corresponding to each training sample data according to the occurrence frequency information corresponding to each training sample data;
and training according to the data characteristic vector to obtain a plaintext-ciphertext data classification model for distinguishing plaintext data from ciphertext data.
An embodiment of the present application further provides a data classification apparatus, including:
the acquisition module is used for acquiring sample data to be detected;
the selecting module is used for sequentially performing combined selection of two bytes from the sample data to be detected according to a preset byte selection rule;
the first determining module is used for determining occurrence frequency information corresponding to each preset possible value corresponding to a single byte according to a combined selection result of the sample data to be detected;
a second determining module, configured to determine, according to each occurrence frequency information, a target data feature vector corresponding to the sample data to be detected;
and the classification module is used for inputting the target data feature vector into a preset plaintext-ciphertext data classification model to obtain a classification result of the sample data to be detected.
Further, an embodiment of the present application further provides a terminal, which includes a processor and a memory, where the memory stores a computer program, and the processor executes the computer program to implement any one of the methods described above.
Further, an embodiment of the present application also provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by at least one processor, the computer program implements the method described in any one of the above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flowchart of a data classification method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a plaintext-ciphertext data classification model training method according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a data classification method according to a second embodiment of the present application;
fig. 4 is a schematic structural diagram of a data classification apparatus according to a third embodiment of the present application;
fig. 5 is a schematic structural diagram of a terminal according to a fourth embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions relating to "first", "second", etc. in the embodiments of the present invention are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
In the description of the present invention, it should be understood that the numerical references before the steps do not identify the order of performing the steps, but merely serve to facilitate the description of the present invention and to distinguish each step, and thus should not be construed as limiting the present invention.
Various embodiments will be provided below to describe a data classification method, a model training method, an apparatus, a terminal, and a storage medium in detail.
The first embodiment is as follows:
in order to realize identification of encrypted data and plaintext data in network traffic, an embodiment of the present application provides a data classification method, which is shown in fig. 1 and includes the following steps:
s101: and acquiring sample data to be detected.
S102: and according to a preset byte selection rule, sequentially performing combined selection of two bytes from the sample data to be detected.
In step S102, byte values in the sample data to be detected are combined pairwise according to a preset byte selection rule. Specifically, the values corresponding to the i-th byte and the (i + n)-th byte may be taken in sequence from the sample data to be detected as one combination until the sample data to be detected has been traversed, where i and n are integers greater than or equal to 1. Preferably, n is equal to 1; in this case two consecutive bytes are taken in sequence from the sample data to be detected, which yields more combinations and makes the classification result more accurate.
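As a minimal illustrative sketch (not part of the patent; the function name select_byte_pairs is a hypothetical choice), the selection rule described above can be read as follows in Python, with n = 1 corresponding to taking two consecutive bytes:

    def select_byte_pairs(data: bytes, n: int = 1):
        # Take the values of the i-th and (i + n)-th bytes as one combination,
        # in sequence, until the sample data to be detected has been traversed.
        return [(data[i], data[i + n]) for i in range(len(data) - n)]

    # e.g. select_byte_pairs(bytes([10, 20, 30, 40])) -> [(10, 20), (20, 30), (30, 40)]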
S103: and aiming at each preset possible value corresponding to a single byte, determining the occurrence frequency information corresponding to the preset possible value according to the combined selection result of the sample data to be detected.
It should be noted that step S102 and step S103 do not have a necessary sequence, and in some embodiments, step S102 and step S103 may be executed simultaneously, or corresponding occurrence frequency information may be determined after all combinations are selected.
Generally speaking, the value corresponding to each single byte is between [0, 255], so in this embodiment, 256 possible values may be preset for a single byte, namely {0, 1, …, 255}. In step S103, according to the combined selection result of the sample data to be detected, the occurrence frequency information corresponding to each preset possible value in {0, 1, …, 255} is determined.
S104: and determining a target data characteristic vector corresponding to the sample data to be detected according to the occurrence frequency information.
For ease of understanding, the above steps S103 and S104 are explained in detail below with specific assumptions.
Assume that the preset byte selection rule is: and sequentially taking two continuous bytes from the sample data to be detected until the sample data to be detected is traversed.
Assume that the value corresponding to each byte of the sample data to be tested is {1, 2, 5, 6, 4}. Based on the preset byte selection rule, the following combined selection result can be obtained: {(1,2), (1,5), (1,6), (1,4), (2,5), (2,6), (2,4), (5,6), (5,4), (6,4)}.
It should be noted that, in this embodiment, the sample data to be detected is assumed only for convenience of understanding and description; in actual application scenarios, the sample data to be detected often contains many more bytes.
Example one:
in this example, step S103 includes:
and aiming at each preset possible value corresponding to a single byte, determining a first occurrence frequency sum of all first preset byte combinations corresponding to the preset possible value according to a combination selection result of the sample data to be detected, wherein the first preset byte combination is a combination of which the preset possible value is positioned at the first position in the byte combination.
At this time, the corresponding step S104 includes:
and determining a target data characteristic vector corresponding to the sample data to be detected according to the first occurrence frequency sum corresponding to each preset possible value.
For each preset possible value, when the preset possible value is located at the first bit of a byte combination, there are 256 possibilities for the second bit of the combination. Therefore, preferably, 256 first preset byte combinations may be set for each preset possible value; for example, for the preset possible value 0, the corresponding first preset byte combinations may be (0,0), (0,1), (0,2), …, (0,255). In essence, step S103 counts, for the 256 first preset byte combinations corresponding to each preset possible value, the total number of times these combinations occur when the bytes of the sample data to be detected are selected according to the preset byte selection rule, and takes this total as the first occurrence frequency sum.
Based on the scenario assumed above, for the preset possible value 1, the corresponding first preset byte combinations are (1,0), (1,1), (1,2), …, (1,255). When the combination selection is performed on the sample data to be detected according to the preset byte selection rule, only the 4 first preset byte combinations (1,2), (1,5), (1,6) and (1,4) occur for the preset possible value 1, each occurring once, so the corresponding first occurrence frequency sum is 4. Similarly, for the preset possible value 2, the first occurrence frequency sum of the corresponding first preset byte combinations is 3. For other preset possible values, such as 0, 3 and 4, no corresponding first preset byte combination occurs when the combination selection is performed on the sample data to be detected according to the preset byte selection rule, so the corresponding first occurrence frequency sum is 0.
In step S104 in this example, the first frequency sums corresponding to the 256 preset possible values may be spliced to obtain a 256-dimensional target data feature vector.
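The counting in example one can be sketched as follows (a non-authoritative illustration; it assumes the combined selection result is already available as a list of byte-value pairs, and the function name first_occurrence_sums is hypothetical):

    def first_occurrence_sums(pairs):
        # For each preset possible value v in 0..255, accumulate the first occurrence
        # frequency sum: the number of selected combinations whose first byte value is v.
        sums = [0] * 256
        for first, _second in pairs:
            sums[first] += 1
        return sums  # splicing these 256 sums gives the 256-dimensional vector of example one

    pairs = [(1, 2), (1, 5), (1, 6), (1, 4), (2, 5), (2, 6), (2, 4), (5, 6), (5, 4), (6, 4)]
    vector = first_occurrence_sums(pairs)
    assert vector[1] == 4 and vector[2] == 3 and vector[0] == 0  # matches the sums above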
Example two:
in this example, step S103 includes:
and aiming at each preset possible value corresponding to a single byte, determining a second occurrence frequency sum of all second preset byte combinations corresponding to the preset possible value according to a combination selection result of the sample data to be detected, wherein the second preset byte combination is a combination of which the preset possible value is positioned at a second position in the byte combinations.
At this time, the corresponding step S104 includes:
and determining a target data characteristic vector corresponding to the sample data to be detected according to the second occurrence frequency sum corresponding to each preset possible value.
Similarly to the manner of setting the first preset byte combinations described above, for each preset possible value, when it is located at the second bit of a byte combination, there are 256 possibilities for the first bit of the combination. Therefore, preferably, 256 second preset byte combinations may be set for each preset possible value; for example, for the preset possible value 0, the corresponding second preset byte combinations may be (0,0), (1,0), (2,0), …, (255,0).
Similarly, based on the above assumed scenario, for the preset possible value 5, the corresponding second preset byte combination is (0,5), (1,5), (2,5), …, (255,5), and when the combination selection is performed on the sample data to be tested according to the preset byte selection rule, for the preset possible value 5, only the 2 second preset byte combinations of (1,5), (2,5) occur, and the occurrence frequency of each corresponding second preset byte combination is 1, so that the second occurrence frequency sum of the corresponding second preset byte combination is 2.
In step S104 in this example, the second frequency sums corresponding to the 256 preset possible values may be spliced to obtain a 256-dimensional target data feature vector.
Example three:
in this example, the schemes of the first example and the second example may be combined, that is, in step S103, for each preset possible value corresponding to a single byte, according to the combination selection result of the sample data to be detected, the first frequency sum of all the first preset byte combinations corresponding to the preset possible value and the second frequency sum of all the second preset byte combinations corresponding to the preset possible value may be determined.
In step S104 in this example, the first occurrence frequency sums and the second occurrence frequency sums corresponding to the 256 preset possible values may be spliced to obtain a 512-dimensional target data feature vector.
Example four:
in this example, step S103 includes:
aiming at each preset possible value corresponding to a single byte, determining a first occurrence frequency of each first preset byte combination corresponding to the preset possible value and a second occurrence frequency of each second preset byte combination corresponding to the preset possible value according to a combination selection result of sample data to be detected; the first predetermined byte combination is a combination in which the predetermined possible value is located at a first bit in the byte combination, and the second predetermined byte combination is a combination in which the predetermined possible value is located at a second bit in the byte combination.
At this time, the corresponding step S104 includes:
and determining a target data characteristic vector corresponding to the sample data to be detected according to the first occurrence frequency and the second occurrence frequency corresponding to each preset possible value.
The difference between this example and the above three examples is that in step S103, for each preset possible value, the first occurrence frequency of each first preset byte combination corresponding to the preset possible value and the second occurrence frequency of each second preset byte combination need to be counted. Assuming that the number of the preset possible values is 256, and the number of the first preset byte combination and the number of the second preset byte combination corresponding to each preset possible value are both 256, each preset possible value has 256 first occurrence frequencies and 256 second occurrence frequencies, in step S104, all the first occurrence frequencies and all the second occurrence frequencies corresponding to each preset possible value may be spliced, so as to obtain a 256 × 256-dimensional target data feature vector.
The principle of the data classification method provided by the embodiment of the present application is explained here. Plaintext data is usually a meaningful set of characters or bits, so two adjacent characters usually vary within a small range; in other words, the latter character can be regarded as being transformed from the former character with a certain probability or frequency. Data encrypted with a symmetric or asymmetric encryption algorithm, by contrast, is obtained by performing multiple rounds of operations such as shifting, permutation and exclusive-or on the plaintext data and a key, and no longer represents a meaningful set of characters or bits, so the variation between adjacent characters is relatively large. Although the latter character can still be regarded as being transformed from the former character, the transition frequency between two adjacent characters is significantly smaller than that between two adjacent characters in plaintext data.
Both plaintext data and ciphertext data can be viewed simply as a byte sequence, and the value of a single byte falls within [0, 255]. A byte sequence of plaintext data or ciphertext data can therefore be regarded as the former byte in [0, 255] being transformed into the latter byte in [0, 255] at a certain transition frequency; the difference is that the transition frequency between any two given bytes of plaintext data is typically greater than the transition frequency between any two given bytes of ciphertext data. The data classification method provided by the embodiment of the present application uses the occurrence frequency information corresponding to the preset possible values to characterize the transition frequency between two bytes, and classifies plaintext data and ciphertext data through this difference in transition frequency.
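To make this intuition concrete, a small side experiment along the following lines could be run (it is not part of the patent; os.urandom is used only as a stand-in for ciphertext-like data, and the function name transition_counts is hypothetical):

    import os
    from collections import Counter

    def transition_counts(data: bytes) -> Counter:
        # Count how often each (previous byte, next byte) transition occurs.
        return Counter(zip(data[:-1], data[1:]))

    plaintext = ("the quick brown fox jumps over the lazy dog " * 200).encode("ascii")
    random_bytes = os.urandom(len(plaintext))  # ciphertext-like stand-in

    # Plaintext concentrates its transitions on relatively few distinct byte pairs,
    # each occurring often; random (ciphertext-like) data spreads its transitions
    # over many distinct pairs, each occurring rarely.
    print("distinct transitions, plaintext:  ", len(transition_counts(plaintext)))
    print("distinct transitions, random data:", len(transition_counts(random_bytes)))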
In this embodiment, the first occurrence frequency sum, the second occurrence frequency sum, the first occurrence frequency or the second occurrence frequency may be calculated by constructing a byte transfer matrix.
Specifically, the byte transfer matrix may be constructed in the following manner:
for the k-th pair of values combinedly selected from the sample data to be detected, a corresponding byte transfer matrix H_k is calculated, where H_k is obtained by adding 1 to element (m_k + 1, n_k + 1) of the byte transfer matrix H_(k-1), H_0 is the 256 × 256 zero matrix serving as the initial matrix, m_k denotes the value corresponding to the first byte of the two bytes selected in the k-th combination, and n_k denotes the value corresponding to the second byte of the two bytes selected in the k-th combination; H_k thus denotes the byte transfer matrix corresponding to the k-th combined selection from the sample data to be detected;
after the combined selection of the sample data to be detected is completed according to the preset byte selection rule and the final target byte transfer matrix is obtained through calculation, the corresponding first occurrence frequency sum, second occurrence frequency sum, first occurrence frequency or second occurrence frequency can be calculated from the target byte transfer matrix for each preset possible value.
For example, the elements of the target byte transfer matrix may be added in rows to obtain a first frequency sum corresponding to each preset possible value, and the elements of the target byte transfer matrix may be added in columns to obtain a second frequency sum corresponding to each preset possible value; at this time, the intermediate data feature vector may be obtained by splicing the first frequency sum and the second frequency sum corresponding to each preset possible value, and the target data feature vector may be obtained according to the intermediate data feature vector.
Alternatively, the elements of the target byte transfer matrix may be used directly: each element value represents the occurrence frequency of the corresponding preset byte combination, i.e. the first occurrence frequency and the second occurrence frequency. In that case, the target byte transfer matrix may be spliced directly by rows or by columns to obtain the intermediate data feature vector, and the target data feature vector is obtained from the intermediate data feature vector.
In this embodiment, normalization processing may be performed on the occurrence frequency information corresponding to each preset possible value according to the number of times of combination selection on the sample data to be detected, and a target data feature vector corresponding to the sample data to be detected may be determined based on the information after the normalization processing.
Therefore, for the intermediate data feature vector, normalization processing can be performed on the intermediate data feature vector according to the combination selection times of the sample data to be detected, so as to obtain a target data feature vector, and at this time, each element in the target data feature vector represents the occurrence frequency of the corresponding preset byte combination.
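A consolidated sketch of the feature construction described above (assuming the consecutive-byte selection rule and the 512-dimensional splicing of row and column sums; the NumPy-based function target_feature_vector is a hypothetical name, not the patent's own implementation):

    import numpy as np

    def target_feature_vector(data: bytes) -> np.ndarray:
        H = np.zeros((256, 256), dtype=np.float64)   # H_0: the 256 x 256 zero matrix
        pairs = list(zip(data[:-1], data[1:]))       # preset rule: two consecutive bytes
        for m_k, n_k in pairs:                       # k-th combined selection
            H[m_k, n_k] += 1.0                       # element (m_k + 1, n_k + 1) in 1-based terms

        first_sums = H.sum(axis=1)     # row sums    -> first occurrence frequency sums
        second_sums = H.sum(axis=0)    # column sums -> second occurrence frequency sums
        intermediate = np.concatenate([first_sums, second_sums])   # 512-dimensional
        return intermediate / max(len(pairs), 1)     # normalize by the number of selections

    # e.g. target_feature_vector(b"hello world").shape == (512,)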
S105: and inputting the target data feature vector into a preset plaintext-ciphertext data classification model to obtain a classification result of the sample data to be detected.
The plaintext-ciphertext data classification model in this embodiment is a preset model that can classify plaintext data and ciphertext data of sample data to be detected.
The embodiment also provides a plaintext-ciphertext data classification model training method, please refer to fig. 2, which includes:
s201: and acquiring training sample data with a plaintext label and training sample data with a ciphertext label.
S202: and according to a combination selection result of the training sample data, determining occurrence frequency information corresponding to each preset possible value in the training sample data.
S203: and determining the data characteristic vector corresponding to each training sample data according to the occurrence frequency information corresponding to each training sample data.
The process of determining the data feature vector in the model training process is similar to the above-described manner of determining the target data feature vector, and is not repeated here.
S204: and training according to the data feature vector to obtain a plaintext-ciphertext data classification model for distinguishing plaintext data from ciphertext data.
In step S204, the data feature vectors may be divided into a training set and a test set, and a machine learning algorithm is selected for training and testing, wherein the machine learning algorithm may be a supervised learning algorithm such as a decision tree algorithm, an SVM algorithm, a naive bayesian algorithm, etc., and the model parameters are adjusted to select one or more better classification models on the test set as the final classification model.
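A hedged sketch of this training step using scikit-learn (the specific estimators, split ratio and accuracy metric are illustrative choices rather than requirements of the patent; it assumes the feature matrix X and label vector y have already been built from the labeled training sample data):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    def train_plain_cipher_classifier(X, y):
        # Divide the data feature vectors into a training set and a test set.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=0)

        candidates = {
            "decision tree": DecisionTreeClassifier(),
            "svm": SVC(),
            "naive bayes": GaussianNB(),
        }
        scores = {}
        for name, model in candidates.items():
            model.fit(X_train, y_train)                                    # train
            scores[name] = accuracy_score(y_test, model.predict(X_test))   # test-set score

        # Keep the classification model that performs better on the test set.
        best = max(scores, key=scores.get)
        return candidates[best], scores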
Example two:
in order to better understand the scheme provided by the present application, an embodiment of the present application provides a more specific data classification method, please refer to fig. 3, which includes the following steps:
s301: a 256 x 256 dimensional matrix is constructed with all elements of the matrix initialized to 0.
S302: for the sample data to be tested with length N, starting from the i-th byte, take two consecutive bytes, recording the value of the first byte as m and the value of the second byte as n.
Wherein, i = 1, 2, …, N - 1.
S303: the element at position (m + 1, n + 1) of the 256 × 256-dimensional matrix is incremented by 1.
S304: it is determined whether i + 1 is equal to N; if not, i is incremented by 1 and S302 is performed; if so, S305 is performed.
S305: the 256 x 256 dimensional matrix elements are added row by row to obtain 256 first sums.
S306: the 256 x 256 dimensional matrix elements are added column by column to obtain 256 second sums.
The first sums in step S305 and the second sums in step S306 are the first occurrence frequency sums and the second occurrence frequency sums in the first embodiment, respectively.
S307: and splicing the 256 first sum values and the 256 second sum values to obtain 512-dimensional intermediate data feature vectors.
S308: and dividing the 512-dimensional intermediate data feature vector by N-1 for normalization to obtain a target data feature vector.
S309: and inputting the target data feature vector into a preset plaintext-ciphertext data classification model to obtain a classification result of the sample data to be detected.
In this embodiment, the content above describes the stage of performing classification prediction on unknown data; the stage of training the classification model is described below.
Step one: a training sample is constructed.
For a batch of plaintext data, various encryption algorithms are used to encrypt the data to obtain ciphertext data. The plaintext data are labeled 'plaintext', the ciphertext data are labeled 'ciphertext', and the two groups of labeled data are used as training sample data.
Step two: a data feature vector is calculated.
For each piece of data in the two groups of data, the corresponding 512-dimensional data feature vector is calculated in a manner similar to the calculation of the target data feature vector above.
Step three: and constructing a feature matrix.
The data feature vectors of all the data are combined to form a feature matrix. At the same time, the label of each data feature vector in the feature matrix is recorded to obtain a label vector.
Step four: and training a classification model.
The data feature vectors are divided into a training set and a test set, and a machine learning algorithm is selected for training and testing; the selectable machine learning algorithms include supervised learning algorithms such as the decision tree algorithm, the SVM algorithm and the naive Bayes algorithm. Model parameters are adjusted, and one or more classification models that perform better on the test set are selected as the final classification model.
Example three:
an embodiment of the present application provides a data classification apparatus, please refer to fig. 4, including:
an obtaining module 41, configured to obtain sample data to be tested;
a selecting module 42, configured to sequentially perform combination selection of two bytes from sample data to be detected according to a preset byte selection rule;
a first determining module 43, configured to determine, for each preset possible value corresponding to a single byte, occurrence frequency information corresponding to the preset possible value according to a combination selection result of sample data to be detected;
a second determining module 44, configured to determine, according to each occurrence frequency information, a target data feature vector corresponding to the sample data to be detected;
and the classification module 45 is used for inputting the target data feature vector into a preset plaintext-ciphertext data classification model to obtain a classification result of the sample data to be detected.
In an exemplary embodiment, the selecting module 42 is configured to sequentially take values corresponding to an ith byte and an (i + n) th byte from sample data to be detected as a combination until the sample data to be detected is traversed, where i and n are integers greater than or equal to 1, preferably, n is equal to 1, which is equivalent to sequentially taking two consecutive bytes from the sample data to be detected, so that more combinations can be obtained, and a classification result is more accurate.
Generally speaking, the value corresponding to each single byte is between [0,255], so in this embodiment, 256 possible values may be preset for a single byte, and the 256 possible values are {0,1, …,255 }. The first determining module 43 is configured to determine, according to the combined selection result of the selection module 42 on the sample data to be detected, occurrence frequency information corresponding to each preset possible value in {0,1, …,255 }.
In an exemplary embodiment, the first determining module 43 is configured to determine, for each preset possible value corresponding to a single byte, a first occurrence frequency sum of all first preset byte combinations corresponding to the preset possible value and/or a second occurrence frequency sum of all second preset byte combinations corresponding to the preset possible value according to a combination selection result of the sample data to be detected; the first predetermined byte combination is a combination in which the predetermined possible value is located at a first bit in the byte combination, and the second predetermined byte combination is a combination in which the predetermined possible value is located at a second bit in the byte combination. Correspondingly, the second determining module 44 is configured to determine the target data feature vector corresponding to the sample data to be detected according to the first occurrence frequency sum and/or the second occurrence frequency sum corresponding to each preset possible value.
In an exemplary embodiment, the first determining module 43 is configured to determine, for each preset possible value corresponding to a single byte, a first frequency of occurrence of each first preset byte combination corresponding to the preset possible value and a second frequency of occurrence of each second preset byte combination corresponding to the preset possible value according to a combination selection result of sample data to be detected; the first predetermined byte combination is a combination in which the predetermined possible value is located at a first bit in the byte combination, and the second predetermined byte combination is a combination in which the predetermined possible value is located at a second bit in the byte combination. Correspondingly, the second determining module 44 is configured to determine the target data feature vector corresponding to the sample data to be detected according to the first occurrence frequency and the second occurrence frequency corresponding to each preset possible value.
In an exemplary embodiment, the data classification device further includes a normalization module, configured to perform normalization processing on the occurrence frequency information corresponding to each preset possible value according to the number of times of combination selection on the sample data to be detected, and determine a target data feature vector corresponding to the sample data to be detected based on the information after the normalization processing.
Example four:
based on the same inventive concept, an embodiment of the present application provides a terminal, please refer to fig. 5, which includes a processor 51 and a memory 52, where the memory 52 stores a computer program, and the processor 51 executes the computer program to implement the steps of the data classification method and/or the plaintext-ciphertext data classification model training method in the first and second embodiments, which are not described herein again.
It will be appreciated that the configuration shown in fig. 5 is merely illustrative and that the terminal may include more or fewer components than shown in fig. 5 or may have a different configuration than shown in fig. 5.
The processor may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application.
The memory may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The present embodiment further provides a computer-readable storage medium, such as a floppy disk, an optical disk, a hard disk, a flash memory, a USB flash drive, an SD card (Secure Digital Memory Card), an MMC card (Multimedia Card), etc., where one or more programs for implementing the above steps are stored in the computer-readable storage medium, and the one or more programs may be executed by one or more processors to implement the steps of the data classification method and/or the plaintext-ciphertext data classification model training method in the first and second embodiments, which will not be described herein again.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method of data classification, comprising:
obtaining sample data to be tested;
according to a preset byte selection rule, sequentially performing combination selection of two bytes from the sample data to be detected;
aiming at each preset possible value corresponding to a single byte, determining occurrence frequency information corresponding to the preset possible value according to a combined selection result of the sample data to be detected;
determining a target data characteristic vector corresponding to the sample data to be detected according to each occurrence frequency information;
and inputting the target data feature vector into a preset plaintext-ciphertext data classification model to obtain a classification result of the sample data to be detected.
2. The data classification method according to claim 1, wherein the determining, for each preset possible value corresponding to a single byte, frequency of occurrence information corresponding to the preset possible value according to a combined selection result of the sample data to be tested comprises:
aiming at each preset possible value corresponding to a single byte, determining a first occurrence frequency sum of all first preset byte combinations corresponding to the preset possible value and/or a second occurrence frequency sum of all second preset byte combinations corresponding to the preset possible value according to a combination selection result of the sample data to be detected; the first preset byte combination is a combination of which the preset possible value is located at the first bit in the byte combination, and the second preset byte combination is a combination of which the preset possible value is located at the second bit in the byte combination;
the determining the target data feature vector corresponding to the sample data to be detected according to each occurrence frequency information includes:
and determining a target data characteristic vector corresponding to the sample data to be detected according to the first occurrence frequency sum and/or the second occurrence frequency sum corresponding to each preset possible value.
3. The data classification method according to claim 2, wherein the determining, for each preset possible value corresponding to a single byte, a first frequency sum of all first preset byte combinations corresponding to the preset possible value and/or a second frequency sum of all second preset byte combinations corresponding to the possible value according to the combination selection result of the sample data to be detected includes:
aiming at the two values combinedly selected from the sample data to be detected at the k-th time, calculating a corresponding byte transfer matrix H_k; wherein H_k is obtained by adding 1 to element (m_k + 1, n_k + 1) of the byte transfer matrix H_(k-1), H_0 is a 256 × 256 zero matrix, m_k denotes the value corresponding to the first byte of the two bytes selected in the k-th combination, and n_k denotes the value corresponding to the second byte of the two bytes selected in the k-th combination,
after completing the combined selection of the sample data to be detected according to the preset byte selection rule and calculating to obtain a final target byte transfer matrix, adding elements of the target byte transfer matrix according to rows to obtain the first occurrence frequency sum corresponding to each preset possible value, and adding elements of the target byte transfer matrix according to columns to obtain the second occurrence frequency sum corresponding to each preset possible value;
determining a target data feature vector corresponding to the sample data to be tested according to the first frequency sum and/or the second frequency sum corresponding to each preset possible value, including:
splicing the first frequency sum and the second frequency sum corresponding to each preset possible value to obtain an intermediate data feature vector;
and obtaining the target data characteristic vector according to the intermediate data characteristic vector.
4. The data classification method according to claim 1, wherein the determining, for each preset possible value corresponding to a single byte, frequency of occurrence information corresponding to the preset possible value according to a combined selection result of the sample data to be tested comprises:
aiming at each preset possible value corresponding to a single byte, determining a first occurrence frequency of each first preset byte combination corresponding to the preset possible value and a second occurrence frequency of each second preset byte combination corresponding to the preset possible value according to a combination selection result of the sample data to be detected; the first preset byte combination is a combination of which the preset possible value is located at the first bit in the byte combination, and the second preset byte combination is a combination of which the preset possible value is located at the second bit in the byte combination;
the determining the target data feature vector corresponding to the sample data to be detected according to each occurrence frequency information includes:
and determining a target data characteristic vector corresponding to the sample data to be detected according to the first occurrence frequency and the second occurrence frequency corresponding to each preset possible value.
5. The data classification method according to claim 1, wherein the determining a target data feature vector corresponding to the sample data to be tested according to each occurrence frequency information comprises:
and normalizing the frequency of occurrence information according to the combined selection times of the sample data to be detected, and determining a target data characteristic vector corresponding to the sample data to be detected based on the normalized information.
6. The data classification method according to any one of claims 1 to 5, wherein the performing, according to a preset byte selection rule, combined selection of two bytes from the sample data to be tested in sequence comprises:
and sequentially taking two continuous bytes from the sample data to be detected until the sample data to be detected is traversed.
7. A plaintext-ciphertext data classification model training method is characterized by comprising the following steps:
acquiring training sample data with a plaintext label and training sample data with a ciphertext label;
sequentially performing combination selection of two bytes from the training sample data according to a preset byte selection rule aiming at each training sample data, and determining occurrence frequency information corresponding to each preset possible value in the training sample data according to a combination selection result of the training sample data aiming at each preset possible value corresponding to a single byte;
determining a data characteristic vector corresponding to each training sample data according to the occurrence frequency information corresponding to each training sample data;
and training according to the data characteristic vector to obtain a plaintext-ciphertext data classification model for distinguishing plaintext data from ciphertext data.
8. A data sorting apparatus, comprising:
the acquisition module is used for acquiring sample data to be detected;
the selecting module is used for sequentially performing combined selection of two bytes from the sample data to be detected according to a preset byte selection rule;
the first determining module is used for determining occurrence frequency information corresponding to each preset possible value corresponding to a single byte according to a combined selection result of the sample data to be detected;
a second determining module, configured to determine, according to each occurrence frequency information, a target data feature vector corresponding to the sample data to be detected;
and the classification module is used for inputting the target data feature vector into a preset plaintext-ciphertext data classification model to obtain a classification result of the sample data to be detected.
9. A terminal, characterized in that it comprises a processor and a memory, in which a computer program is stored, which computer program is executed by the processor to implement the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by at least one processor, implements the method according to any one of claims 1-7.
CN202111566209.0A 2021-12-20 2021-12-20 Data classification method, model training method, device, terminal and storage medium Pending CN114254171A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111566209.0A CN114254171A (en) 2021-12-20 2021-12-20 Data classification method, model training method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111566209.0A CN114254171A (en) 2021-12-20 2021-12-20 Data classification method, model training method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN114254171A true CN114254171A (en) 2022-03-29

Family

ID=80793275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111566209.0A Pending CN114254171A (en) 2021-12-20 2021-12-20 Data classification method, model training method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114254171A (en)

Similar Documents

Publication Publication Date Title
CN110263538B (en) Malicious code detection method based on system behavior sequence
CN115982765A (en) Data desensitization method, device, equipment and computer readable storage medium
CN102236675A (en) Method for processing matched pairs of characteristic points of images, image retrieval method and image retrieval equipment
CN115269304A (en) Log anomaly detection model training method, device and equipment
CN116071077B (en) Risk assessment and identification method and device for illegal account
CN117041017B (en) Intelligent operation and maintenance management method and system for data center
CN113806350B (en) Management method and system for improving security of big data transaction platform
CN111639360A (en) Intelligent data desensitization method and device, computer equipment and storage medium
CN109615080B (en) Unsupervised model evaluation method and device, server and readable storage medium
Stibor et al. Geometrical insights into the dendritic cell algorithm
CN112926647A (en) Model training method, domain name detection method and device
CN114254171A (en) Data classification method, model training method, device, terminal and storage medium
CN111488574A (en) Malicious software classification method, system, computer equipment and storage medium
CN112769540B (en) Diagnosis method, system, equipment and storage medium for side channel information leakage
CN115774784A (en) Text object identification method and device
US20230259756A1 (en) Graph explainable artificial intelligence correlation
CN113076451B (en) Abnormal behavior identification and risk model library establishment method and device and electronic equipment
CN114741697A (en) Malicious code classification method and device, electronic equipment and medium
CN113961752A (en) Method and related device for analyzing basic reachability graph of information physical system
CN113722485A (en) Abnormal data identification and classification method, system and storage medium
CN114726599B (en) Artificial intelligence algorithm-based intrusion detection method and device in software defined network
CN111339360B (en) Video processing method, video processing device, electronic equipment and computer readable storage medium
CN117220915A (en) Flow analysis method and device and electronic equipment
CN117312974A (en) Classification method and monitoring method for KPIs (key performance indicator) sequences
CN115293271A (en) Training method, device and equipment of prediction model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination