CN112784910A - Deep filtering method and system for junk data

Info

Publication number: CN112784910A
Application number: CN202110122376.XA
Authority: CN
Inventors: 蒙政先; 蔡楚才
Original assignee: Wuhan Bochang Software Development Co ltd
Current assignee: Wuhan Bochang Software Development Co ltd
Priority date: 2021-01-28
Filing date: 2021-01-28
Publication date: 2021-05-11

Abstract

The invention discloses a method and a system for deep filtering of junk data. The method comprises the following steps: acquiring network data and performing five-tuple preliminary filtering on the network data; vectorizing the text of the preliminarily filtered data, clustering the vectorized text with an improved k-means clustering algorithm, determining the data source, and performing secondary filtering based on the data source; and performing deep content filtering based on a convolutional neural network. The invention realizes multi-level deep filtering of illegal data and junk data, ensures data security, and improves filtering precision.

Description

Deep filtering method and system for junk data

Technical Field

The invention relates to the technical field of data filtering, and in particular to a deep data filtering method and system.

Background

With the development of information industries such as computing and the Internet of Things, massive data streams flow across networks continuously. With the rise of big data, processing, monitoring, and filtering this data by means of big data analysis can improve related business services efficiently and accurately.

Existing data filtering is mostly protocol-based: it matches on one or more fields of the five-tuple, that is, the source IP address, source port, destination IP address, destination port, and transport layer protocol. With blacklist filtering, data that matches a record in the blacklist is rejected and/or discarded, and all other data is allowed to pass. With whitelist filtering, data that matches a record in the whitelist is allowed to pass, and all other data is rejected or discarded. These modes protect data security to a certain extent, but they have only a mediocre filtering effect on junk data in massive network data.
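The blacklist and whitelist behaviour described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `FiveTuple`, `Rule`, `matches`, and `passes` are hypothetical, and treating an unset rule field as a wildcard is an assumption made so that a rule can match on "one or more" of the five fields.

```python
from typing import NamedTuple, Optional, Sequence


class FiveTuple(NamedTuple):
    """The five header fields of one packet or flow."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


class Rule(NamedTuple):
    """One list record; a field left as None matches anything (assumption)."""
    src_ip: Optional[str] = None
    src_port: Optional[int] = None
    dst_ip: Optional[str] = None
    dst_port: Optional[int] = None
    protocol: Optional[str] = None


def matches(rule: Rule, pkt: FiveTuple) -> bool:
    # Both tuples share the same field order, so they can be compared pairwise.
    return all(r is None or r == p for r, p in zip(rule, pkt))


def passes(pkt: FiveTuple, rules: Sequence[Rule], mode: str) -> bool:
    """Return True if the packet is allowed through the preliminary filter."""
    hit = any(matches(rule, pkt) for rule in rules)
    if mode == "blacklist":
        return not hit   # a matching record means reject or discard
    if mode == "whitelist":
        return hit       # only matching records may pass
    raise ValueError(f"unknown filter mode: {mode}")
```

For example, a single rule `Rule(dst_port=25, protocol="TCP")` used as a blacklist rejects all TCP traffic to port 25 regardless of the other three fields.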

Disclosure of Invention

In view of this, the invention provides a method and a system for deep filtering of junk data, which solve the problem that existing data filtering methods cannot effectively filter illegal data.

In a first aspect of the present invention, a method for deep filtering of garbage data is disclosed, the method comprising:

acquiring network data, and performing five-tuple preliminary filtering on the network data;

performing text vectorization representation on the data subjected to preliminary filtering, performing clustering division on the text subjected to the vectorization representation by adopting an improved k-means clustering algorithm, determining a data source, and performing secondary filtering based on the data source;

and performing deep content filtering based on a convolutional neural network.

Preferably, the five-tuple preliminary filtering includes filtering of a source IP address, a destination IP address, a source port number, a destination port number, and a transport protocol type.

Preferably, the performing text vectorization representation on the data after the preliminary filtering, performing cluster division on the text represented by the vectorization representation, and determining the data source specifically includes:

restoring the preliminarily filtered network data packet based on different protocols to obtain a binary file;

extracting features of the binary file based on a bag-of-words model to obtain a text feature vector, completing the text vectorization;

obtaining a standard text vector set, dividing the standard text vector set into a plurality of clusters by adopting an improved k-means clustering algorithm, determining the center point of each cluster, calculating the cluster to which the vectorized text belongs, and determining the data source within that cluster by similarity calculation.
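The bag-of-words step can be illustrated with a short sketch. It assumes the restored binary file decodes to UTF-8 text and uses a simple ASCII word tokenizer; both are assumptions for illustration (Chinese text, for instance, would need a word segmenter), and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def tokenize(raw: bytes) -> List[str]:
    """Decode a restored payload and split it into lowercase word tokens."""
    text = raw.decode("utf-8", errors="ignore").lower()
    return re.findall(r"[a-z0-9]+", text)


def build_vocabulary(documents: Iterable[bytes]) -> Dict[str, int]:
    """Map every token seen in the reference documents to a vector index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(raw: bytes, vocab: Dict[str, int]) -> List[int]:
    """Bag-of-words vector: one term count per vocabulary entry."""
    counts = Counter(tokenize(raw))
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:            # out-of-vocabulary tokens are ignored
            vector[vocab[tok]] = n
    return vector
```

Word order is discarded by design; only term counts survive, which is what lets the later distance and similarity calculations operate on fixed-length vectors.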

Preferably, the dividing the training sample set into a plurality of clusters by using the improved k-means clustering algorithm specifically includes:

optimizing the number of clusters of the k-means clustering algorithm through a particle swarm algorithm: setting a population size and a boundary, initializing the number of clusters within the boundary range, performing k-means clustering based on the different numbers of clusters, calculating the fitness of the different clustering results, calculating the optimal position according to the fitness and updating the individual positions, and iterating until the optimal position is obtained as the number of clusters K, wherein the optimization target of the particle swarm algorithm is to minimize the sum, over all clusters, of the mean intra-cluster distance.
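A minimal sketch of searching for the cluster number with a particle swarm is given below. The text does not specify the position-update rule, so the standard inertia/cognitive/social velocity update and the coefficients 0.5 and 1.5 are assumptions, as are all function names. The fitness is the sum over clusters of the mean intra-cluster distance, as stated above.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]

    def assign():
        return [min(range(k), key=lambda j: math.dist(p, centers[j]))
                for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign()


def fitness(points, k):
    """Sum over clusters of the mean distance to the cluster center."""
    centers, labels = kmeans(points, k)
    total = 0.0
    for j in range(k):
        d = [math.dist(p, centers[j]) for p, lab in zip(points, labels) if lab == j]
        if d:
            total += sum(d) / len(d)
    return total


def pso_select_k(points, n_particles=6, lower=2, upper=8, iters=10, seed=0):
    """Each particle's position is one candidate cluster number in [lower, upper]."""
    rng = random.Random(seed)
    cache = {}

    def score(k):
        if k not in cache:
            cache[k] = fitness(points, k)
        return cache[k]

    pos = [rng.randint(lower, upper) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]
    gbest = min(pos, key=score)
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Assumed standard PSO velocity update (not given in the text).
            vel[i] = 0.5 * vel[i] + 1.5 * r1 * (pbest[i] - pos[i]) + 1.5 * r2 * (gbest - pos[i])
            # Round to an integer cluster number and clamp to the boundary.
            pos[i] = max(lower, min(upper, round(pos[i] + vel[i])))
            if score(pos[i]) < score(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest, key=score)
    return gbest
```

The upper boundary must not exceed the number of sample vectors, because each k-means run draws its k initial centers from the samples.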

Preferably, the deep content filtering based on the convolutional neural network specifically includes:

performing data augmentation on the standard text vector set with a conditional generative adversarial network to obtain a training set;

training a convolutional neural network classification model through the training set, inputting the data after secondary filtering into the convolutional neural network model, and performing deep content filtering according to a classification result.
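How a classification result drives the final filtering decision can be sketched with the forward pass of a very small one-dimensional convolutional classifier. This is only an illustration under stated assumptions: the kernels and output weights are hypothetical hand-set values rather than trained parameters, training and the data-augmentation step are omitted, and the layer layout (one convolution, ReLU, global max pooling, sigmoid output) is an assumed architecture not specified in the text.

```python
import math
from typing import List, Sequence


def conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Valid (no padding) 1-D convolution, as used in CNN layers."""
    width = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(width))
            for i in range(len(x) - width + 1)]


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def junk_probability(x, kernels, out_weights, out_bias) -> float:
    """Conv -> ReLU -> global max pool per kernel -> linear -> sigmoid."""
    pooled = [max(relu(conv1d(x, k))) for k in kernels]
    z = sum(w * p for w, p in zip(out_weights, pooled)) + out_bias
    return 1.0 / (1.0 + math.exp(-z))


def content_filter(x, kernels, out_weights, out_bias, threshold=0.5) -> bool:
    """Return True if the item is kept, False if it is filtered out as junk."""
    return junk_probability(x, kernels, out_weights, out_bias) < threshold
```

With the single hypothetical kernel `[1.0, 1.0]`, output weight `2.0`, and bias `-1.0`, an input containing two adjacent active features is scored above the threshold and filtered, while an all-zero input is kept.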

In a second aspect of the present invention, a garbage data depth filtering system is disclosed, the system comprising:

a preliminary filtering module: acquiring network data, and performing five-tuple preliminary filtering on the network data;

a data source filtering module: performing text vectorization representation on the data subjected to preliminary filtering, and performing clustering division on the text subjected to vectorization representation by adopting an improved k-means clustering algorithm to determine a data source;

a content filtering module: and performing deep content filtering based on the convolutional neural network.

Preferably, the data source filtering module specifically includes:

a vectorization unit: extracting features of the binary file based on a bag-of-words model to obtain a text feature vector, completing the text vectorization;

a clustering unit: acquiring a standard text vector set, dividing the standard text vector set into a plurality of clusters by adopting an improved k-means clustering algorithm, determining the central point of each cluster, and calculating the cluster to which the vectorized text belongs;

a data source filtering unit: and determining a data source in the belonged cluster in a similarity calculation mode, and carrying out secondary filtering.

Preferably, the content filtering module specifically includes:

a data enhancement unit: performing data augmentation on the standard text vector set with a conditional generative adversarial network to obtain a training set;

a depth filtering unit: training a convolutional neural network classification model through the training set, inputting the data after secondary filtering into the convolutional neural network model, and performing deep content filtering according to a classification result.

Compared with the prior art, the invention has the following beneficial effects:

the method performs five-tuple preliminary filtering on network data, traces the data source based on an improved clustering algorithm, and performs secondary filtering based on the data source; it then augments the data set with a conditional generative adversarial network and performs deep content filtering based on the convolutional neural network model.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a garbage data depth filtering method according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

As shown in fig. 1, the present invention provides a method for deep filtering garbage data, where the method includes:

S1, acquiring network data, and performing five-tuple preliminary filtering on the network data; the five-tuple preliminary filtering includes filtering of a source IP address, a destination IP address, a source port number, a destination port number, and a transport protocol type.

S2, performing text vectorization representation on the data after preliminary filtering, performing clustering division on all vectorization represented texts by adopting an improved k-means clustering algorithm, determining a data source, and performing secondary filtering based on the data source; the method specifically comprises the following steps:

S21, restoring the preliminarily filtered network data packet based on different protocols to obtain a binary file;

S22, extracting features of the binary file based on the bag-of-words model to obtain a text feature vector, completing the text vectorization;

S23, obtaining a standard text vector set, and dividing the standard text vector set into a plurality of clusters by adopting a k-means clustering algorithm improved by a particle swarm algorithm; specifically:

setting a population size N and a boundary range [L, U], initializing the population positions within the boundary range, and obtaining N different initial cluster numbers;

performing k-means clustering on the standard text vector set for each candidate cluster number, calculating and ranking the fitness of the resulting clusterings, obtaining the optimal position from the fitness ranking, and updating the individual positions;

iterating until a termination condition is reached, and taking the optimal individual position as the optimal cluster number K, wherein the optimization target of the particle swarm algorithm is that the sum of the mean intra-cluster distances of all clusters is minimal and the inter-cluster distance is maximal;

and performing K-means clustering on the standard text vector set based on the number K of the optimal clustering clusters, dividing the standard text vector set into K clusters, and determining the central point of each cluster.

The number of clusters for the k-means clustering algorithm usually has to be given in advance. Different cluster numbers strongly affect the partitioning result, and the number is hard to determine in advance for unknown or large data sets. The invention therefore first optimizes the cluster number with the particle swarm algorithm and then, based on the optimal cluster number, calculates the optimal combination of cluster centers with the sooty tern optimization algorithm. Optimizing both the cluster number and the cluster centers yields accurate cluster division, which improves the accuracy of data tracing and reduces secondary-filtering errors caused by inaccurate cluster division.

S24, based on the optimal cluster number, calculating the optimal combination of cluster centers through a k-means clustering algorithm improved by the sooty tern optimization algorithm, and determining the center point of each cluster;

specifically, the population size, the boundary, and the number of iterations of the sooty tern optimization algorithm are set; the population dimension equals the cluster number K, and the data of each dimension represents one cluster center position. The population is initialized within the boundary range; the migration operation and the attack operation are applied to the individuals in the population, their positions are updated, the fitness values are calculated, and the global optimum is recorded. If the number of iterations has not been reached, the migration operation, attack operation, position update, and fitness calculation are repeated. The optimal position finally obtained is taken as the optimal combination of cluster centers, which gives the center point of each cluster. The fitness function of the sooty tern optimization algorithm minimizes the sum of the intra-cluster distances over all clusters.
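The encoding and fitness used in S24 can be illustrated independently of the optimizer's update rules, which the text does not spell out. In this sketch an individual is a list of K candidate cluster centers, and its fitness is the sum of the distances from every sample to its nearest candidate center. The function names are hypothetical, and the migration and attack updates themselves are deliberately not implemented here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def intra_cluster_distance_sum(individual: Sequence[Vector],
                               samples: Sequence[Vector]) -> float:
    """Fitness of one individual: each sample contributes its distance
    to the nearest of the K candidate centers. Lower is better."""
    return sum(min(math.dist(s, center) for center in individual)
               for s in samples)


def best_individual(population: Sequence[Sequence[Vector]],
                    samples: Sequence[Vector]) -> Sequence[Vector]:
    """Pick the candidate center combination with the smallest fitness,
    which is what would be recorded as the global optimum per iteration."""
    return min(population, key=lambda ind: intra_cluster_distance_sum(ind, samples))


if __name__ == "__main__":
    samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    population = [
        [(0.0, 1.0), (10.0, 1.0)],   # one center per visible group
        [(5.0, 1.0), (5.0, 1.0)],    # both centers in the middle
    ]
    for individual in population:
        print(individual, intra_cluster_distance_sum(individual, samples))
```

Any population-based optimizer can then be plugged in around this fitness function: it only needs to propose new individuals and keep the one with the lowest score.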

S25, calculating the Euclidean distance between the vectorized text and the center point of each cluster, and selecting the cluster with the smallest Euclidean distance as the cluster to which the text belongs;

S26, determining the data source within that cluster by similarity calculation: calculating the cosine similarity between the vectorized text and each sample in the cluster, determining the data source according to the cosine similarity, and performing secondary filtering based on the data source.
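Steps S25 and S26 can be sketched as a nearest-center lookup followed by a cosine-similarity search inside the chosen cluster. The text does not state how the traced source is turned into a filtering decision, so the `blocked_sources` set and the function names below are assumptions for illustration.

```python
import math
from typing import List, Sequence, Set, Tuple

Vector = Sequence[float]
LabeledSample = Tuple[str, Vector]   # (data source label, sample vector)


def nearest_cluster(vec: Vector, centers: Sequence[Vector]) -> int:
    """S25: index of the cluster whose center is closest in Euclidean distance."""
    return min(range(len(centers)), key=lambda j: math.dist(vec, centers[j]))


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0   # convention for zero vectors
    return dot / (norm_a * norm_b)


def trace_source(vec: Vector, centers: Sequence[Vector],
                 clusters: Sequence[List[LabeledSample]]) -> Tuple[str, float]:
    """S26: most similar sample within the assigned cluster gives the source."""
    j = nearest_cluster(vec, centers)
    label, sample = max(clusters[j], key=lambda s: cosine_similarity(vec, s[1]))
    return label, cosine_similarity(vec, sample)


def secondary_filter(vec: Vector, centers: Sequence[Vector],
                     clusters: Sequence[List[LabeledSample]],
                     blocked_sources: Set[str]) -> bool:
    """Return True if the item passes; blocked_sources is a hypothetical policy."""
    label, _ = trace_source(vec, centers, clusters)
    return label not in blocked_sources
```

Restricting the cosine search to one cluster is the point of the two-stage design: only the samples in the nearest cluster are compared, not the whole standard text vector set.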

S3, performing deep content filtering based on the convolutional neural network.

Corresponding to the embodiment of the method, the invention also provides a garbage data depth filtering system, which comprises:

a preliminary filtering module: used for acquiring network data and performing five-tuple preliminary filtering on the network data;

a data source filtering module: used for vectorizing the text of the preliminarily filtered data, and clustering the vectorized text with an improved k-means clustering algorithm to determine the data source; the data source filtering module specifically comprises:

a data source filtering unit: and determining a data source in a similarity calculation mode in the belonged cluster, and performing secondary filtering.

A content filtering module: and performing deep content filtering based on the convolutional neural network. The content filtering module specifically comprises:

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for deep filtering of junk data, the method comprising:

and performing deep content filtering based on the convolutional neural network.

2. The method of claim 1, wherein the five-tuple prefiltering comprises filtering of a source IP address, a destination IP address, a source port number, a destination port number, and a transport protocol type.

3. The method according to claim 1, wherein the performing text vectorization on the preliminarily filtered data, performing cluster division on the text represented in the vectorization mode, and determining the data source specifically includes:

4. The method for deep filtering of garbage data according to claim 3, wherein the step of employing the improved k-means clustering algorithm to divide the training sample set into a plurality of clusters specifically comprises:

optimizing the clustering category number of the k-means clustering algorithm through a particle swarm algorithm: setting the population scale and the boundary of a particle swarm algorithm, initializing the number of clustering clusters in the boundary range, carrying out K-means clustering on a training sample set based on different numbers of clustering clusters, calculating the fitness under different clustering results, setting a fitness threshold, calculating an optimal position according to the fitness and carrying out individual position updating, carrying out iterative operation to finally obtain the optimal position meeting the fitness threshold as the number K of the clustering clusters, wherein the optimization target of the particle swarm algorithm is that the sum of the intra-class distance value means of each cluster is minimum.

5. The method for deep filtering of spam data according to claim 4, wherein the performing deep content filtering based on the convolutional neural network specifically comprises:

6. A spam data depth filtering system, the system comprising:

7. The system of claim 6, wherein the data source filtering module specifically comprises:

8. The system for deep filtering of spam data according to claim 7, wherein the content filtering module specifically comprises: