CN112380406A - Real-time network traffic classification method based on crawler technology - Google Patents

Real-time network traffic classification method based on crawler technology

Info

Publication number
CN112380406A
Authority
CN
China
Prior art keywords
data source
network traffic
real
feature
time network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011274274.1A
Other languages
Chinese (zh)
Other versions
CN112380406B (en)
Inventor
童瀛
周宇
梁剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Guangxin Technology Co ltd
Original Assignee
Hangzhou Guangxin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Guangxin Technology Co ltd
Priority to CN202011274274.1A
Publication of CN112380406A
Application granted
Publication of CN112380406B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The invention discloses a real-time network traffic classification method based on crawler technology. The method comprises: obtaining a data source and a keyword feature library, wherein the keyword feature library comprises a threshold and the data source comprises a feature vector and a general feature value; calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum; and, when the weight sum is greater than the threshold, obtaining the general feature value in the data source and classifying real-time network traffic based on the general feature value. A carefully designed crawler algorithm filters and screens Internet data of a specific type, extracts the feature information of the objects to be classified, and updates that information into the database in real time. Network traffic can then be classified rapidly by matching these features after only simple parsing of the messages, which improves the real-time performance of network traffic classification while also ensuring its accuracy.

Description

Real-time network traffic classification method based on crawler technology
Technical Field
The invention relates to the field of data identification, in particular to a real-time network traffic classification method based on a crawler technology.
Background
With the development of network technology, networks play an increasingly important role in daily production and life. At the same time, the defense of cyberspace security and malicious network attack activity are locked in a continuous contest, so network attacks such as Trojan horses, computer worms, and denial of service are increasingly frequent and seriously affect people's normal use of the network. Network traffic identification technology is the basis of network security and plays an important role in keeping the network operating properly and maintaining information security. On the one hand, accurate traffic identification reduces unnecessary network connections and so lowers the risk of network attack. On the other hand, traffic identification lets network managers allocate network resources reasonably and effectively and provide better network service. Network traffic identification technology dates from the birth of the Internet and has developed from simple to complex as people's awareness of network security has grown.
The widely used DPI technology based on pattern matching and DFI technology based on flow statistical characteristics and machine learning algorithms both suffer from the difficulty of manually labeling a large number of samples and extracting identification features. In addition, with today's large-scale network data it is hard to strike a good balance between the real-time performance and the accuracy of network traffic identification, and a single identification technology can hardly meet the requirements of current high-speed, complex networks.
Disclosure of Invention
The invention provides a real-time network traffic classification method based on crawler technology, and aims to solve the problem in the prior art that, when a single identification technology is adopted, real-time network traffic is difficult to identify and classify accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a real-time network traffic classification method based on a crawler technology, which comprises the following steps:
acquiring a data source and a keyword feature library, wherein the keyword feature library comprises a threshold value, and the data source comprises a feature vector and a general feature value;
calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum;
and when the weight sum is larger than the threshold value, acquiring the general characteristic value in the data source, and classifying the real-time network traffic based on the general characteristic value.
In this method, all feature vectors in the data source are acquired, keywords are acquired from a database, and the feature vectors are matched against the keywords to obtain the successfully matched feature vectors. The weights and thresholds corresponding to the matched keywords are then acquired from the database, and a weight sum is calculated from the matched feature vectors according to the factor weighted sum calculation method. The weight sum is compared with the threshold; if it is greater than the threshold, the data source is considered to belong to the corresponding classification, and the general feature values in the data source are acquired to establish a feature database, which is then used to classify network traffic.
Preferably, the acquiring a data source and a keyword feature library, wherein the keyword feature library comprises a threshold and the data source comprises a feature vector and a general feature value, includes:
acquiring the feature vector in the data source according to a crawler technology.
Preferably, the calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum includes:
acquiring the keyword feature library, wherein the keyword feature library also comprises keywords and weights;
traversing and matching the data source and the keywords to obtain the matched feature vectors;
and inputting the feature vector into the calculation formula Σ(aw + b) > θ to obtain the weight sum and compare it with θ, wherein w is the weight, θ is the threshold, a is the scale coefficient corresponding to the feature vector, and b is an initial estimated value.
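The weighted-sum test in the step above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `Match` record, the keyword names, and every numeric value are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One keyword matched in the data source (hypothetical record type)."""
    keyword: str
    w: float        # weight stored with the keyword
    a: float        # scale coefficient corresponding to the feature vector
    b: float = 0.0  # initial estimated value


def weighted_sum(matches):
    """Return sum(a * w + b) over all matched keywords."""
    return sum(m.a * m.w + m.b for m in matches)


def exceeds_threshold(matches, theta):
    """Apply the strict test sum(a * w + b) > theta."""
    return weighted_sum(matches) > theta


# Hypothetical matches; none of these numbers come from the patent.
matches = [
    Match("keyword_1", w=0.5, a=1.0),
    Match("keyword_2", w=0.25, a=1.0),
    Match("keyword_3", w=0.5, a=0.5),
]
print(weighted_sum(matches))             # 0.5 + 0.25 + 0.25
print(exceeds_threshold(matches, 0.75))  # strict comparison against theta
```

The comparison is strict, so a weight sum exactly equal to θ does not classify the data source.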
Preferably, the obtaining the general feature value in the data source when the weight sum is greater than the threshold, and classifying the real-time network traffic based on the general feature value, includes:
when the weight sum is larger than the threshold value, classifying the data source;
acquiring the general characteristic values in the data source, and establishing a characteristic database according to the general characteristic values;
classifying the real-time network traffic based on the feature database;
and when the weight sum is less than or equal to the threshold value, returning to the keyword feature library for matching again.
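The branch described in these steps (classify and record the general feature values when the weight sum exceeds the threshold, otherwise send the data source back for matching) might look as follows. The dictionary layout, the function name `build_feature_db`, the category label, and the use of a queue for re-matching are all assumptions made for illustration; the addresses are documentation-reserved placeholders.

```python
from collections import deque


def build_feature_db(scored_sources, theta, category):
    """Split scored data sources into feature-database rows and a re-match queue.

    scored_sources: iterable of (source, weight_sum) pairs, where each source
    is a dict with a "general" entry holding its general feature values.
    """
    feature_db = []
    rematch_queue = deque()
    for source, weight_sum in scored_sources:
        if weight_sum > theta:
            # Source belongs to the classification: store its general features.
            feature_db.append({"category": category, **source["general"]})
        else:
            # Weight sum <= theta: return the source for matching again.
            rematch_queue.append(source)
    return feature_db, rematch_queue


# Hypothetical data sources with placeholder addresses.
site_a = {"general": {"ip": "192.0.2.10", "mac": "00:00:5e:00:53:01",
                      "port": 443, "domain": "a.example"}}
site_b = {"general": {"ip": "192.0.2.20", "mac": "00:00:5e:00:53:02",
                      "port": 443, "domain": "b.example"}}

feature_db, rematch_queue = build_feature_db(
    [(site_a, 1.4), (site_b, 0.9)], theta=0.9, category="hypothetical_category"
)
```

Here `site_b` scores exactly θ, so it lands in the re-match queue rather than in the feature database.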
A real-time network traffic classification device based on crawler technology comprises:
an acquisition module: configured to obtain a data source and a keyword feature library, wherein the keyword feature library comprises a threshold, and the data source comprises a feature vector and a general feature value;
a processing module: configured to calculate the feature vector according to a factor weighted sum calculation method to obtain a weight sum;
a classification module: configured to obtain the general feature value in the data source when the weight sum is greater than the threshold, and to classify the real-time network traffic based on the general feature value.
Preferably, the acquiring module specifically includes:
a first acquisition unit: configured to obtain the feature vector in the data source according to a crawler technology.
Preferably, the processing module specifically includes:
a third acquisition unit: configured to acquire the keyword feature library, wherein the keyword feature library also comprises keywords and weights;
a matching unit: configured to traverse and match the data source and the keywords to obtain the matched feature vectors;
a calculation unit: configured to input the feature vector into the calculation formula Σ(aw + b) > θ to obtain the weight sum and compare it with θ, wherein w is the weight, θ is the threshold, a is the scale coefficient corresponding to the feature vector, and b is an initial estimated value.
Preferably, the classification module includes:
a first classification unit: configured to classify the data source when the weight sum is greater than the threshold;
an establishing unit: configured to acquire the general feature values in the data source and establish a feature database according to the general feature values;
a second classification unit: configured to classify the real-time network traffic based on the feature database;
a matching subunit: configured to return to the keyword feature library for matching again when the weight sum is less than or equal to the threshold.
An electronic device comprising a memory and a processor, the memory being used for storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the real-time network traffic classification method based on crawler technology according to any one of the above.
A computer-readable storage medium storing a computer program which, when executed, causes a computer to implement the real-time network traffic classification method based on crawler technology according to any one of the above.
The invention has the following beneficial effects:
A carefully designed crawler algorithm filters and screens Internet data of a specific type, extracts the feature information of the objects to be classified, and updates that information into a database in real time. Network traffic can then be classified rapidly by matching these features after only simple parsing of the messages, which improves the real-time performance of network traffic classification while also ensuring its accuracy.
Drawings
FIG. 1 is a first flowchart of a real-time network traffic classification method based on crawler technology according to an embodiment of the present invention;
FIG. 2 is a second flowchart of a real-time network traffic classification method based on crawler technology according to an embodiment of the present invention;
FIG. 3 is a third flowchart of a real-time network traffic classification method based on crawler technology according to an embodiment of the present invention;
FIG. 4 is a fourth flowchart of a method for real-time network traffic classification based on crawler technology according to an embodiment of the present invention;
FIG. 5 is a flowchart of a specific embodiment of a real-time network traffic classification method based on a crawler technology according to the present invention;
FIG. 6 is a schematic diagram of a real-time network traffic classification device based on a crawler technology according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an acquisition module of a real-time network traffic classification apparatus based on a crawler technology according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a processing module for implementing a real-time network traffic classification apparatus based on crawler technology according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a classification module of a real-time network traffic classification apparatus based on a crawler technology according to an embodiment of the present invention;
FIG. 10 is a flowchart illustrating an embodiment of a real-time network traffic classification apparatus based on a crawler technology according to the present invention;
FIG. 11 is a schematic diagram of an electronic device implementing a real-time network traffic classification apparatus based on a crawler technology according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following is a practical example of an embodiment:
Take gambling websites as an example. Because of the nature of such websites, most of them use https-encrypted messages, so the whole process is divided into two parts. The first part is active learning: a crawler crawls a gambling website to obtain its keywords (such as 'jinsha', 'sun city', 'Venice game', 'dragon city', 'subsatellite' and the like), the corresponding w, a and b are obtained from the keyword database according to those keywords, and the weighted sum is compared with θ. If the sum is greater than θ, the website is considered a gambling website, and the general feature values corresponding to the website (ip, mac, port and domain name) are written into the feature database for classification.
The second part is classification. Once the feature database is established, the messages in the network traffic can be classified: information such as ip, mac, port and domain name is extracted from each message and compared with the corresponding features in the feature database, and when all of them match, the traffic can be classified directly.
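The second part can be illustrated with a short Python sketch in which the feature database is indexed by the four plaintext fields so that each message needs only one lookup. The field names, the category label, the helper names `build_index` and `classify_packet`, and the sample addresses are assumptions for illustration; extracting the fields from a real message is outside this sketch.

```python
FIELDS = ("ip", "mac", "port", "domain")


def build_index(feature_db):
    """Index feature-database rows by their (ip, mac, port, domain) tuple."""
    return {tuple(row[f] for f in FIELDS): row["category"] for row in feature_db}


def classify_packet(packet, index):
    """Return the category when all four fields match a stored row, else None."""
    key = tuple(packet.get(f) for f in FIELDS)
    return index.get(key)


# Hypothetical feature database with placeholder addresses.
feature_db = [
    {"category": "hypothetical_category", "ip": "192.0.2.10",
     "mac": "00:00:5e:00:53:01", "port": 443, "domain": "a.example"},
]
index = build_index(feature_db)

matching = {"ip": "192.0.2.10", "mac": "00:00:5e:00:53:01",
            "port": 443, "domain": "a.example"}
other_port = dict(matching, port=8443)

print(classify_packet(matching, index))    # all four fields match
print(classify_packet(other_port, index))  # one field differs, so no match
```

Requiring every field to match follows the "all of them match" condition in the text; a deployment that tolerates partial matches would need a different lookup.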
Example 1
As shown in fig. 1, a real-time network traffic classification method based on a crawler technology includes the following steps:
S110, acquiring a data source and a keyword feature library, wherein the keyword feature library comprises a threshold value, and the data source comprises a feature vector and a general feature value;
S120, calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum;
S130, when the weight sum is larger than the threshold value, the general feature value in the data source is obtained, and real-time network traffic is classified based on the general feature value.
A carefully designed crawler algorithm filters and screens Internet data of a specific type, extracts the feature information of the objects to be classified, and updates that information into a database in real time. Network traffic can then be classified rapidly by matching these features after only simple parsing of the messages, which improves the real-time performance of network traffic classification while also ensuring its accuracy. Because a good classification result can be obtained by extracting only a small number of key fields, the throughput of the network traffic classification function is significantly improved and the real-time requirement of traffic classification is better met.
Example 2
As shown in fig. 2, a real-time network traffic classification method based on a crawler technology includes:
S210, acquiring a data source and a keyword feature library, wherein the keyword feature library comprises a threshold value, and the data source comprises a feature vector and a general feature value;
S220, acquiring the feature vector in the data source according to a crawler technology;
S230, calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum;
S240, when the weight sum is larger than the threshold value, the general feature value in the data source is obtained, and real-time network traffic is classified based on the general feature value.
As can be seen from embodiment 2, obtaining the feature data with crawler technology makes the obtained keywords more comprehensive. The active learning algorithm of the crawler technology obtains keywords from the data source while the crawler crawls data, compares them with the preset keywords in the database, and computes a total weight score from the matched keywords to identify and classify the crawled network feature data.
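The keyword comparison step can be sketched with Python's standard `html.parser` module. Fetching the page is omitted, so the sketch starts from an HTML string; the `TextExtractor` class, the case-insensitive substring match, and the sample page are assumptions for illustration, and the keywords are the example keywords quoted in this document.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def page_text(html):
    """Return the page's visible text, lower-cased, with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split()).lower()


def match_keywords(html, keywords):
    """Return the preset keywords that occur in the page text (substring match)."""
    text = page_text(html)
    return [k for k in keywords if k.lower() in text]


# Hypothetical crawled page.
html = ("<html><head><style>.x{}</style></head><body><h1>Sun City</h1>"
        "<script>var s = 'dragon city';</script>"
        "<p>Welcome to the Venice game</p></body></html>")
matched = match_keywords(html, ["sun city", "dragon city", "Venice game"])
```

In this sample, 'dragon city' appears only inside a script block, so it is not counted as page text.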
Example 3
As shown in fig. 3, a real-time network traffic classification method based on a crawler technology includes:
S310, acquiring a data source and a keyword feature library, wherein the keyword feature library comprises a threshold value, and the data source comprises a feature vector and a general feature value;
S320, acquiring the keyword feature library, wherein the keyword feature library also comprises keywords and weights;
S330, traversing and matching the data source and the keywords to obtain matched feature vectors;
S340, inputting the feature vector into the calculation formula Σ(aw + b) > θ to obtain the weight sum and compare it with θ, wherein w is the weight, θ is the threshold, a is the scale coefficient corresponding to the feature vector, and b is an initial estimated value;
S350, when the weight sum is larger than the threshold value, the general feature value in the data source is obtained, and real-time network traffic is classified based on the general feature value.
In embodiment 3, the crawler algorithm uses a factor weighted summation calculation method, that is, Σ(aw + b) > θ. Keywords, weights, and thresholds are pre-stored on the basis of historical experience. The data obtained when the crawler crawls a data source is compared with the keywords, a weighted calculation is performed for each matched keyword, and finally the weights of all matched keywords are summed and the sum is checked against the threshold of a given classification. Each keyword corresponds to a classification and has a corresponding weight and threshold (the weights of different keywords are not necessarily the same), so a keyword that corresponds to multiple classifications has multiple weights and thresholds. When a data source is determined to belong to a classification, the plaintext part of the data source is extracted as feature values, such as ip, mac, port, and http domain name (both encrypted and unencrypted data packets have plaintext fields).
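Because one keyword may carry a different weight in each classification, the keyword library can be pictured as a per-classification table. The sketch below assumes one threshold per classification, a single scale coefficient `a` and offset `b` for all keywords, and invented category names and numbers; the text does not specify any of these storage details.

```python
# Hypothetical keyword library: "kw2" belongs to two classifications and
# carries a different weight in each.
KEYWORD_LIBRARY = {
    "category_a": {"theta": 1.0, "weights": {"kw1": 0.75, "kw2": 0.5}},
    "category_b": {"theta": 0.5, "weights": {"kw2": 0.25, "kw3": 0.5}},
}


def categories_for(matched, library, a=1.0, b=0.0):
    """Return every classification whose weight sum over the matched
    keywords strictly exceeds that classification's threshold."""
    result = []
    for name, entry in library.items():
        total = sum(a * w + b
                    for kw, w in entry["weights"].items() if kw in matched)
        if total > entry["theta"]:
            result.append(name)
    return result


print(categories_for({"kw1", "kw2"}, KEYWORD_LIBRARY))  # sums: 1.25 and 0.25
print(categories_for({"kw2", "kw3"}, KEYWORD_LIBRARY))  # sums: 0.5 and 0.75
```

The shared keyword "kw2" contributes 0.5 toward one classification and 0.25 toward the other, which is the situation the paragraph describes.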
Example 4
As shown in fig. 4, a real-time network traffic classification method based on a crawler technology includes:
S410, when the weight sum is larger than the threshold value, classifying the data source;
S420, acquiring the general feature values in the data source, and establishing a feature database according to the general feature values;
S430, classifying the real-time network traffic based on the feature database;
S440, returning to the keyword feature library for matching again when the weight sum is less than or equal to the threshold value.
In embodiment 4, the weight sum is checked against the threshold of a given classification. Each keyword corresponds to a classification and has a corresponding weight and threshold (the weights of different keywords are not necessarily the same), so a keyword that corresponds to multiple classifications has multiple weights and thresholds. When the weight sum is greater than the threshold and the data source is determined to belong to the classification, the plaintext part of the data source is extracted as feature values, such as ip, mac, port, and http domain name (both encrypted and unencrypted data packets have plaintext fields). When the weight sum is not greater than the threshold of the classification, the process returns to the keyword database for re-matching.
Example 5
As shown in fig. 5, one specific embodiment may be:
S510, acquiring a data source and a keyword feature library, wherein the keyword feature library comprises a threshold value, and the data source comprises a feature vector and a general feature value;
A carefully designed crawler algorithm filters and screens Internet data of a specific type, extracts the feature information of the objects to be classified, and updates that information into a database in real time, so that network traffic can be classified rapidly by matching these features after only simple parsing of the messages.
S520, calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum;
The crawler algorithm adopts a factor weighted summation calculation method, namely Σ(aw + b) > θ. Keywords, weights and thresholds are pre-stored on the basis of historical experience. The data acquired when the crawler crawls a data source is compared with the keywords, and a weighted calculation is performed for each matched keyword, where a is the scale coefficient corresponding to the feature vector: a = 0 means no match, a = 0.1 means a match with weak correlation, and a = 1 means a strong match. Finally, the weights of all matched keywords are summed and the sum is checked against the threshold of a given classification. Each keyword corresponds to a classification and has a corresponding weight and threshold (the weights of different keywords are not necessarily the same), so a keyword that corresponds to multiple classifications has multiple weights and thresholds. When a data source is determined to belong to a classification, the plaintext part of the data source is extracted as feature values, such as ip, mac, port, and http domain name (both encrypted and unencrypted data packets have plaintext fields).
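The three coefficient values named above (a = 0, 0.1 and 1) can be written as an enumeration so that each term a·w + b is computed from a match strength. How a match is judged weak or strong is not stated in the text, so the sketch takes the strength as an input; the names `MatchStrength` and `term` and the sample weight are hypothetical.

```python
from enum import Enum


class MatchStrength(Enum):
    """Scale coefficient a for each match outcome."""
    NONE = 0.0    # no match
    WEAK = 0.1    # match, but weak correlation
    STRONG = 1.0  # strong match


def term(strength, w, b=0.0):
    """Return one summand a * w + b of the weighted sum."""
    return strength.value * w + b


# With a = 0 the summand reduces to b alone.
print(term(MatchStrength.STRONG, w=0.5))
print(term(MatchStrength.WEAK, w=0.5))
print(term(MatchStrength.NONE, w=0.5, b=0.25))
```

A weak match therefore contributes one tenth of the keyword's weight, plus b, to the weight sum.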
S530, when the weight sum is larger than the threshold value, the general characteristic value in the data source is obtained, and real-time network traffic is classified based on the general characteristic value.
The weight sum is checked against the threshold of a given classification. Each keyword corresponds to a classification and has a corresponding weight and threshold (the weights of different keywords are not necessarily the same), so a keyword that corresponds to multiple classifications has multiple weights and thresholds. When a data source is determined to belong to a classification, the plaintext part of the data source is extracted as feature values, such as ip, mac, port, and http domain name (both encrypted and unencrypted data packets have plaintext fields). When the weight sum is not greater than the threshold of the classification, the process returns to the keyword database for re-matching.
Example 6
As shown in fig. 6, a real-time network traffic classification device based on the crawler technology includes:
the acquisition module 10: configured to obtain a data source and a keyword feature library, wherein the keyword feature library comprises a threshold value, and the data source comprises a feature vector and a general feature value;
the processing module 20: configured to calculate the feature vector according to a factor weighted sum calculation method to obtain a weight sum;
the classification module 30: configured to obtain the general feature value in the data source when the weight sum is greater than the threshold value, and to classify the real-time network traffic based on the general feature value.
One embodiment of the above apparatus may be: the acquisition module 10 obtains a data source and a keyword feature library; the processing module 20 takes the threshold, weights and keywords of the keyword feature library and the feature vector of the data source obtained by the acquisition module 10, and calculates a weight sum according to the factor weighted sum calculation method; and the classification module 30 compares the weight sum with the threshold to perform classification.
Example 7
As shown in fig. 7, an obtaining module 10 of a real-time network traffic classification apparatus based on a crawler technology includes:
the first acquisition unit 12: configured to obtain the feature vector in the data source according to a crawler technology.
One embodiment of the acquisition module 10 of the above apparatus may be: the first acquisition unit 12 acquires a feature vector in the data source.
Example 8
As shown in fig. 8, a processing module 20 of a real-time network traffic classification apparatus based on crawler technology includes:
the third acquisition unit 22: configured to acquire the keyword feature library, wherein the keyword feature library also comprises keywords and weights;
the matching unit 24: configured to traverse and match the data source and the keywords to obtain the matched feature vectors;
the calculation unit 26: configured to input the feature vector into the calculation formula Σ(aw + b) > θ to obtain the weight sum and compare it with θ, wherein w is the weight, θ is the threshold, a is the scale coefficient corresponding to the feature vector, and b is an initial estimated value.
One embodiment of the processing module 20 of the above apparatus may be: the third acquisition unit 22 obtains the keywords, weights, and thresholds in the keyword feature library, the matching unit 24 performs traversal matching on the data source and the keywords to obtain the matched feature vectors, and the calculation unit 26 performs the calculation based on the obtained keywords, weights, and thresholds.
Example 9
As shown in fig. 9, a classification module 30 of a real-time network traffic classification apparatus based on a crawler technology includes:
the first classification unit 32: configured to classify the data source when the weight sum is greater than the threshold;
the establishing unit 34: configured to acquire the general feature values in the data source and establish a feature database according to the general feature values;
the second classification unit 36: configured to classify the real-time network traffic based on the feature database;
the matching subunit 38: configured to return to the keyword feature library for matching again when the weight sum is less than or equal to the threshold.
One embodiment of the classification module 30 of the above apparatus may be: the first classification unit 32 classifies the data sources; after the data sources are classified, the establishing unit 34 obtains the general feature values in the data sources to establish a feature database; the second classification unit 36 classifies network traffic by using the feature database; and the matching subunit 38 returns to the keyword feature library for matching again when matching is not successful.
Example 10
As shown in fig. 10, one specific implementation may be:
S1010, acquiring a data source and a keyword feature library, wherein the keyword feature library comprises a threshold value, and the data source comprises a feature vector and a general feature value;
A carefully designed crawler algorithm filters and screens Internet data of a specific type, extracts the feature information of the objects to be classified, and updates that information into a database in real time, so that network traffic can be classified rapidly by matching these features after only simple parsing of the messages.
S1020, calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum;
The crawler algorithm adopts a factor weighted summation calculation method, namely Σ(aw + b) > θ. Keywords, weights and thresholds are pre-stored on the basis of historical experience. The data acquired when the crawler crawls a data source is compared with the keywords, and a weighted calculation is performed for each matched keyword, where a is the scale coefficient corresponding to the feature vector: a = 0 means no match, a = 0.1 means a match with weak correlation, and a = 1 means a strong match. Finally, the weights of all matched keywords are summed and the sum is checked against the threshold of a given classification. Each keyword corresponds to a classification and has a corresponding weight and threshold (the weights of different keywords are not necessarily the same), so a keyword that corresponds to multiple classifications has multiple weights and thresholds. When a data source is determined to belong to a classification, the plaintext part of the data source is extracted as feature values, such as ip, mac, port, and http domain name (both encrypted and unencrypted data packets have plaintext fields).
S1030, when the weight sum is larger than the threshold value, acquiring the general feature value in the data source, and classifying real-time network traffic based on the general feature value.
It is judged whether the summed keyword weights are larger than the threshold value of a given classification. Each keyword has a weight and a threshold value for the classification it corresponds to (the weights of different keywords are not necessarily the same), so a keyword that corresponds to several classifications has several weights and threshold values. When a data source is determined to belong to a classification, the feature values in the plaintext part of the data source, such as ip, mac, port, and http domain name, are extracted (encrypted and non-encrypted data packets both have plaintext fields). When the sum is judged not to be larger than the threshold value of the classification, the process returns to the keyword feature library for re-matching.
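One way to picture step S1030 is a lookup table keyed by plaintext fields. The Python sketch below is illustrative only: the `FeatureDatabase` class, the choice of (field, value) pairs as keys, the field names, and the sample addresses and domain names are assumptions made for the example and are not details given in this document.

```python
from typing import Dict, Optional, Tuple


class FeatureDatabase:
    """Maps plaintext feature values (ip, mac, port, domain) to a classification."""

    # Plaintext fields assumed to be readable in both encrypted and
    # non-encrypted packets; the field names are illustrative.
    FIELDS = ("ip", "mac", "port", "domain")

    def __init__(self) -> None:
        self._index: Dict[Tuple[str, str], str] = {}

    def add_source(self, label: str, source: Dict[str, str]) -> None:
        """Store the general feature values of a data source already classified as `label`."""
        for field in self.FIELDS:
            value = source.get(field)
            if value:
                self._index[(field, value)] = label

    def classify(self, flow: Dict[str, str]) -> Optional[str]:
        """Return the classification of the first matching field, or None if nothing matches."""
        for field in self.FIELDS:
            label = self._index.get((field, flow.get(field, "")))
            if label is not None:
                return label
        return None


db = FeatureDatabase()
# A data source whose weight sum exceeded the "video" threshold.
db.add_source("video", {"ip": "203.0.113.10", "domain": "video.example.com"})

print(db.classify({"ip": "198.51.100.7", "domain": "video.example.com"}))  # video
print(db.classify({"ip": "198.51.100.7", "domain": "other.example.net"}))  # None
```

Because the lookup touches only plaintext header fields, a flow can be labelled without decrypting or deeply inspecting its payload, which is the property the passage above relies on.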
Example 11
As shown in fig. 11, an electronic device comprises a memory 1101 and a processor 1102, wherein the memory 1101 is used for storing one or more computer instructions, and the one or more computer instructions are executed by the processor 1102 to implement a real-time network traffic classification method based on a crawler technology.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the electronic device described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
A computer-readable storage medium stores a computer program which, when executed by a computer, implements the real-time network traffic classification method based on crawler technology as described above.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 1101 and executed by the processor 1102 to implement the present invention. One or more modules/units may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used to describe the execution of a computer program in a computer device.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The computer device may include, but is not limited to, the memory 1101 and the processor 1102. Those skilled in the art will appreciate that this embodiment is merely an example of a computer device and does not limit it; the computer device may include more or fewer components, combine some of the components, or use different components. For example, it may also include input/output devices, network access devices, buses, etc.
The processor 1102 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 1101 may be an internal storage unit of the computer device, such as a hard disk or internal memory of the computer device. The memory 1101 may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device. Further, the memory 1101 may include both an internal storage unit and an external storage device of the computer device. The memory 1101 is used to store computer programs and other programs and data required by the computer device. The memory 1101 may also be used to temporarily store data that has been output or is to be output.
The above description is only an embodiment of the present invention, but the technical features of the present invention are not limited thereto, and any changes or modifications within the technical field of the present invention by those skilled in the art are covered by the claims of the present invention.

Claims (10)

1. A real-time network traffic classification method based on a crawler technology is characterized by comprising the following steps:
acquiring a data source and a keyword feature library, wherein the keyword feature library comprises a threshold value, and the data source comprises a feature vector and a general feature value;
calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum;
and when the weight sum is larger than the threshold value, acquiring the general feature value in the data source, and classifying the real-time network traffic based on the general feature value.
2. The method according to claim 1, wherein acquiring a data source and a keyword feature library, the keyword feature library comprising a threshold value and the data source comprising a feature vector and a general feature value, further comprises:
acquiring the feature vector in the data source by means of crawler technology.
3. The real-time network traffic classification method based on the crawler technology as recited in claim 1, wherein calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum comprises:
acquiring the keyword feature library, wherein the keyword feature library also comprises keywords and weights;
traversing and matching the data source and the keywords to obtain the matched feature vectors;
and inputting the feature vector into the calculation formula Σ(aw + b) > θ to obtain the weight sum, wherein w is the weight, θ is the threshold value, a is the scale coefficient corresponding to the feature vector, and b is an initial estimated value.
4. The method according to claim 1, wherein, when the weight sum is greater than the threshold value, acquiring the general feature value in the data source and classifying real-time network traffic based on the general feature value comprises:
when the weight sum is larger than the threshold value, classifying the data source;
acquiring the general feature values in the data source, and establishing a feature database according to the general feature values;
classifying the real-time network traffic based on the feature database;
and when the weight sum is less than or equal to the threshold value, returning to the keyword feature library for matching again.
5. A real-time network traffic classification device based on crawler technology, characterized by comprising:
an acquisition module: for acquiring a data source and a keyword feature library, wherein the keyword feature library comprises a threshold value, and the data source comprises a feature vector and a general feature value;
a processing module: for calculating the feature vector according to a factor weighted sum calculation method to obtain a weight sum;
a classification module: for acquiring the general feature value in the data source when the weight sum is greater than the threshold value, and classifying real-time network traffic based on the general feature value.
6. The device for real-time network traffic classification based on the crawler technology according to claim 5, wherein the acquisition module specifically comprises:
a first acquisition unit: for acquiring the feature vector in the data source by means of crawler technology.
7. The device for real-time network traffic classification based on the crawler technology according to claim 5, wherein the processing module specifically comprises:
a third acquisition unit: for acquiring the keyword feature library, wherein the keyword feature library further comprises keywords and weights;
a matching unit: for traversing and matching the data source and the keywords to obtain the matched feature vector;
a calculation unit: for inputting the feature vector into the calculation formula Σ(aw + b) > θ to obtain the weight sum, wherein w is the weight, θ is the threshold value, a is the scale coefficient corresponding to the feature vector, and b is an initial estimated value.
8. The real-time network traffic classification device based on the crawler technology as recited in claim 5, wherein the classification module comprises:
a first classification unit: for classifying the data source when the sum of weights is greater than the threshold;
an establishing unit: for acquiring the general feature values in the data source and establishing a feature database according to the general feature values;
a second classification unit: for classifying the real-time network traffic based on the feature database;
a matching subunit: for returning to the keyword feature library for re-matching when the weight sum is less than or equal to the threshold value.
9. An electronic device comprising a memory and a processor, the memory configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement a crawler technology-based real-time network traffic classification method according to any one of claims 1 to 4.
10. A computer-readable storage medium storing a computer program, wherein the computer program is configured to enable a computer to execute the method for real-time classification of network traffic based on crawler technology according to any one of claims 1 to 4.
CN202011274274.1A 2020-11-15 2020-11-15 Real-time network traffic classification method based on crawler technology Active CN112380406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011274274.1A CN112380406B (en) 2020-11-15 2020-11-15 Real-time network traffic classification method based on crawler technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011274274.1A CN112380406B (en) 2020-11-15 2020-11-15 Real-time network traffic classification method based on crawler technology

Publications (2)

Publication Number Publication Date
CN112380406A true CN112380406A (en) 2021-02-19
CN112380406B CN112380406B (en) 2022-11-18

Family

ID=74582506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011274274.1A Active CN112380406B (en) 2020-11-15 2020-11-15 Real-time network traffic classification method based on crawler technology

Country Status (1)

Country Link
CN (1) CN112380406B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104022920A (en) * 2014-06-26 2014-09-03 重庆重邮汇测通信技术有限公司 LTE (long term evolution) network flow recognition system and method
CN107465643A (en) * 2016-06-02 2017-12-12 国家计算机网络与信息安全管理中心 A kind of net flow assorted method of deep learning
US20180006912A1 (en) * 2016-06-30 2018-01-04 At&T Intellectual Property I, L.P. Methods and apparatus to identify an internet domain to which an encrypted network communication is targeted
CN107967311A (en) * 2017-11-20 2018-04-27 阿里巴巴集团控股有限公司 A kind of method and apparatus classified to network data flow

Also Published As

Publication number Publication date
CN112380406B (en) 2022-11-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant