Summary of the invention
The present invention provides a network behavior classification method and device. With the present invention, efficient and/or real-time network behavior classification can be achieved.
According to one embodiment of the present invention, a method for network behavior classification is provided, comprising: obtaining a first data set, each data entry in the first data set including feature data representing a network behavior and corresponding label information; clustering the feature data in the first data set to obtain one or more clusters; rejecting from the first data set, according to the one or more clusters, the data entries to which outlying feature data belong, so as to obtain a second data set; and training a classifier model using the feature data and corresponding label information included in the data entries of the second data set.
In one aspect, the method further comprises: extracting one or more network behavior features from raw network behavior data; determining the correlation between the one or more network behavior features; and generating the first data set, the first data set including feature data of network behavior features that have low correlation with one another.
In one aspect, rejecting from the first data set the data entries to which outlying feature data belong comprises: rejecting the data entries to which feature data deviating from the cluster center in the one or more clusters belong; or, if the quantity of feature data in a cluster is less than a threshold, discarding the data entries to which the feature data in that cluster belong.
In one aspect, the method further comprises: classifying collected user network behavior information using the trained classifier model.
In one aspect, training the classifier model comprises: determining whether the training of the classifier model has converged; and stopping the training of the classifier model when the training is determined to have converged.
In one aspect, the training is determined to have converged when one or more of the following occurs: the error of the classifier model reaches a threshold; the classification accuracy of the classifier model reaches a threshold; and the number of training iterations reaches a threshold.
In one aspect, the feature data includes one or more of the following: IP address, port, network protocol, egress traffic, ingress traffic, duration, and website information.
In one aspect, the classes of the classifier model include one or more of the following: browsing web pages, watching videos, downloading files, publishing information, abnormal transactions, network fraud, and hacker attacks.
According to another embodiment of the present invention, a device for network behavior classification is provided, comprising: a data acquisition module for obtaining a first data set, each data entry in the first data set including feature data representing a network behavior and corresponding label information; a data preprocessing module for clustering the feature data in the first data set to obtain one or more clusters, and for rejecting from the first data set, according to the one or more clusters, the data entries to which outlying feature data belong, so as to obtain a second data set; and a classification module for training a classifier model using the feature data and corresponding label information included in the data entries of the second data set.
In one aspect, the data acquisition module is further configured to: extract one or more network behavior features from raw network behavior data; determine the correlation between the one or more network behavior features; and generate the first data set, the first data set including feature data of network behavior features that have low correlation with one another.
In one aspect, the data preprocessing module rejecting from the first data set the data entries to which outlying feature data belong comprises: rejecting the data entries to which feature data deviating from the cluster center in the one or more clusters belong; or, if the quantity of feature data in a cluster is less than a threshold, discarding the data entries to which the feature data in that cluster belong.
In one aspect, the classification module is configured to: classify collected user network behavior information using the trained classifier model.
In one aspect, the classification module is configured to: determine whether the training of the classifier model has converged; and stop the training of the classifier model when the training is determined to have converged.
In one aspect, the training is determined to have converged when one or more of the following occurs: the error of the classifier model reaches a threshold; the classification accuracy of the classifier model reaches a threshold; and the number of training iterations reaches a threshold.
In one aspect, the feature data includes one or more of the following: IP address, port, network protocol, egress traffic, ingress traffic, duration, and website information.
In one aspect, the classes of the classifier model include one or more of the following: browsing web pages, watching videos, downloading files, publishing information, abnormal transactions, network fraud, and hacker attacks.
According to another embodiment of the present invention, a system for network behavior classification is provided, comprising: a processor; and a memory for storing processor-executable instructions, wherein the processor is configured to execute the processor-executable instructions to implement the method described above.
According to one aspect of the present invention, applying feature screening and feature correlation detection to the initially collected network behavior feature data makes it possible to reject irrelevant and/or redundant feature information. This not only prevents redundant features from interfering with one another but also reduces the dimensionality of the data features, laying the foundation for efficient subsequent data processing.
According to another aspect of the present invention, unsupervised clustering removes the interference of irrelevant or erroneous data with the subsequent classification algorithm, which significantly reduces the volume of data processed by the network behavior classification algorithm and improves the accuracy of the subsequent network behavior classification.
According to another aspect of the present invention, a suitable convergence point is selected during classifier training, and the model result is returned as soon as the convergence point is reached. This can effectively reduce model over-fitting and improve the accuracy of the classification model.
Embodiments of the present invention may implement any one or more of the above technical features, so that the network behavior classification method and device of the present invention can solve the problem in the prior art that network behavior classification is inaccurate and/or not timely. For example, when a user is online, the user's network behavior can be identified promptly and accurately while the network activity is in progress, so that corresponding measures can be taken when needed.
Detailed description of the embodiments
The present invention is further described below with reference to specific embodiments and the accompanying drawings, which should not be construed as limiting the scope of protection of the present invention.
The present invention provides a network behavior classification method and device. On the one hand, the network behavior classification method and device of the present invention can accurately identify various user network behaviors, so that different regulatory measures can be correctly applied to different network behaviors. On the other hand, the network behavior classification method and device of the present invention can identify various user network behaviors in real time, so that harmful network behavior can be effectively blocked or intervened in while it is occurring. The network behavior classification method and device according to the present invention can therefore solve either of the two problems of accuracy and timeliness in network behavior classification, or both at once, and have broad applicability.
Fig. 1 is a flow chart of a network behavior classification method according to an embodiment of the present invention.
Step 102: Extract network behavior features. When a user carries out a network activity (that is, a network behavior), a wide variety of related information (for example, attributes) is generated, such as information related to transactions, browsing web pages, watching videos, and so on. Where the relevant network behavior information can be collected, useful network behavior feature data can be extracted from it, such as user information, transaction amount, information about the device used, network traffic, and so on.
Step 104: Preprocess the extracted network behavior features to obtain a data set. Data preprocessing may include data cleaning, missing-value handling, data transformation, and so on, so that the extracted network behavior feature data meets the requirements of subsequent processing.
Step 106: Use a classifier to classify the data set obtained after data preprocessing. The classifier can be used to distinguish the network behavior class represented by the network behavior features, for example whether a transaction is normal, whether fraud is present, or which class of activity the user is carrying out online (for example, watching videos, browsing web pages, sending messages, and so on).
Step 108: Evaluate the classifier model. If the model passes the evaluation (for example, it meets the evaluation metrics), the network behavior classification process ends; otherwise the process returns to step 104.
The operation of the classifier includes a classifier training stage and a classifier application stage. In the training stage, steps 102-108 above are applied to a historical data set to train one or more classifier models suited to the target task; the models are validated and evaluated offline, and the better classifier model is then selected according to the evaluation metrics. In the application stage, steps 102-108 can likewise be used to feed newly collected data into the trained classifier model, which then outputs a classification result. The new data and classification results can also be used to further evaluate the classifier model (that is, online evaluation). The classifier according to the present invention can perform binary classification as well as multi-class classification.
The steps of the classifier training stage are described in detail below with reference to Figs. 2-4.
Fig. 2 is a schematic diagram of a network behavior feature extraction method according to an embodiment of the present invention.
Step 201, collect the user's raw network behavior data. While online, a user generates various kinds of network behavior information, such as IP address, port, network protocol, egress traffic, ingress traffic, duration, website information, and so on. These network behavior data of the user can be collected to form the raw network behavior data.
Step 202, extract one or more network behavior features from the raw network behavior data to form a feature data set. Continuing the network behavior example above, the feature information that is representative of the network behavior can be screened out. During screening, features that can represent the network behavior are retained, and information unrelated to network behavior classification is rejected. For example, ingress and egress traffic information may indicate what kind of network behavior is occurring (for example, browsing web pages, watching videos, downloading files, publishing information, and so on), whereas port information may have no direct relationship with the specific class of the network behavior. Accordingly, the ingress and egress traffic information can be retained from the raw network behavior data and the port information rejected to form the feature data set.
Step 203, perform correlation analysis on the feature data set and reject redundant feature data. Similar feature attributes are highly correlated and may be redundant for the subsequent network behavior classification, so it is possible to retain only one of them or to merge several correlated features. For example, a user's age and date of birth are highly correlated, so one of the two features can be retained, or the two can be merged into a new feature. As another example, network traffic per second and network traffic per 2 seconds are highly correlated, so one of the two features can be retained, or the two can be merged into a new feature (for example, by averaging). Rejecting redundant features yields a reduced feature data set, which includes feature data of network behavior features that have low correlation with one another.
Step 204, determine whether the correlation between the retained network behavior features meets the requirement (for example, the correlation is less than a threshold). If so, output the reduced feature data set; otherwise return to step 202.
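Steps 203 and 204 can be illustrated with a minimal sketch. It assumes that correlation is measured with the Pearson coefficient and that redundant features are dropped greedily in a single pass, which is one possible reading of the steps rather than the method fixed by this description. The feature names, the sample values, and the 0.9 threshold are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to anything
    return cov / sqrt(var_x * var_y)

def prune_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns: per-second and per-2-second traffic are nearly
# proportional, so only the first of the two survives.
features = {
    "bytes_per_1s": [10, 20, 30, 40, 50],
    "bytes_per_2s": [21, 39, 61, 80, 99],
    "duration": [5, 1, 4, 2, 3],
}
print(prune_correlated(features))  # ['bytes_per_1s', 'duration']
```

Because a feature is admitted only when it is weakly correlated with everything already kept, the pairwise check of step 204 holds by construction here; an implementation that merges features instead of dropping them would need the explicit loop back to step 202.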
According to one aspect of the present invention, applying feature screening and feature correlation detection to the initially collected network behavior feature data makes it possible to reject irrelevant and/or redundant feature information and to ensure that the extracted features have very low correlation with one another. This not only reduces the volume of data to be processed, prevents redundant features from interfering with one another, and improves the reliability of the data analysis results, but also reduces the dimensionality of the data features, laying the foundation for efficient subsequent data processing.
By contrast, if the collected raw attributes are used directly as features for classification without feature screening and/or feature correlation detection, those attributes may contain redundant and invalid information, which leads to inaccurate classification results. According to the present invention, feature extraction and screening and correlation detection are applied to the raw network behavior data obtained, so that the resulting network behavior feature information represents the network behavior that occurred more effectively, thereby improving the accuracy of the subsequent classification.
Fig. 3 is a schematic diagram of a data preprocessing method according to an embodiment of the present invention.
Step 301, obtain a feature data set, each data entry of which may include feature data representing a network behavior and corresponding label information. For example, the feature data set generated in step 202 or the reduced feature data set generated in steps 203 and 204 can be obtained. In one embodiment, for historical network behavior data, the feature data set can be labeled manually to determine which class each user network behavior belongs to (that is, the label information), for example browsing web pages, watching videos, downloading files, publishing information (for example, publishing duplicate information or false advertising), abnormal transactions, network fraud, hacker attacks, and so on.
Step 302, select and initialize a clustering algorithm. For example, an unsupervised clustering algorithm can be used for data preprocessing. By way of non-limiting example, the clustering algorithm may be the K-MEANS (K-means) algorithm, the BIRCH algorithm, the DBSCAN algorithm, and so on.
Step 303, cluster the feature data in the feature data set to obtain one or more clusters. The cluster centers and parameters can be determined according to the selected clustering algorithm. Clustering can be unsupervised learning, that is, it does not require the class of the data, or in other words the label information, to be specified. Therefore, the clustering process can use only the feature data in the feature data set, without using the label information.
Step 304, determine whether the clustering has converged. If it has converged (that is, one or more clusters have been successfully obtained), the process advances to step 305; otherwise it returns to step 303 and the clustering is performed again.
Step 305, reject the data entries to which outlying feature data belong to obtain a new data set. For example, the distance from each feature data item in each cluster to its cluster center can be obtained, and the data entries to which feature data that deviate markedly from the cluster center belong can be rejected. In another embodiment, the data entries to which the feature data in small clusters belong can also be discarded; for example, if the quantity of feature data in a cluster is less than a threshold, the data entries to which the feature data in that cluster belong can be discarded.
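Steps 302 to 305 can be sketched as follows, using K-means as the clustering algorithm. Several choices here are assumptions made only for illustration: the deterministic farthest-point seeding, the rule that "deviates markedly" means a distance greater than `factor` times the median distance within the cluster, and the default values of `factor` and `min_cluster_size`. The sample entries are hypothetical.

```python
from math import dist
from statistics import median

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def k_means(points, k, max_iter=100):
    """Plain K-means; returns (assignments, centers)."""
    centers = farthest_point_init(points, k)
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_assign == assign:  # step 304: assignments are stable, clustering has converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign, centers

def reject_outliers(entries, k, factor=3.0, min_cluster_size=2):
    """entries: list of (features, label). Labels are ignored while clustering (step 303)."""
    points = [features for features, _ in entries]
    assign, centers = k_means(points, k)
    kept = []
    for j in range(k):
        members = [i for i, a in enumerate(assign) if a == j]
        if not members or len(members) < min_cluster_size:
            continue  # step 305, second rule: drop every entry of an undersized cluster
        distances = {i: dist(points[i], centers[j]) for i in members}
        cutoff = factor * median(distances.values())
        # step 305, first rule: drop entries far from their cluster center
        kept.extend(i for i in members if distances[i] <= cutoff)
    return [entries[i] for i in sorted(kept)]

# Hypothetical (ingress, egress) traffic features with manual labels.
entries = [
    ((1.0, 1.0), "browse"), ((1.2, 0.9), "browse"), ((0.9, 1.1), "browse"), ((1.1, 1.0), "browse"),
    ((8.0, 8.0), "video"), ((8.1, 7.9), "video"), ((7.9, 8.2), "video"), ((8.2, 8.1), "video"),
    ((1.0, 4.0), "browse"),  # abnormally high egress traffic for a browsing entry
]
cleaned = reject_outliers(entries, k=2)
print(len(cleaned))  # 8
```

With `k=2` the abnormal entry is removed by the distance rule; with `k=3` it forms a one-member cluster and is removed by the cluster-size rule instead. The median-based cutoff is used because, in a very small cluster, a single outlier inflates the mean and standard deviation enough to hide itself.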
Outlying feature data may be dirty or invalid data, which can affect the accuracy of subsequent classification. After clustering, each feature data item is therefore associated with its own cluster, and the dirty and invalid data are then weeded out. Doing so largely avoids the adverse effects of dirty data.
If unsupervised clustering is not used to reject some of the dirty and invalid data, classifier training is affected, and so is the accuracy of the resulting classifier. In the present invention, by contrast, an unsupervised clustering algorithm is applied to the feature data to remove the interference of irrelevant or invalid data with the subsequent classification, which reduces the volume of data to be classified and improves processing efficiency. It also reduces the influence of dirty data on the classification algorithm and can improve the accuracy of user network behavior classification. As a non-limiting example, when a user reads news, the traffic should be small, but images and videos on the web page generate high traffic, so that the traffic of that user's network behavior deviates from the normal level. The clustering algorithm allows such abnormally high-traffic behavior to be rejected as dirty data, so that it does not affect the accuracy of later judging from traffic whether a behavior is news reading.
Fig. 4 is a schematic diagram of a classifier training method according to an embodiment of the present invention. More specifically, Fig. 4 shows the process of training a classifier using a historical data set. For example, the data set obtained by the data preprocessing shown in Fig. 3 can be used as the historical data set for training the classifier.
Step 401, divide the historical data set into a training set and a test set. The historical data set may include multiple data entries (for example, 1000 entries, each corresponding to one network behavior of one user), and each entry includes feature data representing the network behavior and corresponding label information (for example, whether it is browsing a web page, reading news, watching a video, publishing false information, and so on). The label information of each entry may be obtained by manual labeling. As a non-limiting example, 80% of the data entries in the historical data set can be used as the training set and 20% as the test set.
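The 80/20 division in step 401 can be sketched as below. Shuffling before the split and the fixed seed are assumptions added for reproducibility; the description does not say how entries are assigned to the two sets. The entry contents are hypothetical.

```python
import random

def split_dataset(entries, train_fraction=0.8, seed=0):
    """Shuffle a copy of the entries and cut it into (training set, test set)."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 1000 hypothetical entries of the form (features, label).
history = [((i,), "browse" if i % 2 else "video") for i in range(1000)]
train_set, test_set = split_dataset(history)
print(len(train_set), len(test_set))  # 800 200
```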
Step 402, select a classifier model and initialize its parameters. A suitable classifier model can be selected or designed according to need and practice, such as a classifier based on a linear function or a distance function, or a classifier based on a decision tree or a neural network. The present invention is not limited in this respect.
Step 403, train the classifier model using the training set. For example, the feature data and corresponding label information included in the data entries of the training set can be used to train the classifier model. Training a classifier model is an iterative loop of parameter learning and tuning, so that the classifier model finally obtained fits the data of the training set well.
Step 404, determine whether the training of the classifier model has converged. If it has converged, advance to step 405; otherwise return to step 403 and continue training. There are many ways to determine whether the training of the classifier model has converged; for example, it can be determined whether the training has reached a preset convergence threshold. By way of non-limiting example, the convergence threshold may be, for example:
(1) Error: if the error of the classifier model is less than a threshold, the training of the classifier model can be considered to have converged. The error represents the difference between the actual predicted output and the true output of a sample; for example, it can be determined using the least-squares method used in curve fitting.
(2) Accuracy: if the classification accuracy of the model reaches a threshold (for example, 90%), the training of the classifier model can be considered to have converged.
(3) Number of iterations: if the number of iterations (for example, of parameter tuning) performed with the training set reaches a threshold (for example, 50), the training of the classifier model can be considered to have converged.
One or more convergence thresholds can be selected, and their specific values set, according to need and concrete practice; they are not limited to the specific examples given above. For example, a combination of error and number of iterations can be used: as soon as either criterion is met, the training of the classifier model is considered to have converged and the process advances to step 405. In practice, the choice of convergence thresholds and/or their values can also be adjusted dynamically or in real time.
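The convergence check of step 404 can be sketched with a small training loop. The logistic-regression model, the use of mean squared error as the error measure, the learning rate, and the sample data are all assumptions chosen for brevity; the description leaves the model and the error measure open. Training stops as soon as any enabled criterion is met, and the function reports which one fired.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def train(samples, lr=0.5, error_threshold=0.05, accuracy_threshold=None, max_epochs=50):
    """samples: list of (features, label) with label 0 or 1.
    Returns (weights, bias, epochs_run, reason)."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for epoch in range(1, max_epochs + 1):
        # Step 403: one pass of batch gradient descent on the log-loss.
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(n):
                grad_w[i] += (p - y) * x[i]
            grad_b += p - y
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)

        # Step 404: evaluate the convergence criteria on the training set.
        probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in samples]
        error = sum((p - y) ** 2 for p, (_, y) in zip(probs, samples)) / len(samples)
        accuracy = sum((p >= 0.5) == (y == 1) for p, (_, y) in zip(probs, samples)) / len(samples)
        if error < error_threshold:
            return w, b, epoch, "error"
        if accuracy_threshold is not None and accuracy >= accuracy_threshold:
            return w, b, epoch, "accuracy"
    return w, b, max_epochs, "max_epochs"  # criterion (3): iteration limit reached

# Hypothetical one-feature samples: 0 = normal, 1 = abnormal.
samples = [((-2.0,), 0), ((-1.5,), 0), ((-1.0,), 0), ((1.0,), 1), ((1.5,), 1), ((2.0,), 1)]
_, _, epochs, reason = train(samples, accuracy_threshold=0.9)
print(epochs, reason)  # 1 accuracy
```

Returning the reason alongside the model makes it easy to see whether training ended because the model was good enough or merely because the iteration budget ran out, which is useful when thresholds are adjusted dynamically.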
Step 405, use the test set to test the accuracy of the classifier model. If the accuracy meets the standard (for example, 80%), the classifier has been trained successfully and the classifier model is saved; if not, return to step 402 to reselect and train a classifier model. As the training process shows, the classifier has a relatively high accuracy requirement: if the accuracy is below standard, the classifier model may not produce reasonable classification results in application, and a classifier model should be reselected and classification performed again.
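The acceptance gate of step 405, together with the return to step 402, can be sketched as a loop over candidate models. The two candidate models (a majority-class baseline and a nearest-class-mean classifier), their names, and the sample data are hypothetical and serve only to exercise the gate.

```python
from collections import Counter

def accuracy(predict, entries):
    """Fraction of entries whose predicted label equals the recorded label."""
    return sum(predict(features) == label for features, label in entries) / len(entries)

def select_model(candidates, train_set, test_set, required_accuracy=0.8):
    """candidates: list of (name, fit), where fit(train_set) returns a predict function.
    Returns (name, predict, score) for the first candidate that passes, else None."""
    for name, fit in candidates:
        predict = fit(train_set)              # steps 402-404 for this candidate
        score = accuracy(predict, test_set)   # step 405: measure on held-out data
        if score >= required_accuracy:
            return name, predict, score       # up to standard: keep this model
    return None                               # no candidate passed

def fit_majority(train_set):
    """Baseline: always predict the most common training label."""
    top = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda features: top

def fit_nearest_mean(train_set):
    """Predict the label whose mean first-feature value is closest."""
    groups = {}
    for features, label in train_set:
        groups.setdefault(label, []).append(features[0])
    means = {label: sum(v) / len(v) for label, v in groups.items()}
    return lambda features: min(means, key=lambda lab: abs(features[0] - means[lab]))

# Hypothetical traffic volumes with labels.
train_set = [((100,), "browse"), ((120,), "browse"), ((110,), "browse"), ((900,), "video"), ((950,), "video")]
test_set = [((130,), "browse"), ((870,), "video"), ((90,), "browse"), ((1000,), "video")]

name, predict, score = select_model(
    [("majority", fit_majority), ("nearest_mean", fit_nearest_mean)], train_set, test_set
)
print(name, score)  # nearest_mean 1.0
```

The baseline scores 0.5 on the test set and is rejected, so the loop moves on to the next candidate, mirroring the return to step 402.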
In the prior art, a trained classifier model may generalize poorly because of over-fitting, so that the actual classification accuracy is low. During classifier training according to the present invention, a suitable convergence threshold is selected, and the classifier model result is returned as soon as the convergence threshold is reached rather than letting the classifier train indefinitely. This effectively reduces the problem caused by over-fitting, in which accuracy on the training set is high while accuracy on the test set is low.
The classifier model obtained by the training of Fig. 4 can then be applied to newly collected network behavior data (that is, the application stage). In the application stage, user network behavior features can be extracted from the new network behavior or event in step 102, the necessary data preprocessing can be performed in step 104, and the preprocessed user network behavior feature data can be input into the classifier model in step 106, whereupon the classifier model outputs the type of the user network behavior. For example, when a user connects to the network, inputting the collected user network behavior information into the trained classifier model makes it possible to determine whether the user's network behavior is browsing web pages, watching videos, downloading files, publishing information, an abnormal transaction, network fraud, a hacker attack, and so on. Thus, when the classifier outputs a harmful network behavior class, appropriate measures can be taken to block the user's network behavior. The new network behavior features and classification results (combined with other feedback information, such as a later result confirming or refuting the classifier's judgment) can also be used in step 108 to further evaluate the classifier model. If the classifier model's fit to the new data degrades, the model can be retrained.
Fig. 5 is a block diagram of a network behavior classification device 500 according to an embodiment of the present invention. The network behavior classification device 500 may include a data acquisition module 501, a data preprocessing module 502, and a classification module 503. The data acquisition module 501 can be used to obtain (for example, receive or generate) a first data set, each data entry in the first data set including feature data representing a network behavior and corresponding label information. The data acquisition module 501 can also generate the first data set according to the method described above with reference to Fig. 2. For example, the data acquisition module 501 can extract one or more network behavior features from raw network behavior data, determine the correlation between the one or more network behavior features, and generate a first data set that includes network behavior features having low correlation with one another. For example, the attributes of the initially collected data undergo feature screening and feature correlation detection; if the correlation detection is not passed, feature screening continues until the correlation requirement is met. This effectively reduces the information redundancy caused by excessively high correlation between feature attributes.
The data preprocessing module 502 can be used to cluster the feature data in the first data set to obtain one or more clusters, and to reject from the first data set, according to the one or more clusters, the data entries to which outlying feature data belong, so as to obtain a second data set. For example, the data preprocessing module 502 can cluster the first data set according to the method described above with reference to Fig. 3 and reject the data entries to which feature data deviating from the cluster center in the one or more clusters belong, or, if the quantity of feature data in a cluster is less than a threshold, discard the data entries to which the feature data in that cluster belong. Clustering the feature data with an unsupervised clustering algorithm removes the interference of irrelevant or invalid data with the subsequent classification, which reduces the volume of data to be classified and improves processing efficiency. It also reduces the influence of dirty data on the classification algorithm and can improve the accuracy of user network behavior classification.
The classification module 503 can be used to train a classifier model using the second data set. The classification module 503 can train the classifier model according to the method described above with reference to Fig. 4; for example, it can determine whether the training of the classifier model has reached a convergence threshold and stop training the classifier model when the convergence threshold is reached. By selecting a suitable convergence threshold and returning the classifier model result as soon as that threshold is reached, the problem caused by over-fitting, in which accuracy on the training set is high while accuracy on the test set is low, can be effectively reduced. In addition, the trained classifier model can be used to classify newly collected user network behavior information.
The present invention proposes an efficient network behavior classification method that can reduce the performance overhead of the classifier model while maintaining accuracy. The network behavior classification method and device of the present invention may be especially suitable for identifying abnormal network behavior in real time. Conventional network behavior classification methods classify according to the traffic of network users and have the following disadvantages:
1. The collected raw attributes are used directly as features for classification; these attributes may contain redundant and invalid information, which leads to a large data volume and inaccurate classification results;
2. The feature data set may contain abnormal data, and a data set obtained by manual labeling may also contain inaccurate label information; such dirty and invalid data affect the classification results;
3. The trained classifier model may over-fit, resulting in very low actual classification accuracy.
Through the combination of the various technical features used in the present invention, various network behaviors can be identified accurately and in real time, and one or more of the above technical defects can be avoided, so that harmful network behavior can be effectively blocked or intervened in while it is occurring.
The steps and modules of the network behavior classification method and device described above can be implemented in hardware, software, or a combination thereof. If implemented in hardware, the various illustrative steps, modules, and circuits described in connection with the present invention can be implemented or executed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic component, a hardware component, or any combination thereof. The general-purpose processor may be a processor, microprocessor, controller, microcontroller, state machine, or the like. If implemented in software, the various illustrative steps and modules described in connection with the present invention can be stored on or transmitted over a computer-readable medium as one or more instructions or code. The software modules that implement the various operations of the present invention can reside in a storage medium such as RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, cloud storage, and so on. The storage medium can be coupled to a processor so that the processor can read information from and write information to the storage medium and execute the corresponding program modules to implement the steps of the present invention. Moreover, software-based embodiments can be uploaded, downloaded, or accessed remotely by suitable means of communication. Such suitable means of communication include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber-optic cable), magnetic communication, electromagnetic communication (including RF, microwave, and infrared communication), electronic communication, or other such means of communication.
It should also be noted that these embodiments may be described as processes depicted as flow charts, flow diagrams, structure diagrams, or block diagrams. Although a flow chart may describe the operations as a sequential process, many of these operations can be performed in parallel or concurrently. In addition, the order of the operations can be rearranged.
The disclosed methods, devices, and systems should not be construed as limiting in any way. Rather, the present invention covers all novel and non-obvious features and aspects of the various disclosed embodiments, both individually and in various combinations and sub-combinations with one another. The disclosed methods, devices, and systems are not limited to any specific aspect or feature or combination thereof, and no disclosed embodiment requires that any one or more specific advantages be present or that any specific or all technical problems be solved.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific embodiments described above, which are merely illustrative and not restrictive. Inspired by the present invention, those of ordinary skill in the art can make many further changes without departing from the purpose of the present invention and the scope protected by the claims, and all of these fall within the scope of protection of the present invention.