CN109936582A

CN109936582A - Method and device for constructing a malicious traffic detection model based on PU learning

Info

Publication number: CN109936582A
Application number: CN201910333902.XA
Authority: CN
Inventors: 王海; 涂威威
Original assignee: 4Paradigm Beijing Technology Co Ltd
Current assignee: 4Paradigm Beijing Technology Co Ltd
Priority date: 2019-04-24
Filing date: 2019-04-24
Publication date: 2019-06-25
Anticipated expiration: 2039-04-24
Also published as: CN109936582B

Abstract

The invention discloses a method and device for constructing a malicious traffic detection model based on PU learning. It relates to the field of network technology, and its main purpose is to construct a machine-learning-based model for detecting malicious traffic. The main technical solution of the invention is as follows: obtain traffic data as a sample data set; train multiple candidate models on the sample data set; construct an evaluation set from the sample data set; evaluate each candidate model separately according to the evaluation set and a preset evaluation condition to obtain an evaluation result for each candidate model; select the candidate models whose evaluation results satisfy a preset condition; and integrate the selected models according to a preset integration method to obtain the malicious traffic detection model. The invention is used to construct the malicious traffic detection model that is applied during malicious traffic detection.

Description

Method and device for constructing a malicious traffic detection model based on PU learning

Technical field

The present invention relates to the field of network technology, and in particular to a method and device for constructing a malicious traffic detection model based on PU learning, and to a malicious traffic detection method and device.

Background art

As network technology continues to develop and people's work and daily life become ever more closely tied to the network, network traffic is steadily increasing. Malicious traffic is commonly present in this traffic and can disrupt the normal operation of website databases or systems. Network attacks, traffic fraud, and malicious crawlers are common examples; such traffic typically intrudes on, interferes with, or scrapes business data or information by unauthorized means. Because of the harm malicious traffic causes, experts in the field increasingly emphasize its detection.

At present, existing malicious traffic detection is generally rule-based: for example, features of known malicious traffic are extracted and used as the criteria for examining and judging network traffic. Given today's massive volumes of network traffic, however, both the detection effectiveness and the detection efficiency of this approach depend too heavily on manual intervention, so handling such volumes consumes considerable human and material resources. Meanwhile, artificial intelligence technology has been developing steadily. Machine learning, an inevitable product of artificial intelligence research reaching a certain stage, is devoted to improving the performance of a system by computational means using experience. In computer systems, "experience" usually exists in the form of "data", and a machine learning algorithm can produce a "model" from data. That is, when empirical data is supplied to a machine learning algorithm, the algorithm generates a model from that data, and when the model faces a new situation it gives a corresponding judgment, namely a prediction result. Because existing detection methods struggle to meet current demands for malicious traffic detection, how to detect malicious traffic on the basis of machine learning has become an urgent problem in the industry.

Summary of the invention

In view of the above problems, the invention proposes a method and device for constructing a malicious traffic detection model based on PU learning. The main purpose is to provide a malicious traffic detection method that can be automated through machine learning, thereby reducing manual effort.

To achieve the above objective, the present invention mainly provides the following technical solutions:

In one aspect, the present invention provides a method for constructing a malicious traffic detection model based on PU learning, which specifically includes:

obtaining traffic data as a sample data set, the sample data set including positive sample data with a positive label and unlabeled sample data without a label, wherein the positive label denotes malicious traffic;

training multiple candidate models on the sample data set;

constructing an evaluation set from the sample data set;

evaluating each candidate model separately according to the evaluation set and a preset evaluation condition to obtain an evaluation result for each candidate model;

selecting the candidate models whose evaluation results satisfy a preset condition; and integrating the selected models according to a preset integration method to obtain the malicious traffic detection model.

Optionally, training multiple candidate models on the sample data set includes:

constructing multiple training sets from the sample data set;

selecting from a set of machine learning algorithms, a set of hyperparameter combinations, and the multiple training sets respectively, and training to obtain multiple candidate models, wherein one machine learning algorithm, one group of hyperparameters, and one training set determine one candidate model.

Optionally, constructing multiple training sets from the sample data set includes:

constructing one positive-sample training subset from at least part of the positive sample data in the sample data set, performing multiple sampling operations on the unlabeled sample data in the sample data set to construct multiple negative-sample training subsets, and combining the positive-sample training subset with each of the multiple negative-sample training subsets to obtain multiple training sets;

or,

constructing multiple positive-sample training subsets from at least part of the positive sample data in the sample data set, performing multiple sampling operations on the unlabeled sample data in the sample data set to construct multiple negative-sample training subsets, and combining each positive-sample training subset with the multiple negative-sample training subsets respectively to obtain multiple training sets.

Optionally, constructing an evaluation set from the sample data set includes:

sampling the positive sample data in the sample data set to construct a positive-sample evaluation subset, sampling the unlabeled sample data in the sample data set to construct a negative-sample evaluation subset, and combining the positive-sample evaluation subset and the negative-sample evaluation subset to obtain the evaluation set.

Optionally, constructing an evaluation set from the sample data set includes: constructing multiple evaluation sets from the sample data set, wherein each evaluation set includes positive sample data and unlabeled sample data serving as negative sample data;

evaluating each candidate model separately according to the evaluation set and the preset evaluation condition to obtain an evaluation result for each candidate model includes: for each candidate model, evaluating the candidate model against each of the multiple evaluation sets according to the preset evaluation condition to obtain multiple evaluation results, and merging the multiple evaluation results to obtain the final evaluation result for that candidate model.

Optionally, when the preset evaluation condition is the maximum-margin method, the evaluation result for each candidate model is the classification margin of that candidate model's prediction results on the evaluation set;

selecting the candidate models whose evaluation results satisfy the preset condition includes: selecting the candidate models whose classification margin is greater than a preset value.

Optionally, when the preset evaluation condition is the method of computing an AUC value, the evaluation result for each candidate model is the AUC value of that candidate model on the evaluation set;

selecting the candidate models whose evaluation results satisfy the preset condition includes: selecting the candidate models whose AUC value is greater than a preset value.

Optionally, integrating the selected models according to the preset integration method to obtain the malicious traffic detection model includes:

assigning a weight to each selected candidate model according to its evaluation result, and integrating the selected candidate models according to the weights.

In another aspect, the present invention provides a device for constructing a malicious traffic detection model based on PU learning, which specifically includes:

an acquiring unit, configured to obtain traffic data as a sample data set, the sample data set including positive sample data with a positive label and unlabeled sample data without a label, wherein the positive label denotes malicious traffic;

a training unit, configured to train multiple candidate models on the sample data set;

a construction unit, configured to construct an evaluation set from the sample data set;

an evaluation unit, configured to evaluate each candidate model separately according to the evaluation set and a preset evaluation condition to obtain an evaluation result for each candidate model;

a selection unit, configured to select the candidate models whose evaluation results satisfy a preset condition;

an integration unit, configured to integrate the selected models according to a preset integration method to obtain the malicious traffic detection model.

Optionally, the training unit includes:

a building module, configured to construct multiple training sets from the sample data set;

a training module, configured to select from a set of machine learning algorithms, a set of hyperparameter combinations, and the multiple training sets respectively, and to train to obtain multiple candidate models, wherein one machine learning algorithm, one group of hyperparameters, and one training set determine one candidate model.

Optionally, the building module includes:

a first building submodule, configured to construct one positive-sample training subset from at least part of the positive sample data in the sample data set, perform multiple sampling operations on the unlabeled sample data in the sample data set to construct multiple negative-sample training subsets, and combine the positive-sample training subset with each of the multiple negative-sample training subsets to obtain multiple training sets;

a second building submodule, configured to construct multiple positive-sample training subsets from at least part of the positive sample data in the sample data set, perform multiple sampling operations on the unlabeled sample data in the sample data set to construct multiple negative-sample training subsets, and combine each positive-sample training subset with the multiple negative-sample training subsets respectively to obtain multiple training sets.

Optionally, the construction unit is specifically configured to sample the positive sample data in the sample data set to construct a positive-sample evaluation subset, sample the unlabeled sample data in the sample data set to construct a negative-sample evaluation subset, and combine the positive-sample evaluation subset and the negative-sample evaluation subset to obtain the evaluation set.

Optionally, the construction unit is specifically configured to construct multiple evaluation sets from the sample data set, wherein each evaluation set includes positive sample data and unlabeled sample data serving as negative sample data;

the evaluation unit is specifically configured to, for each candidate model, evaluate the candidate model against each of the multiple evaluation sets according to the preset evaluation condition to obtain multiple evaluation results, and merge the multiple evaluation results to obtain the final evaluation result for that candidate model.

The selection unit is specifically configured to select the candidate models whose classification margin is greater than a preset value.

The selection unit is specifically configured to select the candidate models whose AUC value is greater than a preset value.

Optionally, the integration unit is specifically configured to assign a weight to each selected candidate model according to its evaluation result, and to integrate the selected candidate models according to the weights.

In another aspect, the present invention provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more computing devices, implements the above method for constructing a malicious traffic detection model based on PU learning.

In another aspect, the present invention provides a system including one or more computing devices and one or more storage devices, wherein a computer program is recorded on the one or more storage devices, and the computer program, when executed by the one or more computing devices, causes the one or more computing devices to implement the above method for constructing a malicious traffic detection model based on PU learning.

In yet another aspect, the present invention provides a malicious traffic detection method based on a PU learning model, including:

obtaining traffic data to be detected;

constructing a malicious traffic detection model according to the method of any one of the foregoing first aspect;

detecting the traffic data to be detected using the obtained malicious traffic detection model.

In yet another aspect, the present invention provides a malicious traffic detection system based on a PU learning model, including:

a to-be-detected data acquiring unit, configured to obtain traffic data to be detected;

the device according to any one of the above, configured to construct a malicious traffic detection model;

a detection unit, configured to detect the traffic data to be detected using the obtained malicious traffic detection model.

With the above technical solutions, the method and device for constructing a malicious traffic detection model based on PU learning provided by the invention obtain traffic data as a sample data set and train multiple candidate models on it; construct an evaluation set from the sample data set; evaluate each candidate model separately according to the evaluation set and a preset evaluation condition to obtain an evaluation result for each candidate model; and finally select the candidate models whose evaluation results satisfy a preset condition and integrate them according to a preset integration method to obtain the malicious traffic detection model, which can then be used to detect malicious traffic. Compared with existing approaches that examine traffic data in a preset manner, the invention avoids the need for manual intervention: detection of malicious traffic is performed automatically by machine learning, which removes the dependence on manual work in the detection process. Moreover, because the method incorporates a PU learning model, it can discover latent features and patterns of malicious traffic from known malicious traffic, and therefore, when detecting unknown traffic data, it offers better accuracy than earlier detection means that rely only on known features and rules.

The above is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:

Fig. 1 shows a flowchart of a method for constructing a malicious traffic detection model based on PU learning, as proposed in an embodiment of the present invention;

Fig. 2 shows a block diagram of a device for constructing a malicious traffic detection model based on PU learning, as proposed in an embodiment of the present invention;

Fig. 3 shows a block diagram of another device for constructing a malicious traffic detection model based on PU learning, as proposed in an embodiment of the present invention;

Fig. 4 shows a block diagram of a malicious traffic detection system based on a PU learning model, as provided in an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.

An embodiment of the present invention provides a method for constructing a malicious traffic detection model based on PU learning. The method is applied to traffic data that needs to be examined: a PU learning model for detecting malicious traffic is constructed and used to examine the traffic data to be detected, so that malicious traffic detection is realized through machine learning and the excessive reliance on manual work in existing malicious traffic detection is resolved. The specific steps of the method are shown in Fig. 1 and include:

101. Obtain traffic data as a sample data set.

Today's networks carry a large amount of network data, which constitutes network traffic; network traffic can be understood as the volume of data packets and network requests passing through a particular network node. Network traffic contains traffic arising from various malicious behaviors, such as network attacks, traffic fraud, and malicious crawlers. The overwhelming majority of such malicious traffic comes from automated programs, which typically intrude on, interfere with, or scrape another party's business or data by unauthorized means. Network attacks often consume system resources through a large number of requests, causing the database or system to stall so that it can no longer provide service. Traffic fraud typically takes the form of inflating visit and read counts on public accounts, short-video platforms, and live-streaming platforms, or faking order volumes on e-commerce platforms, thereby distorting product rankings. Malicious traffic detection can therefore be understood as detecting traffic of this malicious kind within network traffic.

Because the method described in this embodiment constructs a malicious traffic detection model based on PU learning, sample data for training the model must be obtained before the model is built; the sample data set consists of traffic data. PU learning (positive and unlabeled learning) means training a classification model when only positive sample data and unlabeled sample data are available. In traffic detection as described above, known malicious traffic data is scarce and most traffic data is of unknown nature, which makes PU learning well suited to training the model. Accordingly, a sample data set can be obtained that contains malicious traffic data with a positive label as its positive sample data, together with unlabeled sample data without a label, and a classification model can then be trained on it based on PU learning.
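As a minimal sketch of what such a sample data set could look like, the following Python snippet represents each flow as a feature vector paired with a label that is either 1 (known malicious) or None (unlabeled). The feature values, the `Sample` type alias, and the `split_pu` helper are illustrative assumptions and are not taken from the description.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical representation: (feature_vector, label).
# label == 1    -> positive sample (known malicious traffic)
# label is None -> unlabeled sample (nature unknown)
Sample = Tuple[Sequence[float], Optional[int]]


def split_pu(samples: List[Sample]):
    """Split a PU sample data set into positive and unlabeled feature vectors."""
    positives = [x for x, y in samples if y == 1]
    unlabeled = [x for x, y in samples if y is None]
    return positives, unlabeled


# Toy data; the two features are made up for illustration.
samples: List[Sample] = [
    ([120.0, 0.9], 1),     # known malicious flow
    ([3.0, 0.1], None),    # unlabeled flow
    ([95.0, 0.8], None),   # unlabeled flow that may in fact be malicious
    ([2.0, 0.2], None),    # unlabeled flow
]
positives, unlabeled = split_pu(samples)
```

Note that there is no explicit negative label: the unlabeled pool may contain both benign and malicious flows, which is what distinguishes the PU setting from ordinary binary classification.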

102. Train multiple candidate models on the sample data set.

After the sample data set has been obtained, candidate models can be trained on it. In a typical PU learning workflow, earlier work usually trains a classifier by drawing samples from the unlabeled data and treating them as negative samples. However, models trained with different algorithms, hyperparameters, and training sets differ in how well they detect malicious traffic, and in practice choosing the algorithm and hyperparameters depends on the operator's experience, which sets a high barrier to entry. In this embodiment, therefore, multiple candidate models are trained on the sample data set so that more suitable models can later be chosen from among them. To that end, multiple different training sets must be constructed in this step before the candidate models are trained. In practice the training sets can be chosen in the manner just described or in other ways; for example, part of the positive samples and part of the unlabeled samples can be extracted to serve as the positive sample set and the negative sample set of a training set, respectively. Once multiple different training sets have been obtained, preset machine learning algorithms and hyperparameters can be chosen to train the corresponding candidate models. Specifically, the machine learning algorithm can be selected from a preset set of machine learning algorithms, and the hyperparameters can be obtained from a set of hyperparameter combinations. Each candidate model is determined by one learning algorithm, one selected group of hyperparameters, and one of the multiple training sets.
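The rule that one algorithm, one group of hyperparameters, and one training set determine one candidate model can be sketched as an enumeration over the three sets. The snippet below is only an illustration: the algorithm names, the hyperparameter values, and the `enumerate_candidates` helper are hypothetical, and taking the full Cartesian product is an assumption (the text only says that a selection is made from each set).

```python
from itertools import product


def enumerate_candidates(algorithms, hyperparameter_sets, n_training_sets):
    """Return one candidate specification per (algorithm, hyperparameters, training set)."""
    return [
        {"algorithm": algo, "hyperparameters": params, "training_set": ts}
        for algo, params, ts in product(
            algorithms, hyperparameter_sets, range(n_training_sets)
        )
    ]


# Hypothetical inputs: two algorithms, two hyperparameter groups, three training sets.
specs = enumerate_candidates(
    algorithms=["logistic_regression", "gbdt"],
    hyperparameter_sets=[{"learning_rate": 0.1}, {"learning_rate": 0.01}],
    n_training_sets=3,
)
print(len(specs))  # 2 * 2 * 3 = 12 candidate specifications
```

Each specification would then be passed to a training routine; when the product is too large, sampling a subset of the combinations is an equally plausible reading of the step.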

In addition, to further improve the accuracy of the model constructed by PU learning, as many candidate models as possible can be trained in this embodiment. The number of candidate models is not specifically limited here; in practical applications, the number can be chosen according to actual needs.

103. Construct an evaluation set from the sample data set.

Because step 102 yields multiple candidate models, these candidate models must then be evaluated so that suitable models can be picked out from among them. This step therefore constructs an evaluation set from the sample data set. Specifically, the evaluation set can be obtained by sampling separately from the positive sample data and from the unlabeled sample data. In addition, to make the evaluation results more reliable, different evaluation sets can be obtained through multiple rounds of sampling, so that the candidate models can subsequently be evaluated against multiple evaluation sets.

104. Evaluate each candidate model separately according to the evaluation set and a preset evaluation condition to obtain an evaluation result for each candidate model.

During evaluation, the preset evaluation condition can be chosen according to actual needs. For example, the AUC value can be chosen as the evaluation condition, in which case the AUC value obtained on each evaluation set serves as the evaluation result. The preset evaluation condition is not specifically limited here; any evaluation condition that can be used to assess the models may be chosen.

105. Select the candidate models whose evaluation results satisfy a preset condition.

The preset condition in this step depends on the preset evaluation condition. For example, when the preset evaluation condition in step 104 is the AUC value, the preset condition in this step can be that a candidate model qualifies if its AUC value is greater than a set AUC threshold.

106. Integrate the selected models according to a preset integration method to obtain the malicious traffic detection model.

In practice, step 105 often yields multiple candidate models that satisfy the preset condition. To further ensure the accuracy of the model constructed by PU learning, this step integrates those candidate models. In the integration process, the candidate models can be ranked by evaluation result and assigned weights accordingly, and then integrated according to their weights.

Further, as a refinement and extension of the foregoing embodiment, steps 101-106 can also be carried out in the following specific ways.

When multiple candidate models are trained on the sample data set in step 102, each candidate model is determined by one machine learning algorithm, one group of hyperparameters, and one training set. Therefore, after the sample data set has been obtained, training the multiple candidate models can specifically include: first, constructing multiple training sets from the sample data set for training the candidate models; then, selecting from a set of machine learning algorithms, a set of hyperparameter combinations, and the multiple training sets respectively, and training to obtain multiple candidate models. As noted above, the machine learning algorithms and hyperparameters can be chosen according to the actual situation and are not limited here. When constructing multiple training sets from the sample data set, one positive-sample training subset can first be constructed from at least part of the positive sample data in the sample data set, and multiple sampling operations can be performed on the unlabeled sample data in the sample data set to construct multiple negative-sample training subsets; the positive-sample training subset is then combined with each of the negative-sample training subsets to obtain multiple training sets. Note that, when building the positive-sample part of the training sets, instead of constructing a single positive-sample training subset as described above, multiple positive-sample training subsets can be constructed by extracting part of the positive samples from the positive sample data. Specifically: first, construct multiple positive-sample training subsets from at least part of the positive sample data in the sample data set, and perform multiple sampling operations on the unlabeled sample data in the sample data set to construct multiple negative-sample training subsets; then combine each positive-sample training subset with the multiple negative-sample training subsets respectively to obtain multiple training sets.
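The first variant described above (one positive-sample training subset, several negative-sample training subsets drawn from the unlabeled data) could be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `build_training_sets`, the use of all positives as the positive subset, sampling without replacement within each subset, and the toy feature vectors are not specified by the description.

```python
import random


def build_training_sets(positives, unlabeled, n_negative_subsets, negative_size, seed=0):
    """Combine one positive subset with several sampled 'negative' subsets."""
    rng = random.Random(seed)
    # "At least part of" the positive data; this sketch simply uses all of it.
    positive_subset = list(positives)
    training_sets = []
    for _ in range(n_negative_subsets):
        # Unlabeled samples are treated as negatives for training purposes.
        negatives = rng.sample(list(unlabeled), negative_size)
        features = positive_subset + negatives
        labels = [1] * len(positive_subset) + [0] * len(negatives)
        training_sets.append((features, labels))
    return training_sets


# Toy data: 2 known-malicious flows and 10 unlabeled flows, one feature each.
toy_positives = [[100.0], [90.0]]
toy_unlabeled = [[float(i)] for i in range(10)]
training_sets = build_training_sets(
    toy_positives, toy_unlabeled, n_negative_subsets=3, negative_size=4
)
```

The second variant would additionally draw several positive subsets and pair them with the negative subsets; because the unlabeled pool can contain undetected malicious flows, each sampled negative subset is noisy, which is one motivation for training on several of them.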

Meanwhile when in step 03 based on sample data set construction assessment collection, it is based on the aforementioned mistake in implementation process What is obtained after being trained based on sample data set in journey is multiple candidate families, and for these models, accuracy is Different, therefore, it is also desirable to assess these candidate families, to obtain relatively accurate model, therefore, before assessment It when carrying out the construction of assessment collection, can also carry out in the following manner: firstly, the positive sample number concentrated to the sample data According to sampling building positive sample assessment subset is carried out, it is negative that sampling building is carried out to the unmarked sample data that the sample data is concentrated Then positive sample is assessed subset and negative sample assessment sub-combinations obtains assessment collection by Samples Estimates subset.In addition, in order into The accuracy of the raising assessment result of one step can also construct multiple assessment collection in this step, comment so that later use is multiple Estimate collection repeatedly to assess each candidate family, and determine comprehensive assessment effect according to multiple assessment result, i.e., based on described Sample data set constructs multiple assessment collection, wherein each assessment is concentrated including positive sample data and as negative sample data not Marker samples data.

When each candidate model is evaluated according to the evaluation set and the preset evaluation condition, and multiple evaluation sets have been constructed, evaluation can proceed as follows: first, for each candidate model, evaluate the candidate model against each of the multiple evaluation sets according to the preset evaluation condition to obtain multiple evaluation results; then merge these evaluation results and take the merged value as the final evaluation result for that candidate model. In this embodiment, the preset evaluation condition directly affects both the manner of evaluation and the evaluation result, so different preset evaluation conditions give different kinds of evaluation result. For example, when the preset evaluation condition is the maximum-margin method, the evaluation result for each candidate model is the classification margin of its prediction results on the evaluation set; when the preset evaluation condition is the method of computing an AUC value, the evaluation result for each candidate model is its AUC value on the evaluation set. The AUC value can be understood as a probability: if one positive sample and one negative sample are selected at random, the AUC value is the probability that the current classification algorithm, according to the scores it computes, ranks the positive sample ahead of the negative sample. The larger the AUC value, the more likely the current classification model is to rank positive samples ahead of negative samples, and thus the better it classifies.
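The probabilistic reading of AUC given above translates directly into a pairwise count. The following sketch computes AUC that way and merges the per-set values for one candidate model. Counting ties as one half and merging by arithmetic mean are assumptions (the text only says the results are merged), and `auc`, `merged_auc`, and the toy scoring function are hypothetical names.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties contribute 0.5, a common convention that the description does not specify.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def merged_auc(score_fn, evaluation_sets):
    """Evaluate one candidate on every evaluation set and average the AUC values."""
    values = [auc([score_fn(x) for x in features], labels)
              for features, labels in evaluation_sets]
    return sum(values) / len(values)


# Toy check: positives scored 0.9 and 0.3, negatives scored 0.8 and 0.1.
# Three of the four positive/negative pairs are ordered correctly.
single = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])

# Toy candidate whose score is simply the first feature.
toy_sets = [
    ([[0.9], [0.8], [0.3], [0.1]], [1, 0, 1, 0]),
    ([[0.9], [0.1]], [1, 0]),
]
final_result = merged_auc(lambda x: x[0], toy_sets)
```

The pairwise loop is quadratic in the evaluation set size, which is acceptable for small sampled sets; a rank-based computation would be preferable for large ones.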

When candidate models whose assessment results meet the preset condition are selected, the manner of selection likewise depends on which preset evaluation condition was used. On the one hand, when the preset evaluation condition is the maximum-margin method, the assessment result of each candidate model is the classification margin of that model's prediction results on the assessment set, and this step may be: selecting the candidate models whose classification margin is greater than a preset value. On the other hand, when the preset evaluation condition is the method of computing the AUC value, the assessment result of each candidate model is that model's AUC value on the assessment set, and this step may be: selecting the candidate models whose AUC value is greater than a preset value.
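
By way of example only, the threshold-based selection described above might look like the following. The model names, the assessment results, and the preset value of 0.8 are invented for illustration.

```python
def select_candidates(assessment_results, preset_value):
    """Keep candidates whose assessment result (margin or AUC) exceeds the preset value."""
    return {
        name: result
        for name, result in assessment_results.items()
        if result > preset_value
    }

# Hypothetical AUC values for three candidate models.
assessment_results = {"model_a": 0.91, "model_b": 0.62, "model_c": 0.84}
selected = select_candidates(assessment_results, preset_value=0.8)
```

The same function applies unchanged when the assessment result is a classification margin, since in both cases a larger value is taken to indicate a better candidate.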

In practice, after the candidate models whose assessment results meet the preset condition have been selected, there are often several such candidate models, and their accuracy is not identical. In this case the selected models need to be integrated. To make the integrated malicious traffic detection model more accurate, a corresponding weight may be chosen for each candidate model according to its assessment result when integrating. Therefore, the process of integrating the selected models according to the preset integration method to obtain the malicious traffic detection model may specifically be: assigning a corresponding weight value to each selected candidate model according to its assessment result, and integrating the selected candidate models according to the weight values.
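
A minimal sketch of such weighted integration is given below, assuming that each selected candidate model exposes a scoring function and that the weights are made proportional to the assessment results. The proportional weighting rule, the feature name `req_per_min`, and the toy models are assumptions for illustration; the description leaves the exact weighting scheme open.

```python
def integrate(models, assessment_results):
    """Return a scoring function that is a weighted average of the selected models.

    Weights are proportional to each model's assessment result (an assumed rule).
    """
    total = sum(assessment_results[name] for name in models)
    weights = {name: assessment_results[name] / total for name in models}

    def detection_model(record):
        return sum(weights[name] * models[name](record) for name in models)

    return detection_model

# Two hypothetical candidate models that score a traffic record in [0, 1].
models = {
    "model_a": lambda r: 0.8 if r["req_per_min"] > 100 else 0.1,
    "model_b": lambda r: 0.4 if r["req_per_min"] > 100 else 0.2,
}
assessment_results = {"model_a": 0.9, "model_b": 0.6}

detection_model = integrate(models, assessment_results)
score = detection_model({"req_per_min": 300})  # 0.6 * 0.8 + 0.4 * 0.4
```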

In addition, in combination with a specific application scenario, an embodiment of the present invention further provides a method for detecting malicious traffic, so as to detect malicious traffic in traffic data. The method may be implemented as follows:

First, the traffic data to be detected may be obtained as sample data; the known malicious traffic therein serves as the labeled positive sample data, and the remaining unknown traffic data serve as the unlabeled sample data.

Then, the positive sample data and unlabeled sample data determined above are used as the sample data, on the basis of which a malicious traffic detection model for detecting malicious traffic is constructed. The construction may follow the steps of the foregoing embodiment, specifically:

First, traffic data are obtained as a sample data set, the sample data set including positive sample data with a positive label and unlabeled sample data without a label, where the positive label denotes malicious traffic.

Second, multiple candidate models are obtained by training on the sample data set. Each candidate model is trained from one machine learning algorithm, one set of hyperparameters, and one training set, and different algorithms differ in how well they detect malicious traffic. Therefore, when training the candidate models, multiple training sets may first be constructed and one model trained on each, so as to obtain multiple candidate models for the malicious traffic detection model.

Third, an assessment set is constructed based on the sample data set. Because the candidate models differ in their accuracy at detecting malicious traffic, a suitable model can be selected by assessment; therefore, before assessment, multiple assessment sets need to be constructed from the sample data set in the manner described for this step.

Fourth, each candidate model is assessed according to the assessment set and the preset evaluation condition, obtaining an assessment result for each candidate model. The assessment is carried out on the basis of an evaluation condition, so any existing evaluation condition may be used when choosing, from the multiple candidate models, the models suitable for malicious traffic detection. For example, when the preset evaluation condition is the maximum-margin method, the assessment result of each candidate model is the classification margin of that model's prediction results on the assessment set; when the preset evaluation condition is the method of computing the AUC value, the assessment result of each candidate model is that model's AUC value on the assessment set.

Fifth, the candidate models whose assessment results meet the preset condition are selected. Because different evaluation conditions yield different kinds of assessment results, the manner of selecting the candidate models that meet the preset condition also differs. For example, when the assessment result is the classification margin, selecting the candidate models whose assessment results meet the preset condition includes: selecting the candidate models whose classification margin is greater than a preset value; when the assessment result is the AUC value, it includes: selecting the candidate models whose AUC value is greater than a preset value.

Sixth, the selected models are integrated according to the preset integration method to obtain the malicious traffic detection model. Specifically, when integrating, a corresponding weight value may be assigned to each selected candidate model according to its assessment result, and the selected candidate models integrated according to the weight values.

Finally, the malicious traffic detection model is used to detect malicious traffic, so that, starting from only a small amount of known malicious traffic, latent patterns are learned and used to detect malicious traffic among the unknown traffic.

Specifically, during detection, the malicious traffic detection model may be used to score each item of traffic data to be detected, obtaining a score for each item; the traffic data are then sorted by score, and which items are malicious traffic or potential malicious traffic is determined from the resulting order.
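
As an illustrative sketch of the scoring and sorting operation described above, the following assumes a hypothetical scoring function in place of the constructed detection model and flags the top-ranked records. The cut-off `top_k`, the record fields, and the scoring rule are assumptions; the description states only that the determination is made according to the sorted order.

```python
def rank_traffic(records, score_fn, top_k):
    """Score every record, sort by descending score, and return the top_k records."""
    ranked = sorted(records, key=score_fn, reverse=True)
    return ranked[:top_k]

# Hypothetical stand-in for the detection model: higher request rate, higher score.
def score_fn(record):
    rate = record["req_per_min"]
    return rate / (rate + 60)

records = [
    {"id": "a", "req_per_min": 600},
    {"id": "b", "req_per_min": 5},
    {"id": "c", "req_per_min": 120},
    {"id": "d", "req_per_min": 30},
]
flagged = rank_traffic(records, score_fn, top_k=2)
flagged_ids = [r["id"] for r in flagged]
```

A score threshold could be used instead of a fixed `top_k`; either choice is a deployment decision outside the scope of this sketch.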

In addition, to ensure that malicious traffic can subsequently be identified and guarded against accurately, the features of the data detected as malicious traffic by the malicious traffic detection model may be counted and summarized, so as to obtain a feature or feature set of malicious traffic, which then serves as a basis for guarding against and identifying malicious traffic.
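
One simple way to carry out such feature statistics, offered only as an example, is to count the most frequent value of each field among the flagged records. The field names `user_agent` and `dst_port` and the helper `summarize_features` are hypothetical.

```python
from collections import Counter

def summarize_features(flagged_records, fields):
    """For each field, return the most common value among flagged records and its count."""
    return {
        field: Counter(r[field] for r in flagged_records).most_common(1)[0]
        for field in fields
    }

# Hypothetical records that the detection model flagged as malicious.
flagged_records = [
    {"user_agent": "curl", "dst_port": 443},
    {"user_agent": "curl", "dst_port": 80},
    {"user_agent": "bot", "dst_port": 443},
]
summary = summarize_features(flagged_records, ["user_agent", "dst_port"])
```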

Further, as an implementation of the above method for constructing a malicious traffic detection model based on PU learning, an embodiment of the present invention provides a device for constructing a malicious traffic detection model based on PU learning. The device is mainly used for detecting traffic data that need to be detected; its purpose is to detect the traffic data to be detected by constructing a PU learning model for detecting malicious traffic, thereby realizing machine-learning-based malicious traffic detection and solving the problem of excessive reliance on manual work in existing malicious traffic detection. For ease of reading, this device embodiment does not repeat the details of the foregoing method embodiment one by one, but it should be understood that the device in this embodiment can correspondingly implement everything in the foregoing method embodiment. The device is shown in Fig. 2 and specifically includes:

Acquiring unit 21, which may be used to obtain traffic data as a sample data set, the sample data set including positive sample data with a positive label and unlabeled sample data without a label, where the positive label denotes malicious traffic;

Training unit 22, which may be used to train on the sample data set obtained by the acquiring unit 21 to obtain multiple candidate models;

Structural unit 23, which may be used to construct an assessment set based on the sample data set obtained by the acquiring unit 21;

Assessment unit 24, which may be used to assess each candidate model trained by the training unit 22 according to the assessment set constructed by the structural unit 23 and the preset evaluation condition, obtaining an assessment result for each candidate model;

Selecting unit 25, which may be used to select the candidate models whose assessment results, as obtained by the assessment unit 24, meet the preset condition;

Integration unit 26, which may be used to integrate the models selected by the selecting unit 25 according to the preset integration method, obtaining the malicious traffic detection model.

Further, as shown in Fig. 3, the training unit 22 includes:

Building module 221, which may be used to construct multiple training sets based on the sample data set;

Training module 222, which may be used to select respectively from a set of machine learning algorithms, a set of hyperparameter combinations, and the multiple training sets constructed by the building module 221, and to train multiple candidate models; wherein one machine learning algorithm, one set of hyperparameters, and one training set determine one candidate model.
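
To illustrate how one algorithm, one set of hyperparameters, and one training set together determine one candidate model, the following sketch enumerates candidate configurations. Taking the full Cartesian product is an assumption (a subset of combinations could equally be chosen), and the algorithm names, hyperparameter values, and training set identifiers are placeholders.

```python
from itertools import product

def enumerate_candidates(algorithms, hyperparameter_sets, training_sets):
    """List every (algorithm, hyperparameters, training set) triple as one candidate."""
    return [
        {"algorithm": a, "hyperparameters": h, "training_set": t}
        for a, h, t in product(algorithms, hyperparameter_sets, training_sets)
    ]

# Placeholder names; actual training of each configuration is omitted here.
algorithms = ["logistic_regression", "gradient_boosting"]
hyperparameter_sets = [{"learning_rate": 0.1}, {"learning_rate": 0.01}]
training_sets = ["train_set_0", "train_set_1", "train_set_2"]

candidates = enumerate_candidates(algorithms, hyperparameter_sets, training_sets)
```

With two algorithms, two hyperparameter sets, and three training sets, this yields twelve candidate configurations, each of which would then be trained into one candidate model.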

Further, as shown in Fig. 3, the building module 221 includes:

First building submodule 2211, which may be used to build one positive-sample training subset from at least part of the positive sample data in the sample data set, to perform multiple sampling operations on the unlabeled sample data in the sample data set to build multiple negative-sample training subsets, and to combine the positive-sample training subset with each of the multiple negative-sample training subsets to obtain multiple training sets;

Second building submodule 2212, which may be used to build multiple positive-sample training subsets from at least part of the positive sample data in the sample data set, to perform multiple sampling operations on the unlabeled sample data in the sample data set to build multiple negative-sample training subsets, and to combine each positive-sample training subset with the multiple negative-sample training subsets respectively to obtain multiple training sets.
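
The two ways of building training sets described for the building submodules can be sketched with one helper, shown below as a non-limiting example. Setting `n_pos_subsets=1` corresponds to the first submodule and a larger value to the second. Pairing every positive subset with every negative subset is one reading of "combined respectively" and is an assumption, as are the function name and subset sizes.

```python
import random

def build_training_sets(positives, unlabeled, n_pos_subsets, n_neg_subsets,
                        pos_size, neg_size, seed=0):
    """Combine sampled positive subsets with sampled unlabeled-as-negative subsets."""
    rng = random.Random(seed)
    pos_subsets = [rng.sample(positives, pos_size) for _ in range(n_pos_subsets)]
    neg_subsets = [rng.sample(unlabeled, neg_size) for _ in range(n_neg_subsets)]
    return [
        [(x, 1) for x in pos] + [(x, 0) for x in neg]
        for pos in pos_subsets
        for neg in neg_subsets
    ]

positives = ["p1", "p2", "p3", "p4"]
unlabeled = ["u%d" % i for i in range(8)]

# One positive subset combined with three negative subsets.
sets_one_pos = build_training_sets(positives, unlabeled, 1, 3, pos_size=3, neg_size=4)
# Two positive subsets, each combined with three negative subsets.
sets_multi_pos = build_training_sets(positives, unlabeled, 2, 3, pos_size=3, neg_size=4)
```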

Further, as shown in Fig. 3, the structural unit 23 may be specifically used to sample the positive sample data in the sample data set to build a positive-sample assessment subset, to sample the unlabeled sample data in the sample data set to build a negative-sample assessment subset, and to combine the positive-sample assessment subset and the negative-sample assessment subset to obtain the assessment set.

Further, as shown in Fig. 3, the structural unit 23 may be specifically used to construct multiple assessment sets based on the sample data set, where each assessment set includes positive sample data and unlabeled sample data serving as negative sample data;

The assessment unit 24 may further be specifically used to, for each candidate model, assess the candidate model on each of the multiple assessment sets constructed by the structural unit 23 according to the preset evaluation condition, obtain multiple assessment results, and fuse the multiple assessment results to obtain the final assessment result of the candidate model.

Further, as shown in Fig. 3, when the preset evaluation condition is the maximum-margin method, the assessment result of each candidate model is the classification margin of that model's prediction results on the assessment set;

The selecting unit 25 may be specifically used to select the candidate models whose classification margin is greater than a preset value.

Further, as shown in Fig. 3, when the preset evaluation condition is the method of computing the AUC value, the assessment result of each candidate model is that model's AUC value on the assessment set;

The selecting unit 25 may further be specifically used to select the candidate models whose AUC value is greater than a preset value.

Further, as shown in Fig. 3, the integration unit 26 may be specifically used to assign a corresponding weight value to each selected candidate model according to its assessment result, and to integrate the selected candidate models according to the weight values.

Further, as an implementation of the above malicious traffic detection function, an embodiment of the present invention provides a malicious traffic detection system based on a PU learning model. The system is mainly used for detecting traffic data that need to be detected; its purpose is to detect the traffic data to be detected by constructing a PU learning model for detecting malicious traffic, thereby realizing machine-learning-based malicious traffic detection and solving the problem of excessive reliance on manual work in existing malicious traffic detection. For ease of reading, this system embodiment does not repeat the details of the foregoing method embodiment one by one, but it should be understood that the system in this embodiment can correspondingly implement everything in the foregoing method embodiment. The system is shown in Fig. 4 and specifically includes:

To-be-detected data acquiring unit 41, which may be used to obtain the traffic data to be detected;

Device 42 for constructing a malicious traffic detection model based on PU learning, which may be used to construct the malicious traffic detection model from the traffic data to be detected obtained by the to-be-detected data acquiring unit 41; the device 42 may specifically be the device shown in Fig. 2 or Fig. 3;

Detection unit 43, which may be used to detect the traffic data to be detected using the malicious traffic detection model obtained by the device 42 for constructing a malicious traffic detection model based on PU learning.

Further, an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more computing devices, implements the above method for constructing a malicious traffic detection model based on PU learning.

In addition, an embodiment of the present invention also provides a system including one or more computing devices and one or more storage devices, wherein a computer program is recorded on the one or more storage devices, and the computer program, when executed by the one or more computing devices, causes the one or more computing devices to implement the above method for constructing a malicious traffic detection model based on PU learning.

In summary, the method and device for constructing a malicious traffic detection model based on PU learning proposed by the embodiments of the present invention obtain traffic data as a sample data set and train on it to obtain multiple candidate models; construct an assessment set based on the sample data set; assess each candidate model according to the assessment set and a preset evaluation condition to obtain an assessment result for each candidate model; and finally select the candidate models whose assessment results meet a preset condition and integrate the selected models according to a preset integration method to obtain the malicious traffic detection model, which can then be used to detect malicious traffic. Compared with existing approaches that detect traffic data in a predetermined manner, the present invention avoids manual intervention and performs malicious traffic detection automatically through machine learning, solving the problem of dependence on manual work in the detection process. Moreover, because the method combines a PU learning model, it can discover the latent features and patterns of malicious traffic from known malicious traffic, so that when unknown traffic data are detected it achieves better accuracy than earlier detection means that rely only on known features and rules.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

It is understood that the related features of the above method and device may be referred to one another. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not indicate that any embodiment is superior or inferior to another.

It will be apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, device, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the above description of a specific language is made to disclose the best mode of the invention.

In addition, the memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not preclude the existence of additional identical elements in the process, method, article, or device that includes the element.

The above are merely embodiments of the present application and are not intended to limit the present application. Various modifications and changes to the present application are possible for those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims

1. A method for constructing a malicious traffic detection model based on PU learning, wherein the method comprises:

obtaining traffic data as a sample data set, the sample data set including positive sample data with a positive label and unlabeled sample data without a label, wherein the positive label denotes malicious traffic;

training on the sample data set to obtain multiple candidate models;

constructing an assessment set based on the sample data set;

assessing each candidate model according to the assessment set and a preset evaluation condition to obtain an assessment result for each candidate model;

selecting the candidate models whose assessment results meet a preset condition;

integrating the selected models according to a preset integration method to obtain the malicious traffic detection model.

2. The method of claim 1, wherein training on the sample data set to obtain multiple candidate models comprises:

constructing multiple training sets based on the sample data set;

selecting respectively from a set of machine learning algorithms, a set of hyperparameter combinations, and the multiple training sets, and training to obtain multiple candidate models; wherein one machine learning algorithm, one set of hyperparameters, and one training set determine one candidate model.

3. The method of claim 2, wherein constructing multiple training sets based on the sample data set comprises:

building one positive-sample training subset from at least part of the positive sample data in the sample data set, performing multiple sampling operations on the unlabeled sample data in the sample data set to build multiple negative-sample training subsets, and combining the positive-sample training subset with each of the multiple negative-sample training subsets to obtain multiple training sets;

or,

building multiple positive-sample training subsets from at least part of the positive sample data in the sample data set, performing multiple sampling operations on the unlabeled sample data in the sample data set to build multiple negative-sample training subsets, and combining each positive-sample training subset with the multiple negative-sample training subsets respectively to obtain multiple training sets.

4. The method of claim 1, wherein constructing an assessment set based on the sample data set comprises:

sampling the positive sample data in the sample data set to build a positive-sample assessment subset, sampling the unlabeled sample data in the sample data set to build a negative-sample assessment subset, and combining the positive-sample assessment subset and the negative-sample assessment subset to obtain the assessment set.

5. The method of claim 1, wherein

constructing an assessment set based on the sample data set comprises: constructing multiple assessment sets based on the sample data set, wherein each assessment set includes positive sample data and unlabeled sample data serving as negative sample data;

assessing each candidate model according to the assessment set and the preset evaluation condition to obtain an assessment result for each candidate model comprises: for each candidate model, assessing the candidate model on each of the multiple assessment sets according to the preset evaluation condition to obtain multiple assessment results, and fusing the multiple assessment results to obtain the final assessment result of the candidate model.

6. A malicious traffic detection method based on a PU learning model, wherein

traffic data to be detected are obtained;

a malicious traffic detection model is constructed according to the method of any one of claims 1 to 5;

7. A device for constructing a malicious traffic detection model based on PU learning, wherein the device comprises:

an acquiring unit, configured to obtain traffic data as a sample data set, the sample data set including positive sample data with a positive label and unlabeled sample data without a label, wherein the positive label denotes malicious traffic;

an assessment unit, configured to assess each candidate model according to the assessment set and a preset evaluation condition to obtain an assessment result for each candidate model;

an integration unit, configured to integrate the selected models according to a preset integration method to obtain the malicious traffic detection model.

8. A malicious traffic detection system based on a PU learning model, wherein the system comprises:

a to-be-detected data acquiring unit, configured to obtain traffic data to be detected;

the device of claim 7, configured to construct a malicious traffic detection model;

a detection unit, configured to detect the traffic data to be detected using the obtained malicious traffic detection model.

9. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by one or more computing devices, implements the method of any one of claims 1 to 6.

10. A system including one or more computing devices and one or more storage devices, wherein a computer program is recorded on the one or more storage devices, and the computer program, when executed by the one or more computing devices, causes the one or more computing devices to implement the method of any one of claims 1 to 6.