CN109714341A

CN109714341A - Web malicious attack identification method, terminal device and storage medium

Info

Publication number: CN109714341A
Application number: CN201811619182.5A
Authority: CN
Inventors: 陈奋; 陈荣有; 程长高; 姚鸿富; 吴顺祥; 高云龙; 陈柏华
Original assignee: Xiamen Service Cloud Mdt Infotech Ltd
Current assignee: Xiamen Service Cloud Mdt Infotech Ltd
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2019-05-03

Abstract

The present invention relates to the field of Web security technology and proposes a Web malicious attack identification method, a terminal device and a storage medium. The method comprises two major stages, model building and data identification. Model building comprises: Step 1: collecting blacklist and whitelist sample data from a large volume of Web access data, uniformly decoding the network addresses in the sample data, and then applying character processing to the decoded network addresses; Step 2: extracting features from the sample data processed in Step 1 using the TF-IDF algorithm and calculating the feature value of each sample; Step 3: training with the support vector machine algorithm on the feature values of the blacklist and whitelist sample data, and obtaining and saving the trained classification model, which is used to distinguish blacklist data from whitelist data. The invention applies TF-IDF and support vector machines to Web security detection so that malicious attack requests can be identified quickly.

Description

Web malicious attack identification method, terminal device and storage medium

Technical field

The present invention relates to the field of Web security technology, and in particular to a Web malicious attack identification method, a terminal device and a storage medium.

Background art

With the development of the global Internet, the world has entered an era of high-speed information. Through the network, people can easily browse and share huge amounts of data; at the same time, more and more enterprises implement their core business as Web applications, which ties the fortunes of enterprises closely to network security and, in turn, to the lives of the general public. However, because of the openness and uncontrollability of the Web itself, security incidents in which hackers exploit network vulnerabilities occur one after another. Recently Radware, a leading global provider of network security and application delivery solutions, released its second annual Web application security survey report, on the state of Web application security in 2018. The report points out that most enterprises (67%) believe that hackers can still break into their networks. It also points out that at least 89% of respondents experienced attacks against Web applications or Web servers in the previous year; in particular, the share of respondents reporting encrypted Web attacks rose from 12% in 2017 to 50% in 2018. Most respondents (59%) reported attacks on a daily or weekly basis. As the frequency and complexity of Web attacks keep increasing, the challenges facing traditional Web defenses grow with them, and their shortcomings are becoming increasingly apparent.

Up to now, traditional Web defenses have essentially relied on rule-based blacklist detection: whether a Web application firewall or an IDS, they depend on the regular expressions of the detection engine to match messages. Although they can withstand most attacks, the following problems remain:

1. The rule base is hard to maintain. Attackers use more and more variants of each attack, for example different encodings, case changes and substitute statements, which may bypass detection and enable all kinds of variant attacks. Adding signature rules for every such variant bloats the signature database and makes it hard to maintain.

2. Rules are demanding to write. Rules that are too broad easily cause false positives; rules that are too narrow are easily bypassed.

3. When the number of regular expressions is too large, protection performance is seriously degraded.

4. Protection against new attack techniques is poor.

From this analysis of traditional rule-based blacklist detection, it follows that quickly and accurately identifying malicious attack requests among massive volumes of requests is the problem that currently needs to be solved.

Summary of the invention

In view of the above problems, the present invention aims to provide a Web malicious attack identification method, a terminal device and a storage medium that introduce machine learning techniques into the field of Web security and apply TF-IDF and support vector machines to Web security detection, so that malicious attack requests can be identified quickly.

The specific scheme is as follows:

A Web malicious attack identification method, comprising the following steps:

(1) Establishing the classification model

Step 1: collect blacklist and whitelist sample data from a large volume of Web access data; uniformly decode the network addresses in the sample data into a single encoding format; then apply character processing to the decoded network addresses to eliminate the influence of meaningless characters and to unify the format;

Step 2: extract features from the sample data processed in Step 1 using the TF-IDF algorithm and calculate the feature value of each sample;

Step 3: train with the support vector machine algorithm on the feature values of the blacklist and whitelist sample data, and obtain and save the trained classification model, which is used to distinguish blacklist data from whitelist data;

(2) Data identification

Step 4: decode the network address of the received access data into the encoding format used in Step 1, and apply character processing to the decoded network address;

Step 5: extract features from the data processed in Step 4 using the TF-IDF algorithm and calculate the feature value of the data;

Step 6: according to the feature value of the data, identify the access data with the classification model and judge whether it belongs to the blacklist data.

Further, the character processing is: uniformly converting all letters to uppercase or to lowercase, and uniformly replacing all Chinese characters and digits with a specific character, the specific character being different from every character in the network address other than Chinese characters and digits.

Further, the feature value is calculated as follows:

(1) Set the word length to s, and split the data in sequence into words of length s;

(2) Calculate the term frequency TF of each word: TF = 1 + ln(N), where N is the number of times the word occurs in the data;

(3) Calculate the inverse document frequency IDF of each word: IDF = 1 + ln(p/q), where p is the total number of data items and q is the number of data items containing the word;

(4) Calculate the feature value TF-IDF of the data:

Further, the word length is s = 3.

Further, in Step 3, the 1000 sample data with the largest TF-IDF values from Step 2 are selected as the training data for the support vector machine algorithm.

A Web malicious attack identification terminal device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above method of the embodiments of the present invention.

A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above method of the embodiments of the present invention.

By adopting the above technical solution, the present invention introduces machine learning techniques into the field of Web security and applies TF-IDF and support vector machines to Web security detection so that malicious attack requests can be identified quickly. The resulting model has very high prediction accuracy for SQL injection and XSS attacks, and is capable of recognizing variant attacks, recognizing new attack patterns, and performing semantic analysis.

Brief description of the drawings

Fig. 1 is a flow diagram of Embodiment One of the present invention.

Fig. 2 is a schematic diagram of the support vector machine algorithm of Embodiment One of the present invention.

Detailed description of the embodiments

To further illustrate the embodiments, the present invention is provided with drawings. These drawings are part of the disclosure of the invention; they mainly illustrate the embodiments and, together with the relevant description in the specification, explain the operating principles of the embodiments. With reference to these contents, those of ordinary skill in the art will be able to understand other possible embodiments and the advantages of the present invention.

The present invention is further described below with reference to the drawings and specific embodiments.

Embodiment one:

With reference to Fig. 1, the present invention provides a Web malicious attack identification method, comprising the following steps:

(1) Establishing the classification model

Step 1: collect blacklist and whitelist sample data from a large volume of Web access data; uniformly decode the network addresses (URLs) in the sample data into a single encoding format; then apply character processing to the decoded network addresses to eliminate the influence of meaningless characters and to unify the format.

Those skilled in the art can configure the character processing as required. In this embodiment it is specifically: uniformly converting all letters to uppercase or to lowercase, and uniformly replacing all Chinese characters and digits with a specific character, the specific character being different from every character in the network address other than Chinese characters and digits.

Uniformly replacing Chinese characters and digits with a specific character removes the influence of useless data: for Web blacklist judgment, Chinese characters and digits are meaningless characters, so replacing them with a specific character simplifies feature extraction and speeds up identification.

In this embodiment all letters are converted to lowercase, so the specific character is set to the uppercase letter "N". Those skilled in the art may also choose another character.
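The character processing of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name normalize_url is hypothetical, the URL is assumed to be percent-encoded UTF-8 and is decoded only once, "Chinese characters" is assumed to mean the CJK Unified Ideographs range U+4E00 to U+9FFF, and each matching character is assumed to be replaced one-for-one by the placeholder.

```python
import re
from urllib.parse import unquote

# One CJK ideograph or one ASCII digit (assumed reading of "Chinese and digits").
_CJK_OR_DIGIT = re.compile(r"[\u4e00-\u9fff0-9]")


def normalize_url(raw: str, placeholder: str = "N") -> str:
    """Decode a URL, lowercase its letters, and mask Chinese characters and digits."""
    decoded = unquote(raw)      # percent-decoding, UTF-8 by default
    lowered = decoded.lower()   # lowercase first, so the uppercase placeholder stays unique
    return _CJK_OR_DIGIT.sub(placeholder, lowered)


print(normalize_url("/Index.PHP?id=123&name=%E5%BC%A0"))
```

Lowercasing before substitution matters here: the uppercase "N" can then never collide with a letter that was present in the original address.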

Step 2: extract features from the sample data processed in Step 1 using the TF-IDF algorithm and calculate the feature value of each sample.

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It assesses how important a word is to one document in a document set or corpus. In this embodiment it is adapted to the field of Web security: this statistical method is used to extract features from a large volume of Web access data and obtain the most representative key words, thereby performing the feature conversion.

(1) During feature conversion, the term frequency TF of the sample data processed in Step 1 is counted first. In the specific implementation, three characters are taken as the length of one word, and smoothing and normalization of the data are also applied in order to improve the prediction accuracy of classification after feature conversion.

With smoothing, the formula for the term frequency TF becomes:

TF = 1 + ln(N)

where N is the number of times a given word occurs in the data.

A specific example is described in detail below:

Suppose an access datum is "/css/css_js.php". With three characters as the word length, it can be split into 13 words: /cs, css, ss/, s/c, /cs, css, ss_, s_j, _js, js., s.p, .ph, php. Of these, "/cs" and "css" each occur twice and the remaining nine words each occur once. Applying the TF formula gives a TF value of 1.693 for "/cs" and "css" and a TF value of 1 for every other word.
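The word splitting and smoothed term frequency of this example can be reproduced with a short sketch. The function names split_words and term_frequencies are illustrative, and the sliding one-character step is inferred from the 13 words listed in the example.

```python
import math
from collections import Counter


def split_words(data: str, s: int = 3) -> list[str]:
    """Slide a window of length s over the string, one character at a time."""
    return [data[i:i + s] for i in range(len(data) - s + 1)]


def term_frequencies(data: str, s: int = 3) -> dict[str, float]:
    """Smoothed TF = 1 + ln(N), where N is how often the word occurs in data."""
    counts = Counter(split_words(data, s))
    return {word: 1.0 + math.log(n) for word, n in counts.items()}


words = split_words("/css/css_js.php")
tf = term_frequencies("/css/css_js.php")
print(len(words), len(tf))                     # total words, distinct words
print(round(tf["/cs"], 3), round(tf["php"], 3))
```

The logarithm damps repeated words: a word that occurs twice scores 1 + ln(2), not twice the score of a word that occurs once.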

(2) In the above example, the TF values of "/cs", "css" and the other words can serve as the dimensions of the access data after feature conversion and be supplied to the classification algorithm for detection. Because "/cs" and "css" occur more frequently than the other words, they would play a larger role in detection. However, if such words occur in large numbers in both the blacklist and whitelist samples, they are poorly representative for distinguishing the two, and the role they play in detection becomes very small. A feature conversion that considers only term frequency therefore has difficulty capturing the key features of blacklist and whitelist samples, and hence has difficulty distinguishing normal from abnormal data in Web access. Conversely, if a word occurs only in blacklist samples, then even though it occurs few times relative to the total number of samples, it still carries a high weight when detecting blacklist samples. Each word is therefore given a weight according to how representative it is in the blacklist and whitelist samples: the greater a word's ability to predict normal versus abnormal data, the greater its weight, and vice versa. For example, if the word "css" occurs only in blacklist samples, its weight when predicting blacklist samples is larger; otherwise it is smaller. In this embodiment, this weight is measured by the inverse document frequency IDF.

The formula for IDF is set as:

IDF = 1 + ln(p/q)

where p is the total number of samples and q is the number of samples containing the word.

Suppose there are 100,000 samples, with comparable amounts of data of each type, of which 200 samples contain the word "css" and 1,000 samples contain "/cs". Then:

The weight of "css" in the samples is: IDF = 1 + ln(100000/200) = 7.215,

The weight of "/cs" in the samples is: IDF = 1 + ln(100000/1000) = 5.605.
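The two IDF values above can be checked with a few lines of code. The helper idf_table, which derives q for every word from a list of samples, is a hypothetical illustration that assumes three-character words; the text does not describe how the counts are gathered.

```python
import math
from collections import Counter


def idf(p: int, q: int) -> float:
    """IDF = 1 + ln(p / q) for p samples in total, q of them containing the word."""
    return 1.0 + math.log(p / q)


def idf_table(samples: list[str], s: int = 3) -> dict[str, float]:
    """Count, for every word, how many samples contain it, then apply idf()."""
    containing = Counter()
    for sample in samples:
        # A set, so a word repeated inside one sample is counted only once.
        containing.update({sample[i:i + s] for i in range(len(sample) - s + 1)})
    return {word: idf(len(samples), q) for word, q in containing.items()}


print(round(idf(100_000, 200), 3), round(idf(100_000, 1_000), 3))
```

The rarer word receives the larger weight, which is the behaviour the paragraph above argues for.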

(3) From the TF values calculated above and the IDF values just introduced, the TF-IDF value of each word is calculated:

TF-IDF = TF * IDF

Thus the TF-IDF values of "/cs" and "css" are 9.489 and 12.215 respectively.

These results show that the word "css" plays the larger role in the detection process.
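The products above can be verified numerically. The function name tf_idf is illustrative. Note that at full precision the value for "/cs" rounds to 9.490; the 9.489 quoted in the text is what results from multiplying the already-rounded factors 1.693 and 5.605.

```python
import math


def tf_idf(n: int, p: int, q: int) -> float:
    """TF-IDF of one word: (1 + ln n) * (1 + ln(p / q))."""
    return (1.0 + math.log(n)) * (1.0 + math.log(p / q))


# Full-precision values for the two words of the example.
print(round(tf_idf(2, 100_000, 1_000), 3))   # "/cs"
print(round(tf_idf(2, 100_000, 200), 3))     # "css"

# Products of the factors as rounded in the text.
print(round(1.693 * 5.605, 3), round(1.693 * 7.215, 3))
```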

(4) The feature value of the sample data, i.e. its TF-IDF value, is calculated according to the following formula:

where n is the number of words contained in the sample data.

(5) The feature value of the sample data is normalized; in this embodiment the Frobenius norm is used, with the following formula:

With the above method, this embodiment takes into account both term frequency and the "representativeness" of each word in the samples: each datum is split into words three characters long, the combined TF-IDF value of each word is calculated as its feature value, and the feature conversion is thereby achieved.
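The formulas of items (4) and (5) are not reproduced in this text, so the following is only one plausible reading, stated as an assumption: each sample is represented by the vector of its per-word TF-IDF values, and that vector is divided by its Frobenius norm, which for a vector is the ordinary Euclidean norm. The function name tfidf_vector and the toy IDF table are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(data: str, idf: dict[str, float], s: int = 3) -> dict[str, float]:
    """Sparse TF-IDF vector of one sample, scaled to unit Euclidean norm."""
    counts = Counter(data[i:i + s] for i in range(len(data) - s + 1))
    # Words missing from the IDF table (unseen during training) are skipped.
    raw = {w: (1.0 + math.log(n)) * idf[w] for w, n in counts.items() if w in idf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm > 0 else raw


vec = tfidf_vector("abcd", {"abc": 2.0, "bcd": 1.0})
print(vec)
```

Scaling to unit norm makes long and short URLs comparable, since the vector length no longer grows with the number of words in the address.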

Step 3: train with the support vector machine algorithm on the feature values of the blacklist and whitelist sample data; through parameter tuning, obtain the optimal trained classification model and save it. The classification model is used to distinguish blacklist data from whitelist data.

The higher the dimensionality, the more complex the computation; when the training data is huge this easily leads to the "curse of dimensionality", and too many dimensions do not necessarily improve accuracy much. In the specific implementation, therefore, the 1000 words with the largest TF-IDF values were selected as the dimensions for training and testing the support vector machine algorithm.
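The selection of the 1000 highest-scoring words can be sketched as below. The text does not say how a word's TF-IDF value is aggregated over the whole sample set, so this sketch assumes the maximum value the word reaches in any sample; the function name top_k_words is hypothetical.

```python
import heapq
from collections import defaultdict


def top_k_words(sample_vectors: list[dict[str, float]], k: int = 1000) -> list[str]:
    """Return the k words whose best TF-IDF value over all samples is largest."""
    best: dict[str, float] = defaultdict(float)
    for vector in sample_vectors:
        for word, value in vector.items():
            best[word] = max(best[word], value)
    # nlargest iterates over the dictionary keys and ranks them by their score.
    return heapq.nlargest(k, best, key=best.get)


vectors = [{"a": 0.9, "b": 0.1}, {"b": 0.5, "c": 0.3}]
print(top_k_words(vectors, k=2))
```

The returned list fixes the order of the feature dimensions, so the same list would have to be reused when new access data are converted for identification.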

The data are trained on and predicted by the support vector machine (SVM) algorithm. A support vector machine is a classification algorithm that improves generalization by seeking minimum structural risk, minimizing both the empirical risk and the confidence interval, so that good statistical regularities can be obtained even when the number of samples is small. As shown in Fig. 2, it is a binary classification model whose basic form is the linear classifier with the largest margin in feature space; that is, the learning strategy of the support vector machine is margin maximization. The classification model trained by the SVM algorithm on the feature values of the blacklist and whitelist sample data can distinguish blacklist data from whitelist data.

The support vector machine was selected as the classification algorithm for the following reasons:

1. It is based on structural risk minimization, which avoids overfitting and gives strong generalization ability.

2. The support vector machine is a small-sample learning method with a solid theoretical foundation. It essentially does not involve probability measures or the law of large numbers. In essence, it avoids the traditional process of going from induction to deduction and realizes efficient "transductive inference" from training samples to prediction samples, which greatly simplifies common classification and regression problems.

3. The final decision function of the support vector machine is determined by only a small number of support vectors; the computational complexity depends on the number of support vectors rather than on the dimensionality of the sample space, which in a sense avoids the "curse of dimensionality".

4. A small number of support vectors determine the final result, which helps to capture the key samples and "reject" a large number of redundant samples, and means that the method is algorithmically simple and has good robustness.
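The margin-maximizing idea described above can be illustrated with a small linear soft-margin SVM trained by sub-gradient descent on the regularized hinge loss. The text does not specify the kernel, solver or library used, so everything here is an assumption made for illustration: the function names, the hyperparameters lam, lr and epochs, and the toy two-dimensional data, where label +1 stands for blacklist and -1 for whitelist.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """X: list of equal-length feature lists; y: labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularizer always shrinks w (this is what widens the margin).
            w = [wj * (1.0 - lr * lam) for wj in w]
            # The hinge term acts only on samples inside the margin or misclassified.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(model, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, -1, -1]
model = train_linear_svm(X, y)
print([predict(model, x) for x in X])
```

Only samples with a margin below 1 ever change the weights, which mirrors reason 4 above: samples far from the boundary do not influence the final classifier. A production system would more likely rely on an established SVM library than on a hand-written loop like this one.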

(2) Data identification

After the classification model has been established, newly received Web access data can be predicted to judge whether they are blacklist data.

Step 4: decode the network address of the received access data into the encoding format used in Step 1, and uniformly replace the Chinese characters and digits in the decoded network address with the character used in Step 1;

In this embodiment, 140,000 real data items were selected for training and testing, of which 80% were randomly selected as training data and 20% as test data; this cross-validation was carried out 10 times and the mean prediction accuracy was taken. Four algorithm combinations were tested in the experiment: N-gram + SVM, TF-IDF + SVM, TF-IDF + KNN, and TF-IDF + Logistic Regression. As shown in Table 1, the experiment shows that the TF-IDF + SVM model has the highest accuracy, 99.89%; the model also has very high prediction accuracy for SQL injection and XSS attacks, and is capable of recognizing variant attacks, recognizing new attack patterns, and performing semantic analysis.
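The evaluation procedure described above, a random 80/20 split repeated 10 times with the accuracies averaged, can be sketched as follows. The function name and its parameters are hypothetical, train_fn and predict_fn stand for whichever classifier is being evaluated, and the threshold classifier in the demo is a toy stand-in that has nothing to do with the experiment reported in the text.

```python
import random


def repeated_holdout_accuracy(samples, labels, train_fn, predict_fn,
                              rounds=10, train_ratio=0.8, seed=0):
    """Mean test accuracy over `rounds` random train/test splits."""
    rng = random.Random(seed)           # fixed seed keeps the splits reproducible
    indices = list(range(len(samples)))
    cut = int(len(indices) * train_ratio)
    scores = []
    for _ in range(rounds):
        rng.shuffle(indices)
        train_idx, test_idx = indices[:cut], indices[cut:]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        hits = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Toy demo: numbers labelled by a fixed threshold, with a rule that knows it.
data = list(range(10))
truth = [1 if x >= 5 else -1 for x in data]
accuracy = repeated_holdout_accuracy(
    data, truth,
    train_fn=lambda xs, ys: None,
    predict_fn=lambda model, x: 1 if x >= 5 else -1,
)
print(accuracy)
```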

Table 1

Embodiment two:

The present invention also provides a Web malicious attack identification terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of Embodiment One of the present invention.

Further, as a feasible scheme, the Web malicious attack identification terminal device may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The Web malicious attack identification terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above composition is merely an example of the Web malicious attack identification terminal device and does not limit it; the device may include more or fewer components than described, or combine certain components, or use different components. For example, the Web malicious attack identification terminal device may also include input/output devices, network access devices, buses and the like, which the embodiments of the present invention do not limit.

Further, as a feasible scheme, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the Web malicious attack identification terminal device and connects the various parts of the entire device through various interfaces and lines.

The memory may be used to store the computer program and/or modules. The processor implements the various functions of the Web malicious attack identification terminal device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created through use of the device. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

The present invention also provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above method of the embodiments of the present invention.

If the integrated modules/units of the Web malicious attack identification terminal device are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. On this understanding, all or part of the processes of the methods of the above embodiments may also be completed by a computer program instructing the relevant hardware; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), a software distribution medium, and so on.

Although the present invention has been specifically shown and described in connection with preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made to the present invention without departing from the spirit and scope of the invention as defined by the appended claims, and that such changes fall within the scope of protection of the present invention.

Claims

1. A Web malicious attack identification method, characterized by comprising the following steps:

(1) Establishing the classification model

Step 1: collect blacklist and whitelist sample data from a large volume of Web access data; uniformly decode the network addresses in the sample data into a single encoding format; then apply character processing to the decoded network addresses to eliminate the influence of meaningless characters and to unify the format;

Step 2: extract features from the sample data processed in Step 1 using the TF-IDF algorithm and calculate the feature value of each sample;

Step 3: train with the support vector machine algorithm on the feature values of the blacklist and whitelist sample data, and obtain and save the trained classification model, which is used to distinguish blacklist data from whitelist data;

(2) Data identification

Step 4: decode the network address of the received access data into the encoding format used in Step 1, and apply character processing to the decoded network address;

Step 5: extract features from the data processed in Step 4 using the TF-IDF algorithm and calculate the feature value of the data;

Step 6: according to the feature value of the data, identify the access data with the classification model and judge whether it belongs to the blacklist data.

2. The Web malicious attack identification method according to claim 1, characterized in that the character processing is: uniformly converting all letters to uppercase or to lowercase, and uniformly replacing all Chinese characters and digits with a specific character, the specific character being different from every character in the network address other than Chinese characters and digits.

3. The Web malicious attack identification method according to claim 1, characterized in that the feature value is calculated as follows:

(1) Set the word length to s, and split the data in sequence into words of length s;

(2) Calculate the term frequency TF of each word: TF = 1 + ln(N), where N is the number of times the word occurs in the data;

(3) Calculate the inverse document frequency IDF of each word: IDF = 1 + ln(p/q), where p is the total number of data items and q is the number of data items containing the word;

(4) Calculate the feature value TF-IDF of the data:

4. The Web malicious attack identification method according to claim 3, characterized in that the word length is s = 3.

5. The Web malicious attack identification method according to claim 1, characterized in that in Step 3 the 1000 sample data with the largest TF-IDF values from Step 2 are selected as the training data for the support vector machine algorithm.

6. A Web malicious attack identification terminal device, characterized by comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 5.

7. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.