CN109948339A

CN109948339A - Malicious script detection method based on machine learning

Info

Publication number: CN109948339A
Application number: CN201910210330.6A
Authority: CN
Inventors: 孙波; 李应博; 张伟; 司成祥; 张建松; 李胜男; 毛蔚轩; 盖伟麟; 房婧; 王亿芳; 胡晓旭; 王梦禹
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2019-03-20
Filing date: 2019-03-20
Publication date: 2019-06-28

Abstract

The present invention provides a malicious script detection method based on machine learning. The method comprises the following steps: S1. build a simulated network environment and collect sample data of Webshell scripts; S2. preprocess the collected sample data, then analyze and extract the traffic features in the sample data; S3. build a network intrusion detection model based on the traffic features; S4. deploy the network intrusion detection model on the server side and feed it the server's traffic data to detect network intrusion behavior on the server side; S5. display the detection results in real time in the system interface and record them in the detection log.

Description

A malicious script detection method based on machine learning

Technical field

The present invention relates to the field of information security technology, and in particular to a malicious script detection method.

Background technique

In the Internet+ era, server security faces threats from network intrusions. A Webshell is a typical malicious script used by attackers; its purpose is to escalate and maintain persistent access to a web application that has already been compromised. A Webshell cannot itself attack or exploit a remote vulnerability, so it is always the second step of an attack. Attackers can exploit common vulnerabilities such as SQL injection, remote file inclusion (RFI), and FTP, or even use cross-site scripting (XSS) as part of the attack, in order to upload the malicious script. Its common functions include, but are not limited to, shell command execution, code execution, database enumeration, and file management. How to detect Webshell intrusions effectively has therefore become a problem in the field.

Summary of the invention

The main object of the present invention is to propose a malicious script detection method based on machine learning, aiming to solve the problem of how to detect network intrusion events on the server side automatically and efficiently.

To achieve the above object, the present invention provides a malicious script detection method based on machine learning, the main steps of which include:

S1. build a simulated network environment and collect sample data of Webshell scripts;

S2. preprocess the collected sample data, then analyze and extract the traffic features in the sample data;

S3. build a network intrusion detection model based on the traffic features;

S4. deploy the network intrusion detection model on the server side and feed it the server's traffic data to detect network intrusion behavior on the server side;

S5. display the detection results in real time in the system interface and record them in the detection log.

Preferably, step S1 further includes: when building the simulated network environment, writing automated scripts on multiple servers according to the Webshell type and the attack behavior, and collecting Webshell traffic with a network sniffing tool.

Preferably, the traffic features in step S2 are one or more of multiple features such as keywords, the number of levels in the web page path structure, the number of cookie key-value pairs, the structural similarity of the returned web page, POST/GET entropy, and cookie key-value pair entropy.

Preferably, the network intrusion detection model in step S3 is trained with one of several machine learning algorithms such as AdaBoost, SVM, random forest, and logistic regression.

The malicious script detection method based on machine learning proposed by the present invention builds a network intrusion detection model through data analysis and feature extraction on Webshell traffic, so that network intrusion traffic from Webshells can be detected effectively, thereby improving the user's network security.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Specific embodiment

The malicious script detection method based on machine learning provided by the present invention is carried out mainly as follows:

Webshell samples are scarce in real environments: among tens of thousands of HTTP flows it is hard to find even one generated by a Webshell. Obtaining high-quality samples in sufficient quantity is therefore a challenge for machine learning. To solve this sample scarcity problem, we built a dedicated simulated environment for Webshell intrusion and wrote automated scripts according to the Webshell type and the attack behavior. When run, the scripts generate a large amount of Webshell traffic, which is collected with network sniffing tools (such as Wireshark or Tcpdump).

Feature analysis begins after expert knowledge for feature engineering has been gathered and the actual historical data has been statistically analyzed.

1. Keyword-based features

A behavioral analysis of the Webshell itself shows that it performs operations on system calls, system configuration, databases, and files. Because of this behavior, its data traffic, most of which carries parameters, contains some distinctive features. In addition, the traffic is decoded before keyword matching is performed.
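The decode-then-match idea can be sketched as follows. This is a minimal illustration, not the patent's implementation: the keyword list `SUSPICIOUS_KEYWORDS`, the helper names, and the choice of URL decoding plus Base64 decoding are all assumptions, since the source does not enumerate its keywords or its decoding steps.

```python
import base64
import binascii
import re
from urllib.parse import unquote_plus

# Hypothetical keyword list; the source does not enumerate its keywords.
SUSPICIOUS_KEYWORDS = ("eval(", "system(", "exec(", "base64_decode", "cmd.exe", "/etc/passwd")


def decode_payload(raw: str) -> str:
    """URL-decode the payload and append any Base64 runs that decode to text."""
    text = unquote_plus(raw)
    decoded_parts = [text]
    # Runs of 16+ Base64 characters are treated as candidate encoded blobs.
    for blob in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", text):
        try:
            decoded = base64.b64decode(blob, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError, ValueError):
            continue  # not valid Base64 text; ignore this run
        decoded_parts.append(decoded)
    return "\n".join(decoded_parts).lower()


def keyword_hits(raw: str) -> int:
    """Count how many distinct suspicious keywords occur in the decoded payload."""
    text = decode_payload(raw)
    return sum(1 for kw in SUSPICIOUS_KEYWORDS if kw in text)
```

For example, `keyword_hits("cmd=system%28%27id%27%29")` matches only after URL decoding turns `%28` into `(`, which is why decoding comes first.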

2. Number of GET/POST parameters in the traffic

Observation shows that Webshell GET/POST requests generally carry relatively few parameters, so the parameter count can serve as a feature.
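A minimal sketch of the parameter-count feature, assuming the POST body is form-encoded (`application/x-www-form-urlencoded`); the function name `count_params` is hypothetical and other body formats such as JSON or multipart are not handled here.

```python
from urllib.parse import parse_qsl, urlsplit


def count_params(url: str, body: str = "") -> int:
    """Count GET parameters in the URL plus form-encoded POST parameters in the body."""
    query = urlsplit(url).query
    # keep_blank_values=True so that parameters such as "a=" are still counted.
    get_params = parse_qsl(query, keep_blank_values=True)
    post_params = parse_qsl(body, keep_blank_values=True)
    return len(get_params) + len(post_params)
```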

3. Information entropy of GET/POST data in the traffic

Ordinary requests submit data to the server, and Webshells are no exception. If the submitted data has been encrypted or encoded, however, its entropy increases. For a normal web business system, if the entropy of the data submitted to a certain URI is markedly higher than that of other pages, the source file behind that URI is more suspicious. Webshells that encrypt their communication generally submit data with higher entropy, so they can be detected.
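The source does not state which entropy definition it uses; a common choice is the character-level Shannon entropy, H = -Σ p(c) log2 p(c), where p(c) is the relative frequency of character c in the submitted data. The sketch below assumes that definition, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: str) -> float:
    """Character-level Shannon entropy of the submitted data, in bits per character."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A string with one repeated character has entropy 0, while encrypted or encoded payloads use many characters about equally often and score higher, which is the effect the feature relies on.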

4. Cookie-based feature extraction

HTTP is a stateless protocol, so in normal HTTP access the server does not automatically maintain the client's context information; a session is used to save it. The session is stored on the server side. To reduce the server's storage cost, the server returns a cookie that records the session ID when an HTTP request arrives; the cookie is stored locally by the browser and is carried in the request on the next visit. A cookie mainly contains a name, a value, an expiration time, a path, and a domain; the path and the domain together define the cookie's scope. Observation and analysis show that some cookies produced by Webshells are empty, while others do have a key-value structure but contain very few pairs, with names that carry no real meaning. This feature is therefore extracted to distinguish Webshell access from normal website access.
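The cookie features named in the source (whether the cookie is empty, the number of key-value pairs, and the key-value pair entropy) might be computed from a request's `Cookie` header roughly as follows. The parsing is deliberately simple, and the function names, the dictionary keys, and the use of character-level Shannon entropy over the concatenated names and values are assumptions.

```python
import math
from collections import Counter


def _entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def cookie_features(cookie_header: str) -> dict:
    """Extract simple features from the value of a request's Cookie header."""
    pairs = []
    for part in cookie_header.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        pairs.append((name.strip(), value.strip()))
    joined = "".join(name + value for name, value in pairs)
    return {
        "is_empty": len(pairs) == 0,
        "pair_count": len(pairs),
        "pair_entropy": _entropy(joined),
    }
```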

5. Structural similarity of the returned web page

Many of the pages returned by Webshells are structurally similar, so web page structural similarity can be extracted as a feature for comparison. The design idea is to compare the returned page with the structure of previously collected Webshell-generated pages and to use the resulting structural similarity as a feature.
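The source does not specify how structural similarity is measured. One possible sketch reduces each page to its sequence of opening HTML tags and compares the sequences with `difflib.SequenceMatcher`; the tag-sequence representation, the choice of taking the maximum over known pages, and all names below are assumptions.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records the name of every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list:
    """Reduce a page to its sequence of opening tags, ignoring text and attributes."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structure_similarity(page: str, known_pages: list) -> float:
    """Highest tag-sequence similarity (0.0 to 1.0) between page and known Webshell pages."""
    tags = tag_sequence(page)
    best = 0.0
    for known in known_pages:
        ratio = SequenceMatcher(None, tags, tag_sequence(known)).ratio()
        best = max(best, ratio)
    return best
```

Two pages with the same tag layout but different text or attribute values score 1.0, so the feature reflects structure rather than content.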

6. Number of web page path levels

The web page path of a Webshell tends to be deep: the page is hidden deep in the directory tree so that normal visitors do not easily find it.
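A minimal sketch of the path-depth feature, assuming depth is simply the number of non-empty segments in the URL path; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Count the non-empty segments in the URL path; the query string is ignored."""
    path = urlsplit(url).path
    return sum(1 for segment in path.split("/") if segment)
```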

7. Access time period

Webshell traffic differs from regular traffic in access time: attackers usually choose to access during periods when normal traffic is sparse. Time is therefore used as one feature dimension. The broad time category can be expanded into several sub-features: the period of the day, the day of the week, the week of the year, the season of the year, and whether the day is a working day or a weekend.
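The time sub-features listed above could be derived from a request timestamp as in the sketch below. The dictionary keys are hypothetical, the hour stands in for the period of the day, the calendar quarter stands in for the source's "season", and the working-day test ignores public holidays; all three are assumptions.

```python
from datetime import datetime


def time_features(ts: datetime) -> dict:
    """Expand a request timestamp into coarse calendar features."""
    _, iso_week, iso_weekday = ts.isocalendar()
    return {
        "hour": ts.hour,                     # period of the day
        "weekday": iso_weekday,              # 1 = Monday ... 7 = Sunday
        "week_of_year": iso_week,            # ISO 8601 week number
        "quarter": (ts.month - 1) // 3 + 1,  # assumed stand-in for "season"
        "is_weekend": iso_weekday >= 6,      # Saturday or Sunday; holidays ignored
    }
```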

8. Presence or absence of a Referer

In the traffic, if a page was not reached by a jump from a previous page, the Referer parameter is empty, so this feature is selected as an auxiliary indicator.
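As a sketch, the Referer check reduces to a boolean feature over the request headers. The headers are assumed to be available as a plain dictionary, and the function name is hypothetical.

```python
def has_referer(headers: dict) -> bool:
    """Return True when a non-empty Referer header is present (name is case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "referer" and value.strip():
            return True
    return False
```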

In actual implementation, an algorithm such as AdaBoost, SVM, random forest, or logistic regression can be chosen as needed for model training; the random forest algorithm is generally the default choice.

The embodiments of the present invention have been described above with reference to the accompanying drawing, but the invention is not limited to the specific embodiments described, which are merely illustrative rather than restrictive. Inspired by the present invention, those of ordinary skill in the art can devise many further forms without departing from the purpose of the invention and the scope protected by the claims, all of which fall within the protection of the present invention.
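To keep the example free of third-party dependencies, the sketch below trains logistic regression, one of the algorithms listed above, by plain batch gradient descent rather than the random forest that the text names as the usual default. The feature layout (keyword hits, POST/GET entropy, path depth), the toy values, the learning rate, and the epoch count are all assumptions made for illustration; the toy rows are invented and are not measurements.

```python
import math

# Made-up toy rows: [keyword hits, POST/GET entropy, path depth]. Not real data.
TOY_ROWS = [[0, 3.1, 1], [0, 3.4, 2], [1, 3.0, 1], [3, 5.2, 5], [2, 5.6, 6], [4, 5.0, 4]]
TOY_LABELS = [0, 0, 0, 1, 1, 1]  # 1 = Webshell traffic, 0 = normal traffic


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(rows, labels, lr=0.1, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this sample: predicted probability minus label.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x) -> int:
    """Return 1 (Webshell) when the predicted probability is at least 0.5, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0
```

Calling `train_logistic_regression(TOY_ROWS, TOY_LABELS)` returns a weight list and a bias that can be passed to `predict`. In practice a library implementation of any of the listed algorithms would replace this hand-written loop, and the features would normally be scaled before training.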

Claims

1. A malicious script detection method based on machine learning, the steps of the method comprising:

S4. deploying the network intrusion detection model on the server side and feeding it the server's traffic data to detect network intrusion behavior on the server side;

2. The method according to claim 1, characterized in that step S1 further includes: when building the simulated network environment, writing automated scripts on multiple servers according to the Webshell type and the attack behavior, and collecting Webshell traffic with a network sniffing tool.

3. The method according to claim 1, characterized in that the traffic features in step S2 are one or more of multiple features such as keywords, the number of levels in the web page path structure, the number of cookie key-value pairs, the structural similarity of the returned web page, POST/GET entropy, and cookie key-value pair entropy.

4. The method according to claim 1, characterized in that the network intrusion detection model in step S3 is trained with one of several machine learning algorithms such as AdaBoost, SVM, random forest, and logistic regression.