CN104735074A

CN104735074A - Malicious URL detection method and implement system thereof

Info

Publication number: CN104735074A
Application number: CN201510149110.9A
Authority: CN
Inventors: 汪德嘉; 叶芸; 胡振中; 葛彦霆; 刘伟
Original assignee: JIANGSU PAYEGIS INFORMATION TECHNOLOGY Co Ltd
Current assignee: JIANGSU PAYEGIS INFORMATION TECHNOLOGY Co Ltd
Priority date: 2015-03-31
Filing date: 2015-03-31
Publication date: 2015-06-24

Abstract

The invention discloses a malicious URL detection method and system. The method comprises: splitting a URL to be detected into character strings according to the URL syntax and semantic structure defined in the RFC 1738 standard; analyzing, completing and modifying the resulting character strings; matching the new URL formed from the processed strings against a URL knowledge base; judging by rules whether the new URL contains malicious features or is a short URL and, if it is a short URL, restoring it to the corresponding long URL; and finally extracting features from the URL knowledge base, training a model with a classification algorithm through machine learning, and predicting whether the URL is malicious. The method and system cope with the flexibility and variability of URL forms, can recognize newly emerging malicious websites, effectively resist the harm caused by malicious URLs, and substantially improve the security of user information.

Description

A malicious URL detection method and implementation system

Technical field

The present invention relates to computer information security authentication technology realized by combining computer networks with machine learning algorithms. It is applicable to systems and fields that require authentication, such as information verification and financial transactions performed on terminals, and specifically relates to a malicious URL detection method and implementation system.

Background art

According to the RFC 1738 specification, the syntax of a URL (Uniform Resource Locator) is generally expressed as "<scheme>:<scheme-specific-part>". A URL thus contains a scheme name (<scheme>) and a scheme-specific part (<scheme-specific-part>), whose form is determined entirely by the scheme in use. The scheme is usually HTTP, and HTTP is also assumed by default when the scheme is omitted. The corresponding scheme-specific part then has the form "//<user>:<password>@<host>:<port>/<url-path>?<searchpart>", in which "<user>:<password>@", ":<password>", ":<port>", "/<url-path>?<searchpart>" and "?<searchpart>" may each be omitted. "<searchpart>" is the query string and can be ignored when detecting whether a URL is malicious; that is, the new URL obtained by removing "<searchpart>" and the preceding "?" is, for the purpose of malicious detection, equivalent to the URL to be detected.
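The component layout described above can be illustrated with a small splitter. This is only a sketch: the regular expression `URL_PATTERN` and the function `split_url` are hypothetical names, and the pattern covers just the HTTP-style form quoted in the text rather than the full RFC 1738 grammar.

```python
import re

# Hypothetical pattern for the form
# "<scheme>://<user>:<password>@<host>:<port>/<url-path>?<searchpart>".
# Every part except <host> is optional, mirroring the omissions listed in the text.
URL_PATTERN = re.compile(
    r"^(?:(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://)?"
    r"(?:(?P<user>[^:@/?#]*)(?::(?P<password>[^@/?#]*))?@)?"
    r"(?P<host>[^:/?#]*)"
    r"(?::(?P<port>\d+))?"
    r"(?P<path>/[^?#]*)?"
    r"(?:\?(?P<searchpart>[^#]*))?"
    r"(?:#(?P<fragment>.*))?$"
)


def split_url(url):
    """Split a URL into named character strings; absent parts are None."""
    match = URL_PATTERN.match(url)
    if match is None:
        raise ValueError(f"cannot split URL: {url!r}")
    return match.groupdict()


parts = split_url("http://user:pw@example.com:8080/a/b?q=1")
print(parts["host"], parts["path"], parts["searchpart"])
```

A bare string such as "abcdefg" is treated as a host with no scheme, which is the case the default HTTP scheme is meant to cover.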

With the rapid development of microblogging, short URL services have become increasingly popular. A short URL, as the name suggests, is a web address that is short in form. A short URL service shortens a long URL so that links can be shared conveniently on social networks and microblogs. Because microblog posts generally have a character limit, a long URL takes up space needed for the text; a short URL service solves this problem by replacing the lengthy address with a brief one. However, this also introduces a security risk: because short URLs are generated by a compression algorithm, malicious short URLs are better concealed, which makes URL detection more difficult. Detecting malicious short URLs is nevertheless urgent: according to the Symantec Internet Security Threat Report 16, nearly two thirds of malicious links in 2010 used short links, amounting to millions worldwide.

Malicious URL detection methods currently used in the industry mainly include: static string matching, in which malicious URLs are stored in a file and the URL to be detected is matched against them; hash comparison, in which the hash values of malicious URLs are extracted and stored, and the hash value of the URL to be detected is computed and compared; and content-based detection, in which the content of the message containing the URL and the page resources the URL points to are extracted and checked for malicious content. These methods handle most malicious URL detection problems. Unfortunately, terminal devices still cannot fully avoid malicious intrusions that exploit the flexibility and variability of URLs, which constantly threaten the system security of user terminals and, more importantly, the security of users' property.
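The first two prior-art approaches can be sketched in a few lines, which also makes their weakness visible. The blacklist entry and the function names below are hypothetical, and SHA-256 is an assumed choice of hash; the text does not name one.

```python
import hashlib

# Hypothetical blacklist; in practice it would be loaded from a file.
MALICIOUS_URLS = {"http://bad.example/login"}
MALICIOUS_HASHES = {
    hashlib.sha256(u.encode("utf-8")).hexdigest() for u in MALICIOUS_URLS
}


def matches_static_list(url):
    """Static string matching: exact comparison with stored malicious URLs."""
    return url in MALICIOUS_URLS


def matches_hash_list(url):
    """Hash comparison: compare the URL's hash with stored hashes."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest() in MALICIOUS_HASHES


print(matches_static_list("http://bad.example/login"))      # exact form is caught
print(matches_static_list("http://bad.example/login?x=1"))  # trivial variant is not
```

Both checks depend on the exact spelling of the URL, so adding a query string or dropping the scheme is enough to evade them.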

Summary of the invention

In view of the pressing need left unmet by the prior art, the present invention proposes a malicious URL detection method and implementation system, in order to provide a URL detection solution that is more flexible, safer and more reliable.

The first object of the present invention is achieved by the following technical solution: a malicious URL detection method, characterized by comprising the steps of: S1, splitting the URL to be detected into character strings according to the RFC 1738 specification, and completing and modifying the character strings to obtain a new URL that is equivalent in substance; S2, matching the new URL obtained in S1 against a URL knowledge base, and outputting the detection result for any URL that is contained in, and directly matched by, the URL knowledge base; S3, applying predefined rules to judge and classify a new URL that cannot be found in the URL knowledge base, comprising steps S31 to S33: S31, outputting the detection result for a URL that the rules judge to contain malicious features; S32, for a URL that the rules judge to be a shortened URL, restoring it to the corresponding long URL and performing step S2; S33, for a URL that the rules cannot judge, extracting feature fields to build a prediction file, using a classifier that is trained offline and continually updated to predict from the prediction file whether the URL is malicious, and outputting the result.

Further, completing and modifying the character strings in step S1 means: for the character strings obtained by splitting the URL to be detected, judging whether a protocol is present and whether a query string is included; where the protocol is missing, supplying the default HTTP protocol; where a query string is included, removing the query string and the "?" character preceding it, thereby forming a new URL that is equivalent in substance.

Further, the URL knowledge base in step S2 comprises normal URLs, malicious URLs and the top-level domains of normal URLs; if the URL to be detected, or its top-level domain, is present in the URL knowledge base, it is matched directly and the URL detection result is output.

Further, the predefined rules in step S3 comprise malicious-feature screening and shortened-URL screening, wherein malicious-feature screening identifies URLs to be detected that contain only English letters or only digits, and shortened-URL screening identifies URLs to be detected that contain a short URL service provider, have only three path levels, and contain only English letters or digits in the third level.

Further, the offline training of the classifier in step S33 is as follows: relevant features of URLs are extracted from the URL knowledge base to build a training file, and a classification algorithm is then used to train, optimize and save the model, wherein the classification algorithm is at least one of decision tree, support vector machine, logistic regression and random forest, or a combination of several of them. The offline training of the classifier is updated regularly or irregularly as the URL knowledge base changes. When malicious detection is performed on a URL that the predefined rules cannot judge, the relevant feature fields of the URL are extracted to build a prediction file, and the saved model is then applied to the prediction file to obtain and output the prediction result.

The second object of the present invention is achieved by the following technical solution: an implementation system for malicious URL detection, characterized in that it is formed by connecting a sorting module, a matching module, a rule identification module and a model prediction module, wherein the sorting module has a receiving end for the URL to be detected and a processing unit that splits the URL to be detected according to the RFC 1738 specification and completes and modifies it to obtain a new URL that is equivalent in substance; the matching module is connected to the sorting module to receive its output and contains the URL knowledge base and a matching processing unit; the rule identification module has the predefined rules and a classification processing unit based on those rules, and outputs its classification and judgment results to the model prediction module, the sorting module or the detection result output, respectively; and the model prediction module has a classifier that is trained offline and continually updated, and a processing unit that uses the classifier to perform malicious detection on the prediction file.

Further, the URL knowledge base in the matching module contains continually updated normal URLs, malicious URLs and top-level domains of normal URLs.

Compared with traditional malicious URL detection methods, applying the above technical solution of the present invention has significant technical effects: it copes with the flexibility and variability of URL forms, can identify newly emerging malicious websites, effectively improves the accuracy of malicious URL detection, resists the harm caused by malicious URLs, and significantly improves the security of user information.

Brief description of the drawings

Fig. 1 is the operational flowchart of the malicious URL detection method of the present invention.

Fig. 2 is the training flowchart of the malicious URL detection model of the present invention.

Fig. 3 is a block diagram of the implementation system for malicious URL detection of the present invention.

Detailed description of the embodiments

In response to the network security needs of rapidly developing fields such as mobile payment, the present invention proposes a malicious URL detection system solution that provides users with a safe and reliable network environment. To set out the objects, features and advantages of the present invention clearly, the invention is further described below with reference to the accompanying drawings. According to the RFC 1738 specification, the normalized form of a URL is generally "<scheme>://<user>:<password>@<host>:<port>/<url-path>?<searchpart>". As analyzed above, its malicious detection result is usually equivalent to that of "<scheme>://<host>/<url-path>".

The technical solution is introduced below from two aspects: the detection method and the implementation system. First, the present invention proposes a more complete and more flexible malicious URL detection method. In outline, its steps are: S1, splitting the URL to be detected into character strings according to the RFC 1738 specification, and completing and modifying the character strings to obtain a new URL that is equivalent in substance; S2, matching the new URL obtained in S1 against a URL knowledge base, and outputting the detection result for any URL that is contained in, and directly matched by, the URL knowledge base; S3, applying predefined rules to judge and classify a new URL that cannot be found in the URL knowledge base, comprising steps S31 to S33: S31, outputting the detection result for a URL that the rules judge to contain malicious features; S32, for a URL that the rules judge to be a shortened URL, restoring it to the corresponding long URL and performing step S2; S33, for a URL that the rules cannot judge, extracting feature fields to build a prediction file, using a classifier that is trained offline and continually updated to predict from the prediction file whether the URL is malicious, and outputting the result.
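The control flow of steps S1 to S33 can be sketched as a single function that receives each stage as a callable. All names here (`detect`, `normalize`, `lookup`, `is_malicious_by_rule`, `is_short`, `expand`, `predict`) are hypothetical, and the `max_hops` limit on repeated short-URL restoration is an added assumption, not something the text specifies. After restoring a short URL the sketch re-runs S1 before S2, as the embodiments do.

```python
def detect(url, normalize, lookup, is_malicious_by_rule, is_short, expand,
           predict, max_hops=5):
    """Run the S1 -> S2 -> S3 flow and return a verdict string."""
    for _ in range(max_hops):
        new_url = normalize(url)           # S1: complete and modify the URL
        verdict = lookup(new_url)          # S2: knowledge-base match
        if verdict is not None:
            return verdict
        if is_malicious_by_rule(new_url):  # S31: rule finds a malicious feature
            return "malicious"
        if is_short(new_url):              # S32: restore the long URL, then retry
            url = expand(new_url)
            continue
        return predict(new_url)            # S33: fall back to the classifier
    # Assumption: after too many redirects, let the classifier decide.
    return predict(normalize(url))


# Usage with stand-in stages.
knowledge_base = {"http://long.example/page": "normal"}
result = detect(
    "http://s.example/abc",
    normalize=lambda u: u.split("?", 1)[0],
    lookup=knowledge_base.get,
    is_malicious_by_rule=lambda u: False,
    is_short=lambda u: u.startswith("http://s.example/"),
    expand=lambda u: "http://long.example/page?x=1",
    predict=lambda u: "malicious",
)
print(result)
```

Passing the stages in as callables keeps the flow independent of how the knowledge base, the rules and the model are actually implemented.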

In more detail, completing and modifying the character strings in step S1 means: for the character strings obtained by splitting the URL to be detected according to the RFC specification, judging whether the protocol "<scheme>" is present and whether the query string "<searchpart>" is included; where the protocol is missing, supplying the default HTTP protocol; where a query string is included, removing the query string and the "?" character preceding it, thereby forming a new URL that is equivalent in substance.
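A minimal sketch of this completion and modification step is given below. The function name `normalize_url` and the `default_scheme` parameter are hypothetical; the sketch simply cuts the URL at the first "?" and does not handle other cases the text leaves open, such as a fragment without a query string.

```python
import re

# A URL is considered to carry a scheme only if it starts with "<scheme>://".
SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def normalize_url(url, default_scheme="http"):
    """Supply the default scheme if missing and drop '?<searchpart>'."""
    url = url.strip()
    if not SCHEME_PREFIX.match(url):
        url = f"{default_scheme}://{url}"
    # Everything from the first "?" onwards is the query string (and anything
    # after it), which is ignored for malicious detection.
    return url.split("?", 1)[0]


print(normalize_url("abcdefg"))
print(normalize_url("http://search.jd.com/search?keyword=x&enc=utf-8#select"))
```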

The URL knowledge base in step S2 comprises the normal URLs, malicious URLs, top-level domains of normal URLs and so on that have been determined so far, and is continually updated. If the URL to be detected, or its top-level domain, is present in the URL knowledge base, it is matched directly and the URL detection result is output directly.
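The lookup can be sketched with three in-memory sets. The entries, the names `lookup` and `site_domain`, and in particular the reading of "top-level domain" as the last two labels of the host name (for example "jd.com") are assumptions made for illustration; the text does not define the term more precisely, and the two-label rule is a simplification that fails for suffixes such as "co.uk".

```python
from urllib.parse import urlsplit

# Hypothetical knowledge-base contents.
NORMAL_URLS = {"http://search.jd.com/search"}
MALICIOUS_URLS = {"http://evil.example/login"}
NORMAL_DOMAINS = {"jd.com"}


def site_domain(url):
    """Return the last two host labels, e.g. 'search.jd.com' -> 'jd.com'."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def lookup(url):
    """Return 'malicious', 'normal', or None when there is no direct match."""
    if url in MALICIOUS_URLS:
        return "malicious"
    if url in NORMAL_URLS or site_domain(url) in NORMAL_DOMAINS:
        return "normal"
    return None


print(lookup("http://item.jd.com/1.html"))
print(lookup("http://shop.ldangdang.com/14416"))
```

A `None` result corresponds to the case in which the URL is passed on to the predefined rules of step S3.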

The predefined rules in step S3 comprise malicious-feature screening and shortened-URL screening, wherein malicious-feature screening identifies URLs to be detected that contain only English letters or only digits, and shortened-URL screening identifies URLs to be detected that contain a short URL service provider, have only three path levels, and contain only English letters or digits in the third level. Specifically, the rules check whether the URL contains a definite malicious feature and whether it has the features of a short URL. A definite malicious feature is, for example, that the URL contains only English characters or only digits. The short URL features are, for example, that the URL contains a short URL service provider, such as "is.gd", "bit.ly", "j.mp", "dwz.cn", "t.cn", "sina.lt", "suo.im", "taourl.com", "tao.bb", "955.cc" or "baid.ws", that the URL has only three path levels, and that the third level contains only letters or digits. If the URL is detected as a short URL, the "Location" field of its redirect page is captured as the corresponding long URL.
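The two screening rules can be sketched as follows. The function names are hypothetical, and two interpretations are assumed: the "only letters or only digits" test is applied to the part of the URL after "<scheme>://", as in embodiment one, and the "three path levels" are taken to be the non-empty pieces obtained by splitting on "/", that is, scheme, host and one path segment.

```python
import re

# Providers listed in the text.
SHORT_URL_PROVIDERS = (
    "is.gd", "bit.ly", "j.mp", "dwz.cn", "t.cn", "sina.lt",
    "suo.im", "taourl.com", "tao.bb", "955.cc", "baid.ws",
)


def has_malicious_feature(url):
    """True if the scheme-specific part is only letters or only digits."""
    rest = url.split("://", 1)[-1]
    return re.fullmatch(r"[A-Za-z]+|[0-9]+", rest) is not None


def is_short_url(url):
    """True if the host is a listed provider and the third level is alphanumeric."""
    layers = [piece for piece in url.split("/") if piece]
    if len(layers) != 3:
        return False
    host, token = layers[1].lower(), layers[2]
    from_provider = any(
        host == provider or host.endswith("." + provider)
        for provider in SHORT_URL_PROVIDERS
    )
    return from_provider and re.fullmatch(r"[A-Za-z0-9]+", token) is not None


print(has_malicious_feature("http://abcdefg"))
print(is_short_url("http://www.dwz.cn/t05ZQ"))
```

Matching the provider against the end of the host name, rather than anywhere in the URL, avoids treating a host such as "at.cn" as the provider "t.cn".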

Further, the offline training of the classifier in step S33 is as follows: relevant features of URLs are extracted from the URL knowledge base to build a training file, and a classification algorithm is then used to train, optimize and save the model. The specific process is: feature fields are first extracted from the URL knowledge base; the feature fields currently adopted are listed in Table 1, which gives the name, type, meaning and source of each field.

All extracted features are built into a training file, and several classification algorithms, such as decision tree, support vector machine, logistic regression and random forest, are then used for training and optimization. The decision tree algorithm, which gave the best classification performance, is finally selected, and the decision tree model is saved. The model training flow is shown in Fig. 2. The offline training of the classifier is updated regularly or irregularly as the URL knowledge base changes.
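Building a training file from labelled URLs might look like the sketch below. The feature fields used here (`url_length`, `host_label_count`, `digit_count`, `path_depth`) are purely illustrative stand-ins, since the fields of Table 1 are not reproduced in this text; the CSV layout and the function names are likewise assumptions.

```python
import csv
import io
from urllib.parse import urlsplit

# Illustrative columns only; not the feature fields of Table 1.
FIELDS = ["url_length", "host_label_count", "digit_count", "path_depth", "label"]


def extract_features(url):
    """Compute a few simple lexical features of a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_label_count": len(host.split(".")) if host else 0,
        "digit_count": sum(ch.isdigit() for ch in url),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
    }


def build_training_file(labelled_urls):
    """Return CSV text with one feature row per (url, label) pair."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for url, label in labelled_urls:
        writer.writerow({**extract_features(url), "label": label})
    return buffer.getvalue()


training_csv = build_training_file([
    ("http://search.jd.com/search", "normal"),
    ("http://shop.ldangdang.com/14416", "malicious"),
])
print(training_csv)
```

The same `extract_features` function, without the label column, could produce the prediction file for a URL that the rules cannot judge.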

When malicious detection is performed on a URL that the predefined rules cannot judge, the relevant feature fields of the URL are extracted to build a prediction file, and the saved model is then applied to the prediction file to obtain and output the prediction result. Here the classifier is trained offline and only used online, which effectively guards the model against tampering and attack and further ensures the accuracy of malicious URL detection.
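The split between offline training and online use can be illustrated with a deliberately tiny model. The sketch below trains a decision stump (a one-level decision tree) in pure Python as a stand-in for the decision tree model, saves it as JSON, and then loads it for prediction. The feature rows, the label encoding (1 for malicious, 0 for normal) and all function names are assumptions for illustration only.

```python
import json


def train_stump(rows, labels):
    """Pick the single (feature, threshold) split with the fewest training errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            for above in (0, 1):
                preds = [above if row[feature] > threshold else 1 - above
                         for row in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, {"feature": feature,
                                     "threshold": threshold,
                                     "above": above})
    return best[1]


def predict(model, row):
    """Apply a saved stump to one feature row (1 = malicious, 0 = normal)."""
    above = model["above"]
    return above if row[model["feature"]] > model["threshold"] else 1 - above


# Offline: train on hypothetical feature rows and save the model.
rows = [[12, 0], [15, 1], [48, 6], [60, 9]]
labels = [0, 0, 1, 1]
saved_model = json.dumps(train_stump(rows, labels))

# Online: load the saved model and only predict; no training happens here.
model = json.loads(saved_model)
print(predict(model, [55, 7]), predict(model, [10, 0]))
```

Keeping the online side limited to loading and applying a saved model means that nothing submitted for detection can alter the model itself.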

Fig. 3 shows the implementation system for malicious URL detection provided by the present invention. The system is formed by connecting a sorting module, a matching module, a rule identification module and a model prediction module, wherein the sorting module has a receiving end for the URL to be detected and a processing unit that splits the URL to be detected according to the RFC 1738 specification and completes and modifies it to obtain a new URL that is equivalent in substance; the matching module is connected to the sorting module to receive its output and contains the URL knowledge base and a matching processing unit; the rule identification module has the predefined rules and a classification processing unit based on those rules, and outputs its classification and judgment results to the model prediction module, the sorting module or the detection result output, respectively; and the model prediction module has a classifier that is trained offline and continually updated, and a processing unit that uses the classifier to perform malicious detection on the prediction file.

The URL knowledge base in the matching module contains continually updated normal URLs, malicious URLs and top-level domains of normal URLs.

To further explain how the detection solution of the present invention is implemented, several specific embodiments are described below.

Embodiment one. Suppose the URL to be detected is "abcdefg"; the judging steps are as follows:

(1) First, step S1 is performed; the URL after completion and modification is "http://abcdefg";

(2) Then, step S2 is performed; the URL is found not to be in the URL knowledge base;

(3) Step S3 is then entered; the rules find that the URL contains a malicious feature (S31 is performed): the scheme-specific part of the URL contains only English characters, so the URL to be detected is finally judged to be malicious. The same applies when the scheme-specific part of the URL contains only digits.

Embodiment two. Suppose the URL to be detected is "http://www.dwz.cn/t05ZQ"; the judging steps are as follows:

(1) First, step S1 is performed; the URL after completion and modification is still "http://www.dwz.cn/t05ZQ";

(2) Then, step S2 is performed; the URL is not in the URL knowledge base;

(3) Step S3 is then entered; the rules find that the URL has short URL features (S32 is performed), and it is restored to the long URL "http://search.jd.com/search?keyword=%E5%8E%9F%E5%88%9B&enc=utf-8&qr=&qrst=UNEXPAND&et=&as=1&rt=1&stop=1&vt=2&sttr=1&cid2=1343&ev=exprice_199-599%40&uc=0&lastprice=200-299#select";

(4) Step S1 is performed on the long URL; after completion and modification it becomes "http://search.jd.com/search";

(5) Step S2 is performed on the modified URL; the URL is present in the URL knowledge base and is marked as normal, so the URL to be detected is finally judged to be normal.
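The restoration in step (3) of embodiment two, which captures the "Location" field of the redirect response, can be sketched as below. The names `restore_long_url` and `fetch_headers_http` are hypothetical. The header fetcher is passed in as a parameter so that the logic can be exercised without network access; the HTTP helper uses only the standard `http.client` module and, as a simplification, handles plain HTTP only.

```python
import http.client
from urllib.parse import urlsplit


def fetch_headers_http(url, timeout=5.0):
    """Send a HEAD request over plain HTTP and return the response headers."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return dict(conn.getresponse().getheaders())
    finally:
        conn.close()


def restore_long_url(short_url, fetch_headers=fetch_headers_http):
    """Return the redirect target of a short URL, or None if there is none."""
    headers = {name.lower(): value
               for name, value in fetch_headers(short_url).items()}
    return headers.get("location")


# Usage with a stand-in fetcher, so that no network request is made.
fake_headers = lambda url: {"Location": "http://search.jd.com/search?keyword=x"}
print(restore_long_url("http://www.dwz.cn/t05ZQ", fetch_headers=fake_headers))
```

The restored URL would then go through step S1 again, which strips the query string before the knowledge-base lookup of step S2.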

Embodiment three. Suppose the URL to be detected is "http://shop.ldangdang.com/14416"; the judging steps are as follows:

(1) First, step S1 is performed; the URL after completion and modification is still "http://shop.ldangdang.com/14416";

(3) Step S3 is then entered; the rules find that the URL has neither malicious URL features nor short URL features;

(4) Finally, step S33 is entered and the model prediction module is used for prediction; the prediction result is malicious, so the URL to be detected is finally judged to be malicious. The model prediction module used here has a classifier that is trained offline and continually updated, and a processing unit that uses the classifier to perform malicious detection on the prediction file.

In summary, compared with traditional malicious URL detection methods, the technical solution of the malicious URL detection method and implementation system of the present invention has significant technical effects: it copes with the flexibility and variability of URL forms, can identify newly emerging malicious websites, effectively improves the accuracy of malicious URL detection, resists the harm caused by malicious URLs, and significantly improves the security of user information.

Claims

1. A malicious URL detection method, characterized by comprising the steps of: S1, splitting the URL to be detected into character strings according to the RFC 1738 specification, and completing and modifying the character strings to obtain a new URL that is equivalent in substance; S2, matching the new URL obtained in S1 against a URL knowledge base, and outputting the detection result for any URL that is contained in, and directly matched by, the URL knowledge base; S3, applying predefined rules to judge and classify a new URL that cannot be found in the URL knowledge base, comprising steps S31 to S33;

S31, outputting the detection result for a URL that the rules judge to contain malicious features;

S32, for a URL that the rules judge to be a shortened URL, restoring it to the corresponding long URL and performing step S2;

S33, for a URL that the rules cannot judge, extracting feature fields to build a prediction file, using a classifier that is trained offline and continually updated to predict from the prediction file whether the URL is malicious, and outputting the result.

2. The malicious URL detection method according to claim 1, characterized in that completing and modifying the character strings in step S1 means: for the character strings obtained by splitting the URL to be detected, judging whether a protocol is present and whether a query string is included; where the protocol is missing, supplying the default HTTP protocol; where a query string is included, removing the query string and the "?" character preceding it, thereby forming a new URL that is equivalent in substance.

3. The malicious URL detection method according to claim 1, characterized in that the URL knowledge base in step S2 comprises normal URLs, malicious URLs and the top-level domains of normal URLs; if the URL to be detected, or its top-level domain, is present in the URL knowledge base, it is matched directly and the URL detection result is output.

4. The malicious URL detection method according to claim 1, characterized in that the predefined rules in step S3 comprise malicious-feature screening and shortened-URL screening, wherein malicious-feature screening identifies URLs to be detected that contain only English letters or only digits, and shortened-URL screening identifies URLs to be detected that contain a short URL service provider, have only three path levels, and contain only English letters or digits in the third level.

5. The malicious URL detection method according to claim 1, characterized in that the offline training of the classifier in step S33 is as follows: relevant features of URLs are extracted from the URL knowledge base to build a training file, and a classification algorithm is then used to train, optimize and save the model, wherein the classification algorithm is at least one of decision tree, support vector machine, logistic regression and random forest, or a combination of several of them; the offline training of the classifier is updated regularly or irregularly as the URL knowledge base changes; when malicious detection is performed on a URL that the predefined rules cannot judge, the relevant feature fields of the URL are extracted to build a prediction file, and the saved model is then applied to the prediction file to obtain and output the prediction result.

6. An implementation system for malicious URL detection, characterized in that it is formed by connecting a sorting module, a matching module, a rule identification module and a model prediction module, wherein the sorting module has a receiving end for the URL to be detected and a processing unit that splits the URL to be detected according to the RFC 1738 specification and completes and modifies it to obtain a new URL that is equivalent in substance; the matching module is connected to the sorting module to receive its output and contains the URL knowledge base and a matching processing unit; the rule identification module has the predefined rules and a classification processing unit based on those rules, and outputs its classification and judgment results to the model prediction module, the sorting module or the detection result output, respectively; and the model prediction module has a classifier that is trained offline and continually updated, and a processing unit that uses the classifier to perform malicious detection on the prediction file.

7. The implementation system for malicious URL detection according to claim 6, characterized in that the URL knowledge base in the matching module contains continually updated normal URLs, malicious URLs and top-level domains of normal URLs.