CN107872452A

CN107872452A - Method, device, storage medium and program product for identifying malicious websites

Info

Publication number: CN107872452A
Application number: CN201711010692.8A
Authority: CN
Inventors: 邹荣珠
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2017-10-25
Filing date: 2017-10-25
Publication date: 2018-04-03

Abstract

This application provides a method, device, storage medium and program product for identifying malicious websites. The method includes: obtaining the URL to be identified of a website to be identified; extracting target features from the URL based on the content of the URL; and identifying, according to the target features, whether the website is a malicious website. The method, device, storage medium and program product make full use of the rich information contained in the URL itself and identify the website from the target features extracted from the URL. This gives a higher recognition rate, and because no web page needs to be loaded, recognition is also more efficient.

Description

Method, device, storage medium and program product for identifying malicious websites

Technical field

The present invention relates to the technical field of website identification, and in particular to a method, device, storage medium and program product for identifying malicious websites.

Background

With the development of the information society, the Internet has reached every aspect of social life. As a result, the security attacks that the Internet faces have become more frequent and more severe. A URL is the entry point through which a user accesses network resources, and the information it contains can be used to detect malicious websites. The prior art includes some methods that identify malicious websites from domain names, but their recognition rate is relatively low, and some of them also need to load the web page, which makes recognition inefficient.

Summary of the invention

In view of this, the present invention provides a method, device, storage medium and program product for identifying malicious websites, to solve the problems that prior-art methods for identifying malicious websites have a relatively low recognition rate and that some of them also need to load the web page, which makes recognition inefficient. The technical solution is as follows:

A method for identifying malicious websites, including:

obtaining the URL to be identified of a website to be identified;

extracting target features from the URL to be identified, based on the content of the URL to be identified;

identifying, according to the target features, whether the website to be identified is a malicious website.

Preferably, after the URL to be identified is obtained, the method further includes:

determining whether the URL to be identified is in a website blacklist or whitelist;

if the URL to be identified is in the website blacklist, determining that the website to be identified is a malicious website;

if the URL to be identified is in the website whitelist, determining that the website to be identified is a normal website;

if the URL to be identified is in neither the website blacklist nor the website whitelist, performing the step of extracting target features from the URL to be identified based on the content of the URL to be identified.

Preferably, after it is determined that the URL to be identified is in neither the website blacklist nor the website whitelist, and before the step of extracting target features from the URL to be identified based on the content of the URL to be identified is performed, the method further includes:

determining, based on the domain name of the URL to be identified and a set of trusted domain names, whether the website to be identified is a malicious website according to a similarity matching algorithm.

Identifying, according to the target features, whether the website to be identified is a malicious website includes:

identifying whether the website to be identified is a malicious website according to text features, statistical features and/or protocol features.

Identifying whether the website to be identified is a malicious website according to text features and/or protocol features includes:

parsing the content of the URL to be identified to obtain the text features and/or protocol features in the URL to be identified;

judging whether the text features and/or the protocol features contain preset characters, the preset characters being characters that can indicate that the URL to be identified is the URL of a malicious website;

if the text features and/or the protocol features contain the preset characters, determining that the website to be identified is a malicious website.

The method for identifying malicious websites further includes:

if neither the text features nor the protocol features contain the preset characters, computing statistics over the text features and/or the protocol features to obtain the statistical features;

determining whether the website to be identified is a malicious website according to the statistical features.

A device for identifying malicious websites, including an acquisition module, a feature extraction module and a first identification module;

the acquisition module is configured to obtain the URL to be identified of a website to be identified;

the feature extraction module is configured to extract target features from the URL to be identified, based on the content of the URL to be identified;

the first identification module is configured to identify, according to the target features, whether the website to be identified is a malicious website.

Preferably, the device further includes a second identification module;

the second identification module is configured to determine, after the acquisition module obtains the URL to be identified, whether the URL to be identified is in a website blacklist or whitelist; if the URL to be identified is in the website blacklist, to determine that the website to be identified is a malicious website; and if the URL to be identified is in the website whitelist, to determine that the website to be identified is a normal website;

the feature extraction module is specifically configured to extract target features from the URL to be identified, based on the content of the URL to be identified, when the second identification module determines that the URL to be identified is in neither the website blacklist nor the website whitelist.

Preferably, the device further includes a third identification module;

the third identification module is configured to determine, after the second identification module determines that the URL to be identified is in neither the website blacklist nor the website whitelist and before the feature extraction module extracts target features from the URL to be identified, whether the website to be identified is a malicious website according to a similarity matching algorithm, based on the domain name of the URL to be identified and a set of trusted domain names.

The first identification module is specifically configured to identify whether the website to be identified is a malicious website according to text features, statistical features and/or protocol features.

A computer-readable storage medium storing instructions which, when run on a network device, cause the network device to perform the above method for identifying malicious websites.

A computer program product which, when run on a network device, causes the network device to perform the above method for identifying malicious websites.

The above technical solution has the following beneficial effects:

The method, device, storage medium and program product for identifying malicious websites provided by the present invention make full use of the rich information contained in the URL itself. Because a URL contains information such as the domain name, access protocol and path, features for identifying the website can be extracted from the URL to be identified based on this information, and the extracted features can then be used to identify whether the website to be identified is a malicious website. This gives a higher recognition rate. In addition, because identification relies only on the information contained in the URL itself and not on information in the web page, no web page needs to be loaded during identification, so recognition is more efficient.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of the method for identifying malicious websites provided by an embodiment of the present invention;

Fig. 2 is a flow chart of the process, within the method provided by an embodiment of the present invention, of identifying whether the website to be identified is a malicious website according to text features, statistical features and/or protocol features;

Fig. 3 is another flow chart of the method for identifying malicious websites provided by an embodiment of the present invention;

Fig. 4 is a flow chart of the process, within the method provided by an embodiment of the present invention, of determining whether the website to be identified is a malicious website according to a similarity matching algorithm, based on the domain name of the URL to be identified and a set of trusted domain names;

Fig. 5 is a structural diagram of the device for identifying malicious websites provided by an embodiment of the present invention;

Fig. 6 is another structural diagram of the device for identifying malicious websites provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

When accessing a website, a user cannot know whether it is a normal website or a malicious one. If the website is malicious, the user's information may be stolen, causing the user unnecessary loss. For this reason, the present invention provides a method for identifying malicious websites. The method can be applied to a network device, that is, a device that can access a network, such as a terminal device (for example a mobile phone, PC, notebook computer or PAD), a router, a gateway or a switch. When a user accesses a website through a URL, the network device can identify the URL. If the website corresponding to the URL is identified as malicious, the network device can block the access or return a prompt telling the user that the website being accessed is malicious.

The method for identifying malicious websites provided by the embodiments of the present invention is described below from the perspective of the network device.

Referring to Fig. 1, which shows a flow chart of embodiment one of the method for identifying malicious websites, this embodiment can be applied to a network device and may include the following steps:

Step 101: Obtain the URL to be identified of the website to be identified.

A URL is the address that leads to a website, and it carries rich information, such as domain name information and access protocol information. This embodiment fully exploits the information contained in the URL itself and uses it to identify malicious websites.

Step 102: Based on the content of the URL to be identified, extract target features from the URL to be identified.

The target features can include text features, statistical features and/or protocol features. The text features can include information about the domain name in the URL to be identified. The protocol features can include information such as the access protocol and access port in the URL to be identified. The statistical features are obtained by computing statistics over the text features and/or protocol features, for example statistics of the domain name in the URL or statistics of the path in the URL.
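
As a minimal sketch of step 102, assuming Python's standard urllib.parse module is an acceptable parser (the patent does not name one), the text and protocol features could be pulled out of a URL as follows. The function name extract_features and the dictionary layout are illustrative and not part of the original.

```python
from urllib.parse import urlsplit


def extract_features(url):
    """Split a URL into text and protocol features without loading the page."""
    # Assumption: URLs written without a scheme are treated as plain http.
    parts = urlsplit(url if "://" in url else "http://" + url)
    return {
        "text": {"domain": parts.hostname or "", "path": parts.path},
        "protocol": {"scheme": parts.scheme, "port": parts.port},
    }


# Example URL taken from the description.
features = extract_features("http://www.syjsbmcl.com:13835")
print(features)
```

Statistical features would then be computed from the "text" and "protocol" entries rather than parsed directly.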

Step 103: Identify whether the website to be identified is a malicious website according to the target features.

Specifically, whether the website to be identified is a malicious website is identified according to the text features, statistical features and/or protocol features. The text features and protocol features characterize the domain name and the protocol in the URL to be identified. The characteristics of the domain name and protocol in the URL of a malicious website differ from those in the URL of a normal website, so the text features and/or protocol features extracted from the URL to be identified can be used to determine whether the website is malicious. Statistical features can also be used to identify malicious websites. For example, the characters in the domain name of a malicious website tend to be more random than those in the domain name of a normal website, so statistics can be computed over the text features to measure the randomness of the characters in the domain name.

The specific process of identifying whether the website to be identified is a malicious website according to text features, statistical features and/or protocol features is described in the subsequent embodiments.

The method for identifying malicious websites provided by the embodiments of the present invention makes full use of the rich information contained in the URL itself. Because a URL contains information such as the domain name, access protocol and path, text features, protocol features and/or statistical features can be extracted from the URL to be identified based on this information, and these features can then be used to identify whether the website to be identified is a malicious website. This gives a higher recognition rate. In addition, because identification relies only on the information contained in the URL itself and not on information in the web page, no web page needs to be loaded during identification, so recognition is more efficient.

The specific process of identifying whether the website to be identified is a malicious website according to text features, statistical features and/or protocol features is described next. Referring to Fig. 2, which shows a flow chart of this process, the process can include:

Step 201: Parse the content of the URL to be identified to obtain the text features and/or protocol features in the URL to be identified.

The text features can include information about the domain name in the URL to be identified, and the protocol features can include information such as the access protocol and access port in the URL to be identified.

Step 202: Judge whether the text features and/or protocol features contain preset characters.

The preset characters are characters that can indicate that the URL to be identified is the URL of a malicious website.

In this embodiment, the preset characters can include one or more of the following: an IP address, a non-standard port number, characters indicating address redirection, a subdomain of a well-known website's domain name, a short domain name, a top-level domain appearing in a subdomain of the domain name, special characters, the substring "https", and user-sensitive information.

Step 203a: If the text features and/or protocol features contain the preset characters, determine that the website to be identified is a malicious website.

To make themselves easy for users to remember, normal websites generally register a domain name rather than have users access the website's IP address directly. Therefore, whether the website to be identified is a malicious website can be determined by judging whether the text features contain an IP address, as in the malicious URL http://81.215.214.238/pp/.
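
The IP-address check can be sketched with Python's standard ipaddress module. This is an illustration under the assumption that "contains an IP address" means the host part of the URL is a literal IP address; the function name host_is_ip is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(url):
    """Return True if the host part of the URL is a literal IP address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Not a valid IPv4 or IPv6 literal, so it is treated as a domain name.
        return False


print(host_is_ip("http://81.215.214.238/pp/"))
```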

To facilitate illegal operations, some malicious websites add a non-standard port to the URL. Therefore, whether the website to be identified is a malicious website can be determined by judging whether the protocol features contain a non-standard port. For example, the URL of a normal http website uses port 80 or port 8080, whereas a malicious URL may use a port other than 80 or 8080, as in the malicious URL http://www.syjsbmcl.com:13835.
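
A minimal sketch of the port check follows. The set of standard ports is taken from the text above (80 and 8080 for http); treating it as a configurable constant, and the name has_nonstandard_port, are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Ports the description treats as standard for http; assumed to be configurable.
STANDARD_HTTP_PORTS = {80, 8080}


def has_nonstandard_port(url):
    """Return True if the URL names an explicit port outside the standard set."""
    port = urlsplit(url).port  # None when the URL does not specify a port
    return port is not None and port not in STANDARD_HTTP_PORTS


print(has_nonstandard_port("http://www.syjsbmcl.com:13835"))
```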

To confuse visitors, some malicious websites include an address redirection in the URL. Address redirection means that when a user browses one URL, the user is directed to another URL, for example http://www.legitimate.com//http://www.phishing.com. Therefore, whether the website to be identified is a malicious website can be determined by judging whether the protocol features contain characters that indicate address redirection, such as the multiple occurrences of http:// in the URL above.
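
One simple way to detect the redirection pattern described above is to count scheme markers. The sketch below assumes that more than one "http://" or "https://" in a URL is the signal of interest; the function name is hypothetical.

```python
import re


def has_embedded_redirect(url):
    """Return True if the URL contains more than one http:// or https:// marker."""
    return len(re.findall(r"https?://", url, flags=re.IGNORECASE)) > 1


print(has_embedded_redirect("http://www.legitimate.com//http://www.phishing.com"))
```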

A URL accessed over https normally requires SSL authentication and is typically used by normal websites for identity verification. A malicious website, however, may make "https" appear as part of the URL string, for example http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/. Therefore, when the protocol features indicate that the URL to be identified is not accessed over https, whether the website to be identified is a malicious website can be determined by judging whether the protocol features contain the substring "https".
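
The "https" substring check can be sketched as follows, assuming the substring is searched for in the host and path of a URL whose actual scheme is not https. The name has_fake_https is illustrative.

```python
from urllib.parse import urlsplit


def has_fake_https(url):
    """Return True if a non-https URL spells 'https' in its host or path."""
    parts = urlsplit(url)
    if parts.scheme.lower() == "https":
        return False  # genuinely accessed over https
    return "https" in (parts.netloc + parts.path).lower()


print(has_fake_https("http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/"))
```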

A malicious website may include user-sensitive information in a URL that is not accessed over https, for example pay.bjkmsm.top/pay/wxpay.php?username=jiaming&uid=10927&gid=1&top_uid=4977&hosturl=www.69111d.com, which contains the sensitive string username. Therefore, when the protocol features indicate that the URL to be identified is not accessed over https, whether the URL to be identified is the URL of a malicious website can also be determined by judging whether the protocol features contain user-sensitive information. User-sensitive information can include character strings such as "Update", "Confirm", "User", "Customer", "Client", "Suspend", "Restrict", "Hold", "Verify", "Account", "Login", "Username", "Password", "SSN" and "SocialSecurity".
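
A sketch of the sensitive-information check, using the keyword list from the paragraph above. Case-insensitive substring matching over the whole URL is an assumption; the patent does not say how the strings are matched.

```python
# Keywords listed in the description, lowercased for case-insensitive matching.
SENSITIVE_WORDS = [
    "update", "confirm", "user", "customer", "client", "suspend", "restrict",
    "hold", "verify", "account", "login", "username", "password", "ssn",
    "socialsecurity",
]


def find_sensitive_words(url):
    """Return the sensitive keywords found in a URL that is not served over https."""
    lowered = url.lower()
    if lowered.startswith("https://"):
        return []
    return [word for word in SENSITIVE_WORDS if word in lowered]


print(find_sensitive_words("http://pay.bjkmsm.top/pay/wxpay.php?username=jiaming&uid=10927"))
```

Plain substring matching is crude: "user" also matches inside "username", so a real implementation would probably tokenize the path and query first.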

To gain trust, some malicious websites include the names of well-known domains in the subdomains of their own domain name. Therefore, whether the website to be identified is a malicious website can be determined by judging whether the text features contain the subdomain name of a well-known domain name, for example haosou-google55-baidu-qqcom-hao123-sogou.pinjianyun.com.

Domain name shortening means registering a short domain name on a specific website; after accessing the short domain name, the user is redirected to a longer real domain name. Some malicious websites use the short domain names of well-known websites to gain trust. Therefore, whether the website to be identified is a malicious website can be determined by judging whether the text features contain a short domain name. For example, the malicious URL http://goo.gl/VmwBNh uses Google's short domain name goo.gl to mislead users into believing that they are visiting a Google website.

Some malicious websites place a top-level domain in a subdomain of the domain name. Therefore, whether the website to be identified is a malicious website can be determined by judging whether the subdomains in the text features contain a top-level domain, for example http://cgi.ebay.com.ebaymotors.732issapidll.private99dll.qqmotorsqq.ebmdata.com. In the URL of this malicious website, the .com after ebay is clearly a top-level domain, yet it appears in a subdomain of the domain name.
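
The top-level-domain check can be sketched as below. The small COMMON_TLDS set and the rule that the last two labels form the registered domain are simplifying assumptions made for this illustration; a production implementation would more likely consult the Public Suffix List.

```python
# Assumption: a tiny illustrative set; the real list of top-level domains is much longer.
COMMON_TLDS = {"com", "net", "org", "cn"}


def tld_in_subdomain(hostname):
    """Return True if a top-level domain label appears among the subdomain labels."""
    labels = hostname.lower().split(".")
    # Assumption: the last two labels (e.g. ebmdata.com) are the registered domain.
    subdomain_labels = labels[:-2]
    return any(label in COMMON_TLDS for label in subdomain_labels)


host = "cgi.ebay.com.ebaymotors.732issapidll.private99dll.qqmotorsqq.ebmdata.com"
print(tld_in_subdomain(host))
```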

In addition, whether the website to be identified is a malicious website can be determined by judging whether the text features contain special characters. Take the URL http://www.paybankonline.com:Ac@50.28.170.70 as an example: according to URL rules, the part before '@' is ignored, so the real host is "50.28.170.70". Therefore, whether the website to be identified is a malicious website can be determined by judging whether the text features contain special characters such as '@', '-' or unicode characters.
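
The '@' case can be illustrated with Python's standard urllib.parse, which separates the user information before '@' from the host that is actually contacted. The helper names are hypothetical.

```python
from urllib.parse import urlsplit


def effective_host(url):
    """Return the host actually contacted, ignoring any user info before '@'."""
    return urlsplit(url).hostname


def has_userinfo(url):
    """Return True if the authority part of the URL contains an '@' sign."""
    return "@" in urlsplit(url).netloc


url = "http://www.paybankonline.com:Ac@50.28.170.70"
print(effective_host(url), has_userinfo(url))
```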

It should be noted that the presence of one kind of preset character in the text features or protocol features does not prove that the website to be identified is a malicious website; it only indicates that the website may be malicious. In other words, deciding on the basis of a single kind of preset character has a certain false detection rate. For this reason, this embodiment preferably combines the kinds of characters described above, that is, it determines whether the website to be identified is a malicious website by judging whether the text features and/or protocol features contain several of these kinds of characters at the same time.

Step 203b: If neither the text features nor the protocol features contain the preset characters, compute statistics over the text features and/or protocol features to obtain the statistical features.

The statistical features can include one or more of the statistics of the domain name in the URL and the statistics of the path in the URL.

Specifically, the statistics of the path in the URL can include one or more of: the total length of the path, the number of subpaths in the path, the average length of the subpaths in the path, and the maximum length of the subpaths in the path. The statistics of the domain name in the URL can include one or more of: statistics of domain name length and count, and statistics of the characters in the domain name. The embodiments of the present invention do not specifically limit this.

The statistics of domain name length and count can further include one or more of: the total length of the domain name, the number of subdomains in the domain name, the average length of the subdomains in the domain name, and the maximum length of the subdomains in the domain name.
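
The length and count statistics above can be sketched with one helper applied to both the domain name and the path. The sketch makes two assumptions that the patent leaves open: the total length is the length of the whole string including separators, and every dot-separated label of the domain name counts as a subdomain.

```python
from urllib.parse import urlsplit


def segment_stats(text, sep):
    """Length and count statistics for the non-empty segments of text split on sep."""
    segments = [s for s in text.split(sep) if s]
    lengths = [len(s) for s in segments]
    return {
        "total_length": len(text),
        "count": len(segments),
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
    }


def statistical_features(url):
    """Statistics of the domain name labels and of the subpaths of a URL."""
    parts = urlsplit(url)
    return {
        "domain": segment_stats(parts.hostname or "", "."),
        "path": segment_stats(parts.path, "/"),
    }


stats = statistical_features("http://81.215.214.238/pp/")
print(stats)
```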

The statistics of the characters in the domain name can further include one or more of: the randomness of the occurrence of each character in the domain name (which can be characterized by Shannon entropy), the ratio of vowel characters to all characters in the domain name, the ratio of digits to all characters in the domain name, the ratio of repeated characters to all characters in the domain name, the ratio of consecutive digits to all characters in the domain name, the ratio of consecutive consonants to all characters in the domain name, the statistical ranking of the words in the domain name, and the randomness between the characters of the domain name.
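
A few of these character statistics can be sketched as follows. The Shannon entropy here is the standard definition, H = -sum(p_i * log2(p_i)), over the character frequencies of the string; excluding dots from the ratio denominators is an assumption.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in Counter(text).values())


def char_ratios(domain):
    """Vowel and digit ratios over the characters of a domain name, dots excluded."""
    chars = [c for c in domain.lower() if c != "."]
    total = len(chars) or 1  # avoid division by zero for an empty name
    return {
        "vowel_ratio": sum(c in "aeiou" for c in chars) / total,
        "digit_ratio": sum(c.isdigit() for c in chars) / total,
    }


print(shannon_entropy("syjsbmcl"), char_ratios("syjsbmcl.com"))
```

A higher entropy and a lower vowel ratio both point towards a more random-looking name, which the description associates with malicious websites.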

It should be noted that, to obtain the statistical ranking of the words in the domain name, the domain name of the URL to be identified must first be segmented into multiple words using a preset segmentation method, such as N-Gram statistical segmentation (including uni-gram, bi-gram and tri-gram statistical segmentation). The ranking of each word is then determined, which yields the statistical ranking of the words in the domain name. If the domain name contains many highly ranked words, it is usually the domain name of a normal website; otherwise it may be the domain name of a malicious website. In this embodiment, the ranking of each word in the domain name of the URL to be identified is determined as follows: a set of target URLs is obtained in advance, the set is segmented using N-Gram statistical segmentation, and the segmentation results are sorted by word frequency to obtain the statistical ranking of words; the ranking of each word in the domain name of the URL to be identified is then determined from this ranking. The URLs in the target URL set can be obtained in several ways, for example from the network devices of a network vendor, from open-source third-party websites, or through a search engine.
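
The ranking procedure can be sketched with character n-grams. The corpus below is a made-up toy list used only to show the mechanics; the function names, and the choice to give unseen n-grams the rank just past the end of the table, are assumptions.

```python
from collections import Counter


def ngrams(text, n):
    """All contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ranking(domains, n=2):
    """Map each n-gram in the corpus to its frequency rank (1 = most frequent)."""
    counts = Counter(gram for domain in domains for gram in ngrams(domain, n))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(), start=1)}


def ngram_ranks(domain, ranking, n=2):
    """Rank of every n-gram in domain; unseen n-grams get the worst rank plus one."""
    unseen = len(ranking) + 1
    return [ranking.get(gram, unseen) for gram in ngrams(domain, n)]


# Toy corpus standing in for the pre-collected target URL set.
ranking = build_ranking(["google", "goo"], n=2)
print(ngram_ranks("goo", ranking), ngram_ranks("xz", ranking))
```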

It should also be noted that the randomness between the characters of the domain name can be characterized by the transition probabilities between characters. If the transition probabilities between the characters of a domain name are low, the characters of the domain name are considered highly random, and a domain name with highly random characters is usually the domain name of a malicious website. In one possible implementation, a hidden Markov model (HMM) can be used to analyze the transition probabilities between the characters of the domain name. An HMM analyzes state transition probabilities from an observable state sequence in order to find the pattern of the hidden states, which makes it well suited to analyzing the transition probabilities between the characters of a domain name. In other words, analyzing the randomness of a domain name amounts to analyzing the HMM pattern that corresponds to the domain name.
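
To illustrate the idea of character transition probabilities, the sketch below uses a plain first-order Markov chain over characters with add-one smoothing. This is a deliberately simpler stand-in for the HMM mentioned in the description, not an HMM implementation; the training list is a made-up toy corpus, and the alphabet size of 38 (a to z, 0 to 9, hyphen and dot) is an assumption.

```python
import math
from collections import defaultdict


def train_transitions(domains):
    """Count character-to-character transitions over a corpus of domain names."""
    counts = defaultdict(lambda: defaultdict(int))
    for domain in domains:
        for a, b in zip(domain, domain[1:]):
            counts[a][b] += 1
    return counts


def mean_log_transition_prob(domain, counts, alphabet_size=38):
    """Average log-probability of the transitions in domain (add-one smoothing)."""
    pairs = list(zip(domain, domain[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        row = counts.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + alphabet_size))
    return total / len(pairs)


# Toy corpus standing in for a collection of normal domain names.
model = train_transitions(["google", "goodle", "gogo"])
print(mean_log_transition_prob("goo", model), mean_log_transition_prob("qxz", model))
```

A lower average log-probability means the character sequence is less typical of the training corpus, which corresponds to the "low transition probability, high randomness" case in the text.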

Step 204: Determine whether the website to be identified is a malicious website according to the statistical features.

It should be noted that the nature of the website to be identified can be determined using a single one of the statistical features listed above. For example, if the statistical feature is the transition probability between the characters of the domain name and this probability is below a preset value, the characters of the domain name can be considered highly random, and the website to be identified can be determined to be a malicious website. In practice, however, the domain names of some normal websites may also have highly random characters, so identification based on a single statistical feature has a certain false detection rate. For this reason, in a preferred implementation, several of the features listed above are combined to determine the nature of the website to be identified. For example, the features can be assembled into a feature vector, and a preset recognition algorithm can be applied to this vector; when the result is greater than a preset value the website is a normal website, and otherwise it is a malicious website. The preset recognition algorithm can be, but is not limited to, a Bayesian classification algorithm.
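
As one possible reading of "Bayesian classification over a feature vector", the sketch below implements a small Gaussian naive Bayes classifier from scratch. The training vectors are made-up toy numbers (entropy, digit ratio), not data from the patent, and the sketch picks the class with the highest posterior instead of comparing a score against a preset value as the text describes.

```python
import math


def fit_gaussian_nb(samples, labels):
    """Estimate per-class prior, feature means and feature variances."""
    model = {}
    for label in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(samples), means, variances)
    return model


def predict(model, x):
    """Return the class label with the highest log-posterior for feature vector x."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy feature vectors: [Shannon entropy, digit ratio]. Values are illustrative only.
samples = [[1.0, 0.1], [1.2, 0.0], [3.5, 0.6], [3.8, 0.5]]
labels = ["normal", "normal", "malicious", "malicious"]
model = fit_gaussian_nb(samples, labels)
print(predict(model, [3.6, 0.55]))
```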

The method for identifying malicious websites provided by the above embodiment identifies a website from the information in the URL itself and does not need to load the web page, which makes identification efficient. To improve efficiency further, the embodiments of the present invention provide another embodiment of the method for identifying malicious websites, which can be applied to a network device. Referring to Fig. 3, which shows a flow chart of this embodiment, the embodiment can include the following steps:

Step 301: Obtain the URL to be identified of the website to be identified.

A URL is the address that leads to a website, and it carries rich information, such as domain name information and access protocol information.

Step 302: Judge whether the URL to be identified is in the website blacklist or whitelist.

In this embodiment, the website blacklist and whitelist can be generated in advance. The website blacklist contains the domain names of malicious websites, and the website whitelist contains the domain names of normal websites.

Specifically, the process of generating the website blacklist and whitelist can include: counting, per preset statistical period, the access frequency of the domain names that pass through the network device; generating the website whitelist from the domain names whose access frequency is relatively stable within a preset time period; and generating the website blacklist from the domain names whose access frequency is unstable within the preset time period.

If the access frequency of a domain name is relatively stable within the preset time period, the domain name is taken to be that of a normal website. If the access frequency of a domain name is unstable within the preset time period, for example high in one or a few statistical periods and low or zero in the others, the domain name is taken to be that of a malicious website.
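
The patent does not say how stability is measured. The sketch below assumes, purely for illustration, that the coefficient of variation (standard deviation divided by mean) of the per-period access counts is compared against a threshold; the threshold value, the function name and the example counts are all hypothetical.

```python
from statistics import mean, pstdev


def classify_by_stability(access_counts, max_cv=0.5):
    """Split domains into whitelist and blacklist by how stable their access counts are.

    access_counts maps a domain name to its non-empty list of per-period access counts.
    """
    result = {"whitelist": set(), "blacklist": set()}
    for domain, counts in access_counts.items():
        average = mean(counts)
        # Coefficient of variation; a domain that is never accessed counts as unstable.
        cv = pstdev(counts) / average if average else float("inf")
        result["whitelist" if cv <= max_cv else "blacklist"].add(domain)
    return result


# Made-up counts: one steady domain and one with a single burst of traffic.
lists = classify_by_stability({
    "steady.example": [100, 110, 95, 105],
    "bursty.example": [0, 0, 900, 0],
})
print(lists)
```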

In addition, because domain names on the network change frequently, the website blacklist and whitelist can be updated periodically to keep them valid.

Step 303a: If the URL to be identified is in the website blacklist, determine that the website to be identified is a malicious website.

Because the website blacklist contains the domain names of malicious websites, if the domain name of the URL to be identified is in the website blacklist, the website to be identified can be determined to be a malicious website.

Step 303b: If the URL to be identified is in the website whitelist, determine that the website to be identified is a normal website.

Because the website whitelist contains the domain names of normal websites, if the domain name of the URL to be identified is in the website whitelist, the website to be identified can be determined to be a normal website.

The website blacklist and whitelist make it possible to identify the nature of the website to be identified quickly.

Step 303c: If the URL to be identified is in neither the website blacklist nor the website whitelist, determine whether the website to be identified is a malicious website according to a similarity matching algorithm, based on the domain name of the URL to be identified and a trusted domain set.

In one possible implementation, the trusted domain set can be a list of well-known websites. Specifically, a domain name ranking list can be obtained first, and the top N domain names in it can then be used to generate the well-known website list, where N can be set according to actual needs.

Further, the domain name ranking list can be obtained in several ways. In one possible implementation, it can be obtained from a website: some websites, such as Alexa, specialize in ranking websites by traffic, and a domain name ranking list can be downloaded from them. Such a list ranks the domain names of numerous websites in descending order of traffic. It can be understood that the top-ranked domain names have higher traffic and usually belong to well-known websites, so the top N domain names in the ranking list can form the well-known website list that serves as the trusted domain set.
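Building the trusted domain set from a ranking list can be sketched as follows. The `rank,domain` CSV layout, the function name `top_n_domains` and the sample entries are assumptions made for illustration; the source only states that the top N domain names of a downloaded ranking are used.

```python
import csv
import io


def top_n_domains(ranking_csv, n):
    """Return the n best-ranked domain names from 'rank,domain' CSV text."""
    ranked = []
    for row in csv.reader(io.StringIO(ranking_csv)):
        if not row:
            continue  # skip blank lines
        rank, domain = row
        ranked.append((int(rank), domain.strip().lower()))
    ranked.sort()  # lowest rank number (highest traffic) first
    return [domain for _, domain in ranked[:n]]


sample = "1,example.com\n2,example.org\n3,example.net\n"
trusted = top_n_domains(sample, 2)
```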

For the specific implementation process of determining whether the website to be identified is a malicious website according to the similarity matching algorithm, based on the domain name of the URL to be identified and the trusted domain set, refer to the description of the subsequent embodiment.

Step 304: If the nature of the website to be identified cannot be determined according to the similarity matching algorithm, extract a target feature from the URL to be identified based on the content of the URL to be identified.

Step 305: Identify whether the website to be identified is a malicious website according to the target feature.

For the specific process of identifying whether the website to be identified is a malicious website according to the target feature, refer to the above embodiment; it is not repeated here.

The method for identifying malicious websites provided by this embodiment of the present invention takes into account that identification based on the website blacklist and whitelist has low complexity and high efficiency, so it first identifies whether the website to be identified is a malicious website based on the website blacklist and whitelist. When the nature of the website to be identified cannot be identified from the blacklist and whitelist, the method falls back on identification based on the similarity matching algorithm, which has the next-lowest complexity and also relatively high efficiency. When the nature of the website to be identified cannot be determined by the similarity matching algorithm, the method identifies whether the website is malicious based on the target feature extracted from the URL to be identified. It can be seen that the method provided by this embodiment can use multiple website identification approaches to identify malicious websites, and that the identification process relies only on the URL itself, not on information in the web page, so no web page needs to be loaded. This improves both the recognition rate and, substantially, the efficiency of website identification.
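The staged flow of steps 302 to 305 can be sketched as a short cascade. The names `Verdict`, `identify`, `similarity_stage` and `feature_stage` are hypothetical; the two later stages are passed in as callables so that the sketch stays independent of how they are implemented.

```python
from enum import Enum
from urllib.parse import urlsplit


class Verdict(Enum):
    MALICIOUS = "malicious"
    NORMAL = "normal"
    UNKNOWN = "unknown"


def identify(url, blacklist, whitelist, similarity_stage, feature_stage):
    """Run the stages in order of increasing cost (steps 302 to 305)."""
    # Assumes the URL carries a scheme such as "http://"; without one,
    # urlsplit() reports no hostname.
    domain = urlsplit(url).hostname
    if domain in blacklist:               # step 303a
        return Verdict.MALICIOUS
    if domain in whitelist:               # step 303b
        return Verdict.NORMAL
    verdict = similarity_stage(domain)    # step 303c
    if verdict is not Verdict.UNKNOWN:
        return verdict
    return feature_stage(url)             # steps 304 and 305


result = identify(
    "http://bad.example/login",
    blacklist={"bad.example"},
    whitelist={"good.example"},
    similarity_stage=lambda domain: Verdict.UNKNOWN,
    feature_stage=lambda url: Verdict.NORMAL,
)
```

Ordering the stages from cheapest to most expensive means that most URLs are settled by a set lookup and only the remainder reach the later stages.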

The specific implementation process of step 303c, in which whether the website to be identified is a malicious website is determined according to the similarity matching algorithm based on the domain name of the URL to be identified and the trusted domain set, is described below. Referring to Fig. 4, which shows a schematic flowchart of the specific implementation process of step 303c, the process may include:

Step 401: Calculate the similarity between the domain name of the URL to be identified and each trusted domain name in the trusted domain set.

Step 402: Determine k target trusted domain names from the trusted domain set in descending order of similarity.

Here, the k target trusted domain names are the k domain names ranked highest by similarity to the domain name of the URL to be identified, and k is a positive integer greater than or equal to 1.

For example, k is 1, that is, the domain name with the greatest similarity to the domain name of the URL to be identified is determined from the trusted domain set. Assume that the domain name of the URL to be identified is D₀ and that the trusted domain set contains D₁, D₂, D₃, D₄, D₅ and D₆. The similarity between D₀ and each of D₁, D₂, D₃, D₄, D₅ and D₆ is calculated. If the calculation shows that D₃ has the greatest similarity to D₀, D₃ is determined to be the target trusted domain name.

As another example, k is 3, the domain name of the URL to be identified is D₀, and the trusted domain set contains D₁, D₂, D₃, D₄, D₅ and D₆. The similarity between D₀ and each of D₁, D₂, D₃, D₄, D₅ and D₆ is calculated. If the calculation shows that the domain name most similar to D₀ is D₃, followed by D₄, D₂, D₁ and D₅, the target trusted domain names can be determined to be D₃, D₄ and D₂.

Step 403: Determine whether the k target trusted domain names include a domain name whose similarity to the domain name of the URL to be identified is less than a set threshold.

Step 404: If the k target trusted domain names include a domain name whose similarity to the domain name of the URL to be identified is less than the set threshold, determine that the website to be identified is a malicious website.

If the k target trusted domain names include no domain name whose similarity to the domain name of the URL to be identified is less than the set threshold, perform step 304.

As can be seen from the above process, there can be one or more target trusted domain names. If there is one target trusted domain name, it is determined whether its similarity to the domain name of the URL to be identified is less than the set threshold. If so, the URL to be identified can be determined to be the URL of a malicious website, that is, the website to be identified is a malicious website, and the website corresponding to the target trusted domain name is the website that the malicious website counterfeits. If there are multiple target trusted domain names, it is determined for each of them whether the similarity between the domain name of the URL to be identified and that target trusted domain name is less than the set threshold. If the similarity between at least one target trusted domain name and the domain name of the URL to be identified is less than the set threshold, the URL to be identified can be determined to be the URL of a malicious website, that is, the website to be identified is a malicious website, and the websites corresponding to those target trusted domain names are the websites that the malicious website counterfeits.
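Steps 401 to 404 can be sketched as follows, under the assumption that similarity is expressed as edit distance, so that a smaller value means a closer match. The function names and sample domains are hypothetical. Skipping exact matches (distance 0) is also an assumption not stated in the source, made so that a trusted domain is not reported as a counterfeit of itself.

```python
import heapq


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def check_by_similarity(domain, trusted_domains, k=1, threshold=3):
    """Return (is_malicious, imitated_domains) for steps 401 to 404."""
    # Step 401: score the domain against every trusted domain name.
    scored = [(edit_distance(domain, t), t) for t in trusted_domains]
    # Step 402: keep the k closest trusted domain names.
    targets = heapq.nsmallest(k, scored)
    # Steps 403 and 404: any target closer than the threshold marks a counterfeit.
    # Assumption: distance 0 is the trusted site itself and is skipped.
    imitated = [t for dist, t in targets if 0 < dist < threshold]
    return bool(imitated), imitated


trusted = ["www.icbc.com.cn", "www.example.com"]
verdict = check_by_similarity("www.icbc.cmn.cn", trusted, k=1, threshold=3)
```

When the first element of the result is false, the caller would continue with step 304.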

In one possible implementation, the target trusted domain name can be determined using the Bed-tree algorithm. This method organizes and stores all trusted domain names in a trie and then searches the trie for the domain name most similar to the domain name of the URL to be identified. In this algorithm, similarity can be characterized by edit distance, so a smaller value means a closer match. During the search, the edit distance between the domain name of the URL to be identified and each trusted domain name is calculated, and the trusted domain name with the smallest edit distance to the domain name of the URL to be identified is taken as the target trusted domain name. The judgment then consists of determining whether the edit distance between the domain name of the URL to be identified and the target trusted domain name is less than the set threshold. If so, the website to be identified can be determined to be a malicious website.

For example, the target trusted domain name is "www.icbc.com.cn", the domain name of the URL to be identified is "www.icbc.cmn.cn", and the set threshold is 3. Because the edit distance between "www.icbc.com.cn" and "www.icbc.cmn.cn" is 2 (two character substitutions), which is less than the set threshold of 3, the website to be identified can be determined to be a malicious website whose domain name counterfeits the domain name www.icbc.com.cn.
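The trie-based search can be illustrated with the sketch below. It is a plain trie walk that carries one row of the edit-distance table per node and prunes branches that can no longer fall under the bound; it is not an implementation of the Bed-tree index itself, and all names in it are hypothetical. With a set threshold of 3, "less than the threshold" corresponds to a maximum distance of 2.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on the node that ends a stored domain name


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def search(root, query, max_distance):
    """Return (domain, distance) pairs within max_distance, closest first."""
    first_row = list(range(len(query) + 1))
    results = []
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_distance, results)
    return sorted(results, key=lambda item: (item[1], item[0]))


def _walk(node, ch, query, prev_row, max_distance, results):
    # Extend the edit-distance table by one row for the character ch.
    row = [prev_row[0] + 1]
    for col in range(1, len(query) + 1):
        row.append(min(row[col - 1] + 1,
                       prev_row[col] + 1,
                       prev_row[col - 1] + (query[col - 1] != ch)))
    if node.word is not None and row[-1] <= max_distance:
        results.append((node.word, row[-1]))
    # Prune: no entry in this row is within the bound, so no descendant can be.
    if min(row) <= max_distance:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, max_distance, results)


trie = build_trie(["www.icbc.com.cn", "www.example.com"])
matches = search(trie, "www.icbc.cmn.cn", max_distance=2)
```

Because trusted domain names share prefixes such as "www.", the rows computed for a shared prefix are reused across every domain name below it, which is the main saving over comparing against each trusted domain name separately.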

Optionally, when the website to be identified is determined to be a malicious website, the URL or domain name of the website that the malicious website counterfeits can also be output.

An embodiment of the present invention also provides a corresponding apparatus for identifying malicious websites. Referring to Fig. 5, which shows a schematic structural diagram of the apparatus, the apparatus may include: an acquisition module 501, a feature extraction module 502 and a first identification module 503.

The acquisition module 501 is configured to obtain the URL to be identified of the website to be identified.

The feature extraction module 502 is configured to extract a target feature from the URL to be identified based on the content of the URL to be identified.

The first identification module 503 is configured to identify whether the website to be identified is a malicious website according to the target feature.

In some possible implementations of this embodiment of the present invention, the first identification module 503 may include: an identification submodule, a judging submodule and a first determination submodule.

The identification submodule is configured to identify the content of the URL to be identified and obtain the text feature and/or protocol feature in the URL to be identified.

The judging submodule is configured to judge whether the text feature and/or the protocol feature includes a preset character, where a preset character is a character from which it can be determined that the URL to be identified is the URL of a malicious website.

The first determination submodule is configured to determine that the website to be identified is a malicious website when the judging submodule judges that the text feature and/or the protocol feature includes a preset character.

In some possible implementations of this embodiment of the present invention, the first identification module 503 may also include: a statistic submodule and a second determination submodule.

The statistic submodule is configured to perform statistics on the text feature and/or the protocol feature to obtain a statistical feature when the judging submodule judges that the text feature and/or the protocol feature does not include a preset character.

The second determination submodule is configured to determine whether the website to be identified is a malicious website according to the statistical feature obtained by the statistic submodule.
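The two-stage check performed by these submodules can be sketched as follows. The preset character `"@"`, the digit-ratio statistic and its cutoff are hypothetical placeholders chosen only to make the sketch runnable; this section of the source does not specify concrete preset characters or statistical features.

```python
from urllib.parse import urlsplit

# Hypothetical preset characters; the source does not list concrete ones here.
PRESET_CHARACTERS = ("@",)


def identify_by_features(url, max_digit_ratio=0.3):
    """Return True if the URL is judged malicious from its own content."""
    parts = urlsplit(url)
    protocol_feature = parts.scheme           # for example "http"
    text_feature = parts.netloc + parts.path  # host (with any user info) and path

    # Judging submodule and first determination submodule.
    if any(ch in text_feature or ch in protocol_feature
           for ch in PRESET_CHARACTERS):
        return True

    # Statistic submodule: one illustrative statistic, the share of digits.
    if not text_feature:
        return False
    digit_ratio = sum(c.isdigit() for c in text_feature) / len(text_feature)

    # Second determination submodule.
    return digit_ratio > max_digit_ratio
```

A usage example under the same assumptions: `identify_by_features("http://www.example.com/index")` contains no preset character and no digits, so it is not flagged.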

The apparatus for identifying malicious websites provided by this embodiment of the present invention makes full use of the rich information in the URL itself. Because the URL itself contains information such as the domain name, access protocol and path, a text feature, protocol feature and/or statistical feature can be extracted from the URL to be identified based on this information, and whether the website to be identified is a malicious website can then be identified according to these features. The recognition rate of this approach is high. Moreover, because identification relies only on the information contained in the URL itself, not on information in the web page, no web page needs to be loaded during identification, so identification is efficient.

Referring to Fig. 6, which shows another schematic structural diagram of the apparatus for identifying malicious websites, the apparatus 60 may include: an acquisition module 601, a feature extraction module 602, a first identification module 603, a second identification module 604 and a third identification module 605.

The acquisition module 601 is configured to obtain the URL to be identified of the website to be identified.

The second identification module 604 is configured to determine, after the acquisition module 601 obtains the URL to be identified, whether the URL to be identified is in the website blacklist or the website whitelist; to determine that the website to be identified is a malicious website if the URL to be identified is in the website blacklist; and to determine that the website to be identified is a normal website if the URL to be identified is in the website whitelist.

The third identification module 605 is configured to determine, when the second identification module 604 determines that the URL to be identified is in neither the website blacklist nor the website whitelist, whether the website to be identified is a malicious website according to the similarity matching algorithm, based on the domain name of the URL to be identified and the trusted domain set.

The feature extraction module 602 is configured to extract a target feature from the URL to be identified, based on the content of the URL to be identified, when the third identification module 605 cannot identify the nature of the website to be identified.

The first identification module 603 is configured to identify whether the website to be identified is a malicious website according to the target feature.

It should be noted that for the specific structure of the first identification module 603 and the specific implementation process by which it identifies whether the website to be identified is a malicious website according to the target feature, refer to the description of the first identification module 503 in the above embodiment; it is not repeated here.

After obtaining the URL to be identified of the website to be identified, the apparatus for identifying malicious websites provided by this embodiment of the present invention takes into account that identification based on the website blacklist and whitelist has low complexity and high efficiency. When the nature of the website to be identified cannot be identified from the blacklist and whitelist, the apparatus falls back on identification based on the similarity matching algorithm, which has the next-lowest complexity and also relatively high efficiency. When the nature of the website to be identified cannot be determined by the similarity matching algorithm, the apparatus identifies whether the website is malicious based on the target feature extracted from the URL to be identified. The apparatus can thus use multiple website identification approaches to identify malicious websites, and the identification process relies only on the URL itself, not on information in the web page, so no web page needs to be loaded. This improves both the recognition rate and, substantially, the efficiency of website identification.

In addition, an embodiment of the present invention also provides a computer-readable storage medium. Instructions can be stored in the computer-readable storage medium, and when the instructions are run on a network device, they cause the network device to perform the method for identifying malicious websites provided in the embodiments of the present invention.

An embodiment of the present invention also provides a computer program product which, when run on a network device, causes the network device to perform the method for identifying malicious websites provided in the embodiments of the present invention.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.

In the several embodiments provided in this application, it should be understood that the disclosed method, apparatus and device may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for identifying malicious websites, characterized by comprising:

obtaining a URL to be identified of a website to be identified;

extracting a target feature from the URL to be identified based on the content of the URL to be identified;

identifying whether the website to be identified is a malicious website according to the target feature.
2. The method for identifying malicious websites according to claim 1, characterized in that, after the URL to be identified is obtained, the method further comprises:

determining whether the URL to be identified is in a website blacklist or a website whitelist;

if the URL to be identified is in the website blacklist, determining that the website to be identified is a malicious website;

if the URL to be identified is in the website whitelist, determining that the website to be identified is a normal website;

if it is determined that the URL to be identified is in neither the website blacklist nor the website whitelist, performing the extracting of a target feature from the URL to be identified based on the content of the URL to be identified.
3. The method for identifying malicious websites according to claim 2, characterized in that, after it is determined that the URL to be identified is in neither the website blacklist nor the website whitelist and before the extracting of a target feature from the URL to be identified based on the content of the URL to be identified is performed, the method further comprises:

determining whether the website to be identified is a malicious website according to a similarity matching algorithm, based on the domain name of the URL to be identified and a trusted domain set.
4. The method for identifying malicious websites according to any one of claims 1-3, characterized in that the identifying whether the website to be identified is a malicious website according to the target feature comprises:

identifying whether the website to be identified is a malicious website according to a text feature, a statistical feature and/or a protocol feature.
5. The method for identifying malicious websites according to claim 4, characterized in that identifying whether the website to be identified is a malicious website according to the text feature and/or the protocol feature comprises:

identifying the content of the URL to be identified to obtain the text feature and/or the protocol feature in the URL to be identified;

judging whether the text feature and/or the protocol feature includes a preset character, the preset character being a character from which it can be determined that the URL to be identified is the URL of a malicious website;

if the text feature and/or the protocol feature includes the preset character, determining that the website to be identified is a malicious website.
6. The method for identifying malicious websites according to claim 4, characterized in that the method further comprises:

if neither the text feature nor the protocol feature includes the preset character, performing statistics on the text feature and/or the protocol feature to obtain the statistical feature;

determining whether the website to be identified is a malicious website according to the statistical feature.
7. An apparatus for identifying malicious websites, characterized by comprising: an acquisition module, a feature extraction module and a first identification module;

the acquisition module being configured to obtain a URL to be identified of a website to be identified;

the feature extraction module being configured to extract a target feature from the URL to be identified based on the content of the URL to be identified;

the first identification module being configured to identify whether the website to be identified is a malicious website according to the target feature.
8. The apparatus for identifying malicious websites according to claim 7, characterized in that the first identification module is specifically configured to identify whether the website to be identified is a malicious website according to a text feature, a statistical feature and/or a protocol feature.
9. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a network device, they cause the network device to perform the method for identifying malicious websites according to any one of claims 1-6.
10. A computer program product, characterized in that when the computer program product is run on a network device, it causes the network device to perform the method for identifying malicious websites according to any one of claims 1-6.