CN109274632A - Website identification method and device - Google Patents

Website identification method and device

Publication number
CN109274632A
CN109274632A CN201710565741.8A CN201710565741A CN109274632A CN 109274632 A CN109274632 A CN 109274632A CN 201710565741 A CN201710565741 A CN 201710565741A CN 109274632 A CN109274632 A CN 109274632A
Authority
CN
China
Prior art keywords
website
url
high probability
url request
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710565741.8A
Other languages
Chinese (zh)
Other versions
CN109274632B (en)
Inventor
付为民
郝建忠
郑浩彬
陈涛
邬学农
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Guangdong Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Guangdong Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Guangdong Co Ltd
Priority to CN201710565741.8A
Publication of CN109274632A
Application granted
Publication of CN109274632B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/101 Access control lists [ACL]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present invention provide a website identification method and device. The method includes: receiving a uniform resource locator (URL) request from a user accessing a website and looking up the URL corresponding to the URL request in a white list; if the URL is found in the white list, connecting to it; looking up the URL in a blacklist; if the URL is found in the blacklist, generating high-risk prompt information; and, if the URL is found in neither the white list nor the blacklist, calculating each feature weight value of the URL according to preset rules and identifying, from those feature weight values, whether the URL is an abnormal website. The embodiments identify abnormal websites quickly, accurately and efficiently, significantly reduce the system's misjudgment rate, and improve the user experience.

Description

Website identification method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a website identification method and device.
Background
With the rapid development of the mobile internet, users increasingly browse websites on mobile devices rather than only on PCs. On June 22, 2016, the China Internet Network Information Center (CNNIC) released the 37th "Statistical Report on Internet Development in China" in Beijing. The report shows that by December 2015 China had 688 million internet users, of whom 620 million were mobile phone users, a share of 90.12%.
At the same time, security problems on mobile phone clients have become increasingly prominent. In 2015 the number of active networked smartphone terminals in China reached 1.13 billion, and problems led by counterfeit websites, phishing websites and malicious programs keep growing. They threaten users' internet security and cause financial loss or leakage of personal information.
At present, operators intercept uniform resource locator (Uniform Resource Locator, hereinafter URL) requests from mobile phone clients mainly on the network side, by means of a blacklist.
Blacklist method: a blacklist is configured on the Wireless Application Protocol (WAP) gateway. When a mobile phone's HTTP request reaches the WAP gateway, the gateway parses the URL in the HyperText Transfer Protocol (HTTP) header and matches it against the blacklist entries one by one. If the URL hits the blacklist, the WAP gateway no longer proxies the request and directly returns a 403 access-denied page to the mobile terminal.
Advantages of the blacklist method: it is simple and direct. For any URL that hits the blacklist, the gateway performs no further proxying; because the proxy gateway does not need to send a request to the origin server, its load is reduced. The mobile terminal directly receives the 403 access-denied page, which is presented by the browser or application (app).
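The gateway behaviour described above can be illustrated with a minimal Python sketch. It is not the operator's implementation: the blacklist entries, the function name `gateway_status`, the choice to match on the host name rather than the full URL, and the use of a set lookup instead of one-by-one matching are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries; a real gateway would load these from configuration.
BLACKLIST = {"phish.example.com", "fake-bank.example.net"}


def gateway_status(url: str) -> int:
    """Return 403 for a blacklisted host, or 200 to signal that the gateway would proxy the request."""
    # Assumption: matching is done on the lower-cased host name only.
    host = (urlsplit(url).hostname or "").lower()
    return 403 if host in BLACKLIST else 200
```

A blacklisted host is refused without contacting the origin server, which is the load-saving property the text attributes to this method.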
Disadvantages of the blacklist method:
1. The blacklist is currently deployed on the WAP gateway, which requires the user to configure the 10.0.0.172 proxy on the terminal. If no proxy is configured, the user's traffic does not pass through the WAP gateway and cannot be intercepted.
According to statistics, more than 90% of users currently do not configure the 10.0.0.172 proxy on their terminals, so the interception scheme has no effect for these users.
2. The blacklist interception page is too simple, so users mistake it for a network failure and have a poor experience.
Users mostly reach illegal websites through illegal SMS messages, emails, advertisements and similar pushes, and are unaware that the website they are visiting is illegal, harmful or wrong. Blacklist handling effectively blocks the access, but the user only receives a bare access-denied page and will wrongly conclude that the network or the website service is faulty, which lowers the user's opinion of the operator's network or of the website. This approach also tends to cause users to retry repeatedly, or clients to retry automatically. Moreover, as counterfeit and phishing websites multiply, the blacklist keeps growing, and an oversized blacklist means each match takes longer. This increases the processing load on the proxy gateway, reduces its processing efficiency, and so slows users' network access.
3. Traditional blacklist interception demands very high data accuracy. To ensure that normal websites are not blocked by mistake, a large amount of manual one-by-one review is needed, which is time-consuming and laborious, and the vast number of suspected websites across the internet cannot all be reviewed individually. In addition, counterfeit and phishing websites change domain names frequently, closely resemble legitimate sites and are short-lived, so the traditional blacklist approach no longer meets current needs.
4. Traditional blacklist interception cannot flexibly handle the great majority of suspected websites: adding them to the blacklist and blocking them outright easily provokes complaints from the websites, while leaving them unhandled carries a real risk of leaking user privacy.
Therefore, how to improve on traditional blacklist interception and identify abnormal websites quickly, accurately and efficiently has become a technical problem in urgent need of a solution.
Summary of the invention
In view of the defects in the prior art, embodiments of the present invention provide a website identification method and device.
In a first aspect, an embodiment of the present invention provides a website identification method, the method comprising:
receiving a uniform resource locator (URL) request from a user accessing a website, looking up the URL corresponding to the URL request in a white list, and, if the URL corresponding to the URL request is found in the white list, connecting to the URL corresponding to the URL request;
looking up the URL corresponding to the URL request in a blacklist, and, if the URL corresponding to the URL request is found in the blacklist, generating high-risk prompt information;
if the URL corresponding to the URL request is found in neither the white list nor the blacklist, calculating each feature weight value of the URL corresponding to the URL request according to preset rules, and identifying, according to each feature weight value, whether the URL corresponding to the URL request is an abnormal website.
Optionally, calculating each feature weight value of the URL corresponding to the URL request according to preset rules specifically includes:
calculating, according to preset rules, feature weight values in four dimensions for the URL corresponding to the URL request: a domain name similarity weight, a web page content similarity weight, a user report volume weight and a secondary access volume weight.
Optionally, the abnormal website specifically includes:
a high-probability abnormal website, a suspected abnormal website and a high-probability normal website.
Optionally, the method further includes:
if the URL corresponding to the URL request is an abnormal website, performing secondary identification on the URL corresponding to the URL request;
if the result of the secondary identification is the high-probability abnormal website, generating high-risk prompt information, tracking and identifying the high-probability abnormal website, allowing secondary connections to the high-probability abnormal website and counting the number of secondary connections, and adding the high-probability abnormal website to the blacklist;
if the result of the secondary identification is the high-probability normal website, connecting directly to the high-probability normal website and adding the high-probability normal website to the white list;
if the result of the secondary identification is the suspected abnormal website, generating general-risk prompt information, tracking and identifying the suspected abnormal website, allowing secondary connections to the high-probability abnormal website and counting the number of secondary connections, and adding the suspected abnormal website to a gray list.
Optionally, the method further includes:
performing iterative calculation and identification on the blacklist, the white list and the gray list according to periodically updated information, namely user feedback, crawled web page content, updated web page content feature similarity values and website secondary access volumes;
if the identification result is the high-probability abnormal website, adding it to the blacklist;
if the identification result is the high-probability normal website, adding it to the white list;
if the identification result is neither the high-probability abnormal website nor the high-probability normal website, keeping it in the gray list to await the next iterative calculation and identification.
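One iteration pass of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `reevaluate_gray`, the three result labels and the caller-supplied `classify` function are hypothetical, and the sketch re-examines only the gray list, whereas the text also iterates over the blacklist and the white list.

```python
from typing import Callable, Set


def reevaluate_gray(gray: Set[str], black: Set[str], white: Set[str],
                    classify: Callable[[str], str]) -> None:
    """Re-run identification for every gray-listed URL and move it if the result is now decisive."""
    # Iterate over a copy because the gray set is modified inside the loop.
    for url in list(gray):
        result = classify(url)  # hypothetical classifier using refreshed feature data
        if result == "high_probability_abnormal":
            gray.discard(url)
            black.add(url)
        elif result == "high_probability_normal":
            gray.discard(url)
            white.add(url)
        # Any other result: the URL stays in the gray list for the next iteration.
```

Because the function mutates the three sets in place, it can be called on a schedule each time the feedback, crawl and access statistics have been refreshed.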
Optionally, the method for calculating the domain name similarity weight includes:
establishing a white list website domain name library;
comparing the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library and judging whether any of the following is present, to obtain a judgment result: a common misspelling, vowel substitution, homophone substitution, a wrong top-level domain, a wrong second-level domain, singular/plural variation, look-alike characters, a missing or repeated character, swapped adjacent characters, substitution or insertion of keyboard-adjacent characters, or insertion or deletion of a separator character;
according to the judgment result, calculating similarity scores between the domain name of the URL corresponding to the URL request and the domain names in the white list website domain name library, and taking the maximum of the scores as the domain name similarity weight of the URL corresponding to the URL request.
In a second aspect, an embodiment of the present invention provides a website identification device, the device comprising:
a white list processing unit, configured to receive a uniform resource locator (URL) request from a user accessing a website, look up the URL corresponding to the URL request in a white list, and, if the URL corresponding to the URL request is found in the white list, connect to the URL corresponding to the URL request;
a blacklist processing unit, configured to look up the URL corresponding to the URL request in a blacklist, and, if the URL corresponding to the URL request is found in the blacklist, generate high-risk prompt information;
an abnormal website processing device, configured to, if the URL corresponding to the URL request is found in neither the white list nor the blacklist, calculate each feature weight value of the URL corresponding to the URL request according to preset rules, and identify, according to each feature weight value, whether the URL corresponding to the URL request is an abnormal website.
Optionally, the abnormal website processing device is specifically configured to:
calculate, according to preset rules, feature weight values in four dimensions for the URL corresponding to the URL request: a domain name similarity weight, a web page content similarity weight, a user report volume weight and a secondary access volume weight.
Optionally, the abnormal website specifically includes:
a high-probability abnormal website, a suspected abnormal website and a high-probability normal website.
Optionally, the device further includes:
a secondary identification device, configured to perform secondary identification on the URL corresponding to the URL request if that URL is an abnormal website;
a high-probability abnormal website processing device, configured to, if the result of the secondary identification is the high-probability abnormal website, generate high-risk prompt information, track and identify the high-probability abnormal website, allow secondary connections to the high-probability abnormal website and count the number of secondary connections, and add the high-probability abnormal website to the blacklist;
a high-probability normal website processing device, configured to, if the result of the secondary identification is the high-probability normal website, connect directly to the high-probability normal website and add the high-probability normal website to the white list;
a suspected abnormal website processing device, configured to, if the result of the secondary identification is the suspected abnormal website, generate general-risk prompt information, track and identify the suspected abnormal website, allow secondary connections to the high-probability abnormal website and count the number of secondary connections, and add the suspected abnormal website to a gray list.
Optionally, the device further includes:
an iterative calculation device, configured to perform iterative calculation and identification on the blacklist, the white list and the gray list according to periodically updated information, namely user feedback, crawled web page content, updated web page content feature similarity values and website secondary access volumes;
a high-probability abnormal website iteration device, configured to add the website to the blacklist if the identification result is the high-probability abnormal website;
a high-probability normal website iteration device, configured to add the website to the white list if the identification result is the high-probability normal website;
a suspected abnormal website iteration device, configured to keep the website in the gray list to await the next iterative calculation and identification if the identification result is neither the high-probability abnormal website nor the high-probability normal website.
Optionally, the device for calculating the domain name similarity weight specifically includes:
a white list website establishing device, configured to establish a white list website domain name library;
a comparison device, configured to compare the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library and judge whether any of the following is present, to obtain a judgment result: a common misspelling, vowel substitution, homophone substitution, a wrong top-level domain, a wrong second-level domain, singular/plural variation, look-alike characters, a missing or repeated character, swapped adjacent characters, substitution or insertion of keyboard-adjacent characters, or insertion or deletion of a separator character;
a processing unit, configured to calculate, according to the judgment result, similarity scores between the domain name of the URL corresponding to the URL request and the domain names in the white list website domain name library, and take the maximum of the scores as the domain name similarity weight of the URL corresponding to the URL request.
In a third aspect, an embodiment of the present invention provides an electronic device, the electronic device comprising:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to perform any of the corresponding methods described above.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program, the computer program causing a computer to perform any of the corresponding methods described above.
The website identification method and device provided by the embodiments of the present invention comprehensively analyse and identify abnormal websites across multiple dimensions: domain name similarity, web page content similarity, user report information and website secondary access volume. On this basis a multi-dimensional abnormal website identification algorithm model is established, and websites are classified so as to implement graded access control. Under graded access control, users can give feedback on a website according to the actual situation, which in turn supplies important reference data to the identification algorithm model. This greatly improves the accuracy of abnormal website identification, significantly reduces the system's misjudgment rate, and improves the user experience.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a website identification method in an embodiment of the present invention;
Fig. 2 is a flowchart of another website identification method in an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a website identification device in an embodiment of the present invention;
Fig. 4 is a logic block diagram of an electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the drawings. Evidently, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a website identification method. Fig. 1 is a schematic flowchart of the method; as shown in Fig. 1, the method includes:
Step S101: receive a uniform resource locator (URL) request from a user accessing a website and look up the URL corresponding to the URL request in a white list; if the URL corresponding to the URL request is found in the white list, connect to the URL corresponding to the URL request;
Here, the white list is the counterpart of the blacklist. In computer systems, a great deal of software applies black-and-white-list rules: operating systems, firewalls, antivirus software, mail systems, application software and so on; almost anything involving control uses them. Once a white list is set up, users (or IP addresses, IP packets, emails and so on) on the white list pass with priority and are not rejected as spam, which greatly improves both security and speed. By extension, any application that has a blacklist function has a corresponding white list function.
The URL request is the URL that the user accessing the website currently needs to connect to.
Step S102: look up the URL corresponding to the URL request in a blacklist; if the URL corresponding to the URL request is found in the blacklist, generate high-risk prompt information;
Here, once the blacklist is enabled, users (or IP addresses, IP packets, emails, viruses and so on) placed on the blacklist cannot pass.
Step S103: if the URL corresponding to the URL request is found in neither the white list nor the blacklist, calculate each feature weight value of the URL corresponding to the URL request according to preset rules, and identify, according to each feature weight value, whether the URL corresponding to the URL request is an abnormal website.
Here, a weight is a relative concept defined with respect to a particular index: the weight of an index is its relative importance in the overall evaluation. Weighting means distinguishing the more important from the less important among several evaluation indexes, and the weights corresponding to a set of evaluation indexes form a weight system.
An abnormal website is, for example, a spoofed website intended to steal your identity: the fraudster tries to win your trust under false pretences so that you disclose valuable personal data such as credit card numbers, passwords, account data or other information. Abnormal websites also include pornographic websites, Trojan or virus download links and similar sites. Such schemes may be carried out by telephone or SMS, or online through spam or pop-up windows.
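Steps S101 to S103 can be summarised in a short Python sketch. It assumes the lists are simple in-memory sets of URL strings and that the feature-weight identification of step S103 is supplied by the caller as a function; the names `handle_request` and `classify` and the returned action labels are hypothetical.

```python
from typing import Callable, Set


def handle_request(url: str, whitelist: Set[str], blacklist: Set[str],
                   classify: Callable[[str], str]) -> str:
    """Return the action for one URL request, following the order S101, S102, S103."""
    if url in whitelist:       # S101: white list hit, connect directly
        return "connect"
    if url in blacklist:       # S102: blacklist hit, show a high-risk prompt
        return "high_risk_prompt"
    return classify(url)       # S103: unknown URL, fall back to feature-weight identification
```

Checking the white list first means a trusted URL never pays the cost of the blacklist lookup or of the feature calculation.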
The abnormal website identification method provided by the embodiment of the present invention calculates the feature weight values of a website according to preset rules and, from those values, judges the probability that the website is abnormal. This greatly improves the accuracy of abnormal website identification, significantly reduces the system's misjudgment rate, and improves the user experience.
On the basis of the above method embodiment, calculating each feature weight value of the URL corresponding to the URL request according to preset rules specifically includes:
calculating, according to preset rules, feature weight values in four dimensions for the URL corresponding to the URL request: a domain name similarity weight, a web page content similarity weight, a user report volume weight and a secondary access volume weight.
The domain name similarity weight is calculated as follows:
Step 1: establish a library of commonly used white list website domain names, including the websites of common operators, banks, e-commerce companies and public security organs;
Step 2: compare the domain name of the URL under test with the domain names in the white list one by one, checking for common misspellings, vowel substitution, homophone substitution, a wrong top-level domain, a wrong second-level domain, singular/plural variation, look-alike characters, a missing or repeated character, swapped adjacent characters, substitution or insertion of keyboard-adjacent characters, insertion or deletion of a separator character, and similar changes;
Step 3: from the result of step 2, calculate the similarity score between the domain name and each domain name in the white list, and take the maximum as the domain name similarity score of the domain name.
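The three steps above can be approximated with the following Python sketch. The patent does not give a scoring formula, so the sketch substitutes a generic string-similarity ratio from the standard library's `difflib.SequenceMatcher` for the rule-based checks of step 2; that substitution, the function name `domain_similarity` and the two-entry white list are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from typing import Iterable

# Hypothetical white list domain library (step 1); the two domains appear elsewhere in the text.
WHITELIST_DOMAINS = ["www.10086.cn", "www.ccb.com"]


def domain_similarity(domain: str, whitelist: Iterable[str] = WHITELIST_DOMAINS) -> float:
    """Return the highest similarity ratio (0.0 to 1.0) between `domain` and any white list domain."""
    domain = domain.lower()
    # Steps 2 and 3: score against every white list domain and keep the maximum.
    return max(SequenceMatcher(None, domain, w.lower()).ratio() for w in whitelist)
```

A look-alike such as `www.1oo86.cn` scores much closer to `www.10086.cn` than an unrelated domain does, which is the signal the method relies on. A production system would likely add explicit checks for the individual substitution patterns listed in step 2.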
The calculation method of the web page contents similarity weight are as follows:
The first step establishes web page contents feature database in a common white list website, and feature includes: title, key Word, picture etc., such as the web page contents feature database of www.10086.cn, www.ccb.com;
Second step, the web page contents feature that doubtful abnormal website is crawled by crawler technology;
Wherein, the crawler refers to, (be otherwise known as web crawlers webpage spider, network robot, in the community FOAF Between, it is more frequent to be known as webpage follower), be it is a kind of according to certain rules, automatically grab web message program or Person's script.There are also ant, automatic indexing, simulation program or worms for the rarely needed name of other.
The working principle of crawler technology refers to that web crawlers is the program for automatically extracting webpage, it is search engine The support grid page above and below WWW, is the important composition of search engine, and traditional crawler opens from the URL of one or several Initial pages Begin, the URL obtained on Initial page constantly extracts new URL from current page and be put into team during grabbing webpage Column, certain stop condition until meeting system.The workflow of focused crawler is complex, needs according to certain webpage point Analysis algorithm filtering is unrelated with theme to be linked, and the URL queue to be captured such as retains useful link and put it into.Then, it The webpage URL to be grabbed in next step will be selected from queue according to certain search strategy, and is repeated the above process, until reaching Stop when a certain condition of system.In addition, all webpages by crawler capturing will be stored by system, carry out certain analysis, Filtering, and index is established, so as to inquiry and retrieval later;For focused crawler, the obtained analysis knot of this process Fruit is also possible to provide feedback and guidance to later crawl process.
Third step is taken out the analysis of domain name similarity algorithm from white list feature database and is obtained and doubtful abnormal website similarity Highest white list domain name characteristic information calculates the web page contents characteristic similarity value of doubtful abnormal website.
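Step 3 can be illustrated with a simple set-overlap measure. The patent does not specify how the feature similarity value is computed; the sketch below assumes that the crawled features and the white list features have each been reduced to a set of strings (titles, keywords, image identifiers) and uses the Jaccard index, which is a stand-in chosen for illustration. The function name `content_similarity` is hypothetical.

```python
from typing import Set


def content_similarity(suspect_features: Set[str], reference_features: Set[str]) -> float:
    """Jaccard index of two feature sets: size of the intersection divided by size of the union."""
    union = suspect_features | reference_features
    if not union:
        # Two empty feature sets carry no evidence of similarity.
        return 0.0
    return len(suspect_features & reference_features) / len(union)
```

For example, a suspect page sharing two of four distinct features with the reference page scores 0.5.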
The user report volume weight is calculated as follows:
Step 1: count the number of times the website has been reported by users as an abnormal website or added to a blacklist by users;
Step 2: count the number of times the website has been reported by users as a normal website or appealed by users for white-listing;
Step 3: from these counts, calculate the user report information feature score of the website.
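The text leaves the step 3 formula open. One plausible choice, shown here purely as an assumption, is the share of all user feedback that marks the site as abnormal; the function name `report_score` is hypothetical.

```python
def report_score(abnormal_reports: int, normal_reports: int) -> float:
    """Fraction of user feedback that flags the site as abnormal (0.0 when there is no feedback)."""
    total = abnormal_reports + normal_reports
    # Assumption: with no feedback at all, the score is neutral-low rather than undefined.
    return abnormal_reports / total if total else 0.0
```

With three abnormal reports and one normal report the score is 0.75.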
The calculation method of the secondary amount of access weight are as follows:
The website is counted to be prompted to calculate the secondary access measure feature of website there are the secondary amount of access after risk and accounting Score value.
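The proportion mentioned above can be computed as in the following sketch. It assumes the proportion is taken relative to the number of risk prompts shown, which the text does not state explicitly; the function name `revisit_ratio` and the cap at 1.0 are also assumptions. How the ratio maps to a feature score (for example, whether a high ratio counts for or against the site) is left open by the source and is not decided here.

```python
def revisit_ratio(prompt_count: int, secondary_access_count: int) -> float:
    """Share of risk prompts after which the user went on to access the site anyway."""
    if prompt_count <= 0:
        return 0.0
    # Cap at 1.0 in case one prompt is followed by several secondary accesses.
    return min(secondary_access_count / prompt_count, 1.0)
```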
The abnormal website identification algorithm model combines the URL's domain name similarity score, web page content similarity score, user report information feature score and website secondary access volume feature score, applies a different weight to each, makes a decision, and finally determines whether the URL is a counterfeit URL.
On the basis of the above method embodiment, the abnormal website specifically includes:
a high-probability abnormal website, a suspected abnormal website and a high-probability normal website.
Here, a high-probability abnormal website is, as the name suggests, a website that the feature weight values calculated under the above preset rules indicate is very likely abnormal, that is, a dangerous website.
A high-probability normal website is one that the feature weight values calculated under the above preset rules indicate is very unlikely to be abnormal, that is, a website that is not dangerous.
A suspected abnormal website is one whose likelihood of being abnormal cannot be determined from the feature weight values calculated under the above preset rules; its risk remains to be calculated and investigated.
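The weighted decision and the three resulting categories can be sketched as follows. The patent gives neither the weights nor the decision thresholds, so every number here is an illustrative assumption, as are the names `WEIGHTS` and `classify`. The sketch also assumes each input score lies between 0 and 1 and has already been oriented so that a higher value means more suspicious.

```python
from typing import Dict

# Illustrative weights for the four dimensions; these values are assumptions, not from the patent.
WEIGHTS = {"domain": 0.4, "content": 0.3, "reports": 0.2, "revisit": 0.1}


def classify(scores: Dict[str, float], high: float = 0.7, low: float = 0.3) -> str:
    """Combine the four feature scores with fixed weights and map the total to one of three classes."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    if total >= high:
        return "high_probability_abnormal"
    if total <= low:
        return "high_probability_normal"
    # Totals between the two thresholds are not decisive either way.
    return "suspected_abnormal"
```

Using two thresholds rather than one is what creates the middle band of suspected websites that the method later places on a gray list.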
On the basis of the above method embodiment, the method further includes:
If the URL corresponding to the URL request is an abnormal website, secondary identification is performed on the URL corresponding to the URL request;
Here, secondary identification means tracking and identifying websites that the first identification classified as high-probability abnormal websites or suspected abnormal websites: when the user accesses such a website a second time, the access is allowed through and the secondary access count is recorded, and both kinds of website are iteratively recalculated according to the preset rules, recomputing the feature weight values to obtain a new identification result.
If the result of the secondary identification is the high-probability abnormal website, high-risk prompt information is generated, the high-probability abnormal website is tracked and identified, secondary connections to the high-probability abnormal website are made and counted, and the high-probability abnormal website is added to the blacklist;
If the result of the secondary identification is the high-probability normal website, the high-probability normal website is connected to directly and is added to the whitelist;
If the result of the secondary identification is the suspected abnormal website, medium-risk prompt information is generated, the suspected abnormal website is tracked and identified, secondary connections to the high-probability abnormal website are made and counted, and the suspected abnormal website is added to the gray list.
On the basis of the above method embodiment, the method further includes:
According to each item of user feedback, crawled web page content, updated web page content feature similarity values, and periodically updated secondary access counts, iterative calculation and identification is performed on the blacklist, the whitelist and the gray list;
If the identification result is the high-probability abnormal website, the website is added to the blacklist;
If the identification result is the high-probability normal website, the website is added to the whitelist;
If the identification result is neither the high-probability abnormal website nor the high-probability normal website, the website remains in the gray list and waits for the next iteration of calculation and identification.
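The iterative update of the three lists can be sketched as a single re-evaluation pass. The verdict strings, the dictionary layout, and the function name `reevaluate` are hypothetical, and the classifier is passed in as a callable because the text does not fix its implementation. Treating every undetermined URL as a gray-list entry, including URLs that came from the blacklist or whitelist, is also an assumption.

```python
from typing import Callable, Dict, Set


def reevaluate(
    lists: Dict[str, Set[str]],
    classify_url: Callable[[str], str],
) -> Dict[str, Set[str]]:
    """Re-run identification over every listed URL and rebuild the lists.

    `classify_url` returns "high_abnormal", "high_normal", or anything else
    for an undetermined result (hypothetical verdict labels).
    """
    updated: Dict[str, Set[str]] = {"black": set(), "white": set(), "gray": set()}
    for url in set().union(*lists.values()):
        verdict = classify_url(url)
        if verdict == "high_abnormal":
            updated["black"].add(url)
        elif verdict == "high_normal":
            updated["white"].add(url)
        else:
            # Undetermined: keep in the gray list until the next iteration.
            updated["gray"].add(url)
    return updated
```

In use, this pass would be scheduled after each batch of user feedback or each periodic crawl, with `classify_url` recomputing the feature scores from the refreshed data.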
The website identification method provided by the embodiment of the present invention uses a gray list of suspected abnormal websites, a whitelist of high-probability normal websites, and a blacklist of high-probability abnormal websites (the fake base station URL library, the mobile malware link library, and the URL blacklist database collected by customer service) to establish a three-level access control and user feedback mechanism. Based on user feedback, website secondary access counts and similar information, the URLs in the gray list, the blacklist and the whitelist are continually re-evaluated and updated by iterative calculation, which effectively reduces the misjudgment rate of the system and improves the user experience.
On the basis of the above method embodiment, the method of calculating the domain name similarity weight includes:
Establishing a whitelist website domain name library;
Comparing the domain name of the URL corresponding to the URL request with the domain names in the whitelist website domain name library, and judging whether any of the following is present: common misspellings, vowel character substitution, homophone substitution, wrong top-level domain substitution, wrong second-level domain substitution, singular/plural transformation, visually similar characters, missing or repeated characters, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, or insertion or deletion of separator characters, to obtain a judgment result;
According to the judgment result, calculating similarity scores between the domain name of the URL corresponding to the URL request and the domain names in the whitelist website domain name library, and taking the maximum of these scores as the domain name similarity weight of the URL corresponding to the URL request.
The website identification method provided by the embodiment of the present invention uses the domain name similarity analysis algorithm to analyze the domain name comprehensively from 16 aspects, such as common misspellings, vowel character substitution, homophone substitution, wrong top-level domain substitution, wrong second-level domain substitution, singular/plural transformation, visually similar characters, missing or repeated characters, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters, and therefore achieves high identification accuracy.
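The "score against every whitelisted domain, then take the maximum" step can be illustrated with a short sketch. It does not implement the individual typo checks listed above; as a stand-in it uses the generic string similarity from Python's standard `difflib` module. The function name `domain_similarity` and the sample domains are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable


def domain_similarity(domain: str, whitelist_domains: Iterable[str]) -> float:
    """Highest similarity between `domain` and any whitelisted domain.

    Uses difflib's generic ratio (0.0 to 1.0) in place of the per-pattern
    checks described in the text; returns 0.0 for an empty whitelist.
    """
    candidate = domain.lower()
    return max(
        (
            SequenceMatcher(None, candidate, trusted.lower()).ratio()
            for trusted in whitelist_domains
        ),
        default=0.0,
    )
```

With this stand-in metric, a look-alike such as `examp1e.com` scores high against a whitelisted `example.com`, which is the situation the similarity weight is meant to flag; a production system would add the specific checks (homoglyphs, keyboard-adjacent keys, wrong top-level domain) rather than rely on a single edit-style ratio.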
A specific implementation of the embodiment of the present invention is as follows:
Fig. 2 is a flowchart of another website identification method in an embodiment of the present invention. As shown in Fig. 2, the method specifically includes:
Step 1: the URL request submitted by the user is first filtered against the whitelist; if it hits, the website is judged to be a normal website and the request is allowed through directly;
Step 2: if the whitelist is not hit, the URL is filtered against the blacklist (the blacklist library mainly comprises the fake base station URL library, the mobile malware link library, the URL blacklist database collected by customer service, and a library of trojan-infected websites gathered by crawler from trojan-infection reporting platforms); if it hits, a high-risk prompt is shown to the user;
Step 3: if the blacklist is not hit, feature values are calculated in four dimensions (the domain name similarity weight, the web page content similarity weight, the user report count weight, and the website secondary access count weight), and the abnormal-website identification model classifies the website as a high-probability abnormal website, a suspected abnormal website, or a high-probability normal website;
Step 4: for a high-probability abnormal website, a high-risk prompt is shown to the user, the website is tracked and identified so that a secondary access by the user is allowed through and counted, and the website is stored in the high-probability abnormal website blacklist library; a high-probability normal website is judged to be a normal website, allowed through directly, and added to the high-probability normal website whitelist library; for a suspected abnormal website, a medium-risk prompt is shown to the user, the website is added to the suspected abnormal website gray list library, and the website is tracked and identified so that a secondary access by the user is allowed through and counted;
Step 5: the high-probability abnormal website blacklist library, the suspected abnormal website gray list library, and the high-probability normal website whitelist library are iteratively recalculated and re-identified using each item of user feedback, the web page content feature similarity values updated by periodically crawling web page content, the periodically updated website secondary access counts, and similar information. Websites identified as high-probability abnormal websites continue to be stored in the high-probability abnormal website blacklist library; websites identified as high-probability normal websites are stored in the high-probability normal website whitelist library; all other websites remain in the suspected abnormal website gray list library and wait for the next iteration of calculation.
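Steps 1 to 4 above amount to a short decision pipeline, sketched below under stated assumptions: the lists are plain in-memory sets, the risk model is passed in as a callable, the thresholds 0.7 and 0.3 are invented, and the returned action strings and the name `handle_request` are hypothetical. Tracking and counting of secondary accesses is omitted.

```python
from typing import Callable, Set


def handle_request(
    url: str,
    whitelist: Set[str],
    blacklist: Set[str],
    graylist: Set[str],
    risk_of: Callable[[str], float],
    high: float = 0.7,
    low: float = 0.3,
) -> str:
    """Return the action for one URL request and update the lists in place."""
    # Step 1: whitelist hit, allow through directly.
    if url in whitelist:
        return "allow"
    # Step 2: blacklist hit, warn the user.
    if url in blacklist:
        return "high_risk_warning"
    # Step 3: unknown URL, score it with the (assumed) risk model.
    risk = risk_of(url)
    # Step 4: act on the classification and remember the result.
    if risk >= high:
        blacklist.add(url)
        return "high_risk_warning"
    if risk <= low:
        whitelist.add(url)
        return "allow"
    graylist.add(url)
    return "medium_risk_warning"
```

Because each verdict is written back to a list, a repeat request for the same URL is answered by the cheap lookups in steps 1 and 2 until the periodic re-evaluation in step 5 changes its classification.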
The website identification method provided by the embodiment of the present invention performs a comprehensive analysis of abnormal websites across multiple dimensions: domain name similarity, web page content similarity, user report information, and website secondary access count. On this basis it establishes a multi-dimensional abnormal-website identification model that classifies websites and implements hierarchical access control. Within the hierarchical access control, users can give feedback on a website according to the actual situation, which in turn provides important reference data for the identification model. This greatly improves the identification accuracy for abnormal websites, significantly reduces the misjudgment rate of the system, and improves the user experience.
An embodiment of the present invention provides a website identification device. Fig. 3 is a schematic structural diagram of the website identification device in an embodiment of the present invention. As shown in Fig. 3, the device includes: a whitelist processing unit 301, a blacklist processing unit 302, and an abnormal website processing device 303; wherein,
The whitelist processing unit 301 is configured to receive a uniform resource locator (URL) request from a user accessing a website, look up the URL corresponding to the URL request in the whitelist, and, if that URL is found in the whitelist, connect to it. The blacklist processing unit 302 is configured to look up the URL corresponding to the URL request in the blacklist and, if that URL is found in the blacklist, generate high-risk prompt information. The abnormal website processing device 303 is configured to, if the URL corresponding to the URL request is found in neither the whitelist nor the blacklist, calculate each feature weight value of that URL according to preset rules and identify, according to each feature weight value, whether the URL corresponding to the URL request is an abnormal website.
The website identification device provided by the embodiment of the present invention uses the abnormal website processing device to calculate each feature weight value of a website according to preset rules and, from these feature weight values, to judge the probability that the website is an abnormal website. The embodiment of the present invention greatly improves the identification accuracy for abnormal websites, significantly reduces the misjudgment rate of the system, and improves the user experience.
On the basis of the above method embodiment, calculating each feature weight value of the URL corresponding to the URL request according to preset rules specifically includes:
calculating, according to the preset rules, feature weight values in four dimensions for the URL corresponding to the URL request: the domain name similarity weight, the web page content similarity weight, the user report count weight, and the secondary access count weight.
On the basis of the above method embodiment, the abnormal website specifically includes:
a high-probability abnormal website, a suspected abnormal website, and a high-probability normal website.
Optionally, the device further includes: a secondary identification device, a high-probability abnormal website processing device, a high-probability normal website processing device, and a suspected abnormal website processing device; wherein,
The secondary identification device is configured to perform secondary identification on the URL corresponding to the URL request if that URL is an abnormal website. The high-probability abnormal website processing device is configured to, if the result of the secondary identification is the high-probability abnormal website, generate high-risk prompt information, track and identify the high-probability abnormal website, make and count secondary connections to the high-probability abnormal website, and add the high-probability abnormal website to the blacklist. The high-probability normal website processing device is configured to, if the result of the secondary identification is the high-probability normal website, connect directly to the high-probability normal website and add it to the whitelist. The suspected abnormal website processing device is configured to, if the result of the secondary identification is the suspected abnormal website, generate medium-risk prompt information, track and identify the suspected abnormal website, make and count secondary connections to the high-probability abnormal website, and add the suspected abnormal website to the gray list.
On the basis of the above method embodiment, the device further includes: an iterative calculation device, a high-probability abnormal website iteration device, a high-probability normal website iteration device, and a suspected abnormal website iteration device; wherein,
The iterative calculation device is configured to perform iterative calculation and identification on the blacklist, the whitelist and the gray list according to each item of user feedback, crawled web page content, updated web page content feature similarity values, and periodically updated website secondary access counts. The high-probability abnormal website iteration device is configured to add a website to the blacklist if the identification result is the high-probability abnormal website. The high-probability normal website iteration device is configured to add a website to the whitelist if the identification result is the high-probability normal website. The suspected abnormal website iteration device is configured to keep a website in the gray list, waiting for the next iteration of calculation and identification, if the identification result is neither the high-probability abnormal website nor the high-probability normal website.
The website identification device provided by the embodiment of the present invention uses an iterative calculation device operating on a gray list of suspected abnormal websites, a whitelist of high-probability normal websites, and a blacklist of high-probability abnormal websites (the fake base station URL library, the mobile malware link library, and the URL blacklist database collected by customer service) to establish a three-level access control and user feedback mechanism. Based on user feedback, website secondary access counts and similar information, the URLs in the gray list, the blacklist and the whitelist are continually re-evaluated and updated by iterative calculation, which effectively reduces the misjudgment rate of the system and improves the user experience.
On the basis of the above method embodiment, the device for calculating the domain name similarity weight includes: a whitelist website establishing device, a comparison device, and a processing device; wherein,
The whitelist website establishing device is configured to establish a whitelist website domain name library. The comparison device is configured to compare the domain name of the URL corresponding to the URL request with the domain names in the whitelist website domain name library and judge whether any of the following is present: common misspellings, vowel character substitution, homophone substitution, wrong top-level domain substitution, wrong second-level domain substitution, singular/plural transformation, homoglyphs, missing or repeated characters, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, or insertion or deletion of separator characters, to obtain a judgment result. The processing device is configured to calculate, according to the judgment result, similarity scores between the domain name of the URL corresponding to the URL request and the domain names in the whitelist website domain name library, and to take the maximum of these scores as the domain name similarity weight of the URL corresponding to the URL request.
The website identification device provided by the embodiment of the present invention uses the domain name similarity analysis and calculation device to analyze the domain name comprehensively from 16 aspects, such as common misspellings, vowel character substitution, homophone substitution, wrong top-level domain substitution, wrong second-level domain substitution, singular/plural transformation, visually similar characters, missing or repeated characters, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters, and therefore achieves high identification accuracy.
The website identification device provided by the embodiment of the present invention is used to implement the website identification method provided by the embodiment of the present invention. Its specific implementation is described in detail in the above method embodiment and is not repeated here.
The website identification device provided by the embodiment of the present invention performs a comprehensive analysis of abnormal websites across multiple dimensions: domain name similarity, web page content similarity, user report information, and website secondary access count. On this basis it establishes a multi-dimensional abnormal-website identification model that classifies websites and implements hierarchical access control. Within the hierarchical access control, users can give feedback on a website according to the actual situation, which in turn provides important reference data for the identification model. This greatly improves the identification accuracy for abnormal websites, significantly reduces the misjudgment rate of the system, and improves the user experience.
Fig. 4 is a logic block diagram of an electronic device provided by an embodiment of the present invention. As shown in Fig. 4, the electronic device includes: a processor 401, a memory 402, and a bus 403;
wherein the processor 401 and the memory 402 communicate with each other through the bus 403, and the processor 401 is configured to call program instructions in the memory 402 to execute the methods provided by the above method embodiments.
This embodiment discloses a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium; the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is able to execute the methods provided by the above method embodiments.
This embodiment provides a non-transitory computer-readable storage medium that stores computer instructions, and the computer instructions cause the computer to execute the methods provided by the above method embodiments.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the embodiments of the present invention and do not limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A website identification method, characterized in that the method comprises:
receiving a uniform resource locator (URL) request from a user accessing a website, looking up the URL corresponding to the URL request in a whitelist, and, if the URL corresponding to the URL request is found in the whitelist, connecting to the URL corresponding to the URL request;
looking up the URL corresponding to the URL request in a blacklist, and, if the URL corresponding to the URL request is found in the blacklist, generating high-risk prompt information;
if the URL corresponding to the URL request is found in neither the whitelist nor the blacklist, calculating each feature weight value of the URL corresponding to the URL request according to preset rules, and identifying, according to each feature weight value, whether the URL corresponding to the URL request is an abnormal website.
2. The method according to claim 1, characterized in that calculating each feature weight value of the URL corresponding to the URL request according to preset rules specifically comprises:
calculating, according to the preset rules, feature weight values in four dimensions for the URL corresponding to the URL request: a domain name similarity weight, a web page content similarity weight, a user report count weight, and a secondary access count weight.
3. The method according to claim 1, characterized in that the abnormal website specifically comprises:
a high-probability abnormal website, a suspected abnormal website, and a high-probability normal website.
4. The method according to claim 3, characterized in that the method further comprises:
if the URL corresponding to the URL request is an abnormal website, performing secondary identification on the URL corresponding to the URL request;
if the result of the secondary identification is the high-probability abnormal website, generating high-risk prompt information, tracking and identifying the high-probability abnormal website, making and counting secondary connections to the high-probability abnormal website, and adding the high-probability abnormal website to the blacklist;
if the result of the secondary identification is the high-probability normal website, connecting directly to the high-probability normal website and adding the high-probability normal website to the whitelist;
if the result of the secondary identification is the suspected abnormal website, generating medium-risk prompt information, tracking and identifying the suspected abnormal website, making and counting secondary connections to the high-probability abnormal website, and adding the suspected abnormal website to a gray list.
5. The method according to claim 4, characterized in that the method further comprises:
performing iterative calculation and identification on the blacklist, the whitelist and the gray list according to each item of user feedback, crawled web page content, updated web page content feature similarity values, and periodically updated website secondary access counts;
if the identification result is the high-probability abnormal website, adding the website to the blacklist;
if the identification result is the high-probability normal website, adding the website to the whitelist;
if the identification result is neither the high-probability abnormal website nor the high-probability normal website, keeping the website in the gray list to wait for the next iteration of calculation and identification.
6. The method according to claim 2, characterized in that the method of calculating the domain name similarity weight comprises:
establishing a whitelist website domain name library;
comparing the domain name of the URL corresponding to the URL request with the domain names in the whitelist website domain name library, and judging whether any of the following is present: common misspellings, vowel character substitution, homophone substitution, wrong top-level domain substitution, wrong second-level domain substitution, singular/plural transformation, visually similar characters, missing or repeated characters, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, or insertion or deletion of separator characters, to obtain a judgment result;
according to the judgment result, calculating similarity scores between the domain name of the URL corresponding to the URL request and the domain names in the whitelist website domain name library, and taking the maximum of these scores as the domain name similarity weight of the URL corresponding to the URL request.
7. A website identification device, characterized in that the device comprises:
a whitelist processing unit, configured to receive a uniform resource locator (URL) request from a user accessing a website, look up the URL corresponding to the URL request in a whitelist, and, if the URL corresponding to the URL request is found in the whitelist, connect to the URL corresponding to the URL request;
a blacklist processing unit, configured to look up the URL corresponding to the URL request in a blacklist and, if the URL corresponding to the URL request is found in the blacklist, generate high-risk prompt information;
an abnormal website processing device, configured to, if the URL corresponding to the URL request is found in neither the whitelist nor the blacklist, calculate each feature weight value of the URL corresponding to the URL request according to preset rules, and identify, according to each feature weight value, whether the URL corresponding to the URL request is an abnormal website.
8. An electronic device, characterized by comprising:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores a computer program, and the computer program causes the computer to execute the method according to any one of claims 1 to 6.
CN201710565741.8A 2017-07-12 2017-07-12 Website identification method and device Active CN109274632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710565741.8A CN109274632B (en) 2017-07-12 2017-07-12 Website identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710565741.8A CN109274632B (en) 2017-07-12 2017-07-12 Website identification method and device

Publications (2)

Publication Number Publication Date
CN109274632A true CN109274632A (en) 2019-01-25
CN109274632B CN109274632B (en) 2021-05-11

Family

ID=65147708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710565741.8A Active CN109274632B (en) 2017-07-12 2017-07-12 Website identification method and device

Country Status (1)

Country Link
CN (1) CN109274632B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109831465A (en) * 2019-04-12 2019-05-31 重庆天蓬网络有限公司 A kind of invasion detection method based on big data log analysis
CN110069693A (en) * 2019-04-29 2019-07-30 百度在线网络技术(北京)有限公司 Method and apparatus for determining target pages
CN111147490A (en) * 2019-12-26 2020-05-12 中国科学院信息工程研究所 Directional fishing attack event discovery method and device
CN111756728A (en) * 2020-06-23 2020-10-09 深圳前海微众银行股份有限公司 Vulnerability attack detection method and device
CN112256988A (en) * 2020-10-19 2021-01-22 中国互联网金融协会 Method and device for monitoring cross-border house-buying website, electronic equipment and storage medium
CN112417329A (en) * 2020-10-19 2021-02-26 中国互联网金融协会 Method and device for monitoring illegal internet foreign exchange deposit transaction platform
CN112733057A (en) * 2020-11-27 2021-04-30 杭州安恒信息安全技术有限公司 Network content security detection method, electronic device and storage medium
CN112948725A (en) * 2021-03-02 2021-06-11 北京六方云信息技术有限公司 Phishing website URL detection method and system based on machine learning
CN114389854A (en) * 2021-12-22 2022-04-22 杭州美创科技有限公司 Malicious e-mail detection method and system
CN115801455A (en) * 2023-01-31 2023-03-14 北京微步在线科技有限公司 Website fingerprint-based counterfeit website detection method and device
CN116366338A (en) * 2023-03-30 2023-06-30 北京微步在线科技有限公司 Risk website identification method and device, computer equipment and storage medium
CN116846668A (en) * 2023-07-28 2023-10-03 北京中睿天下信息技术有限公司 Harmful URL detection method, system, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101035128A (en) * 2007-04-18 2007-09-12 大连理工大学 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation
US7854001B1 (en) * 2007-06-29 2010-12-14 Trend Micro Incorporated Aggregation-based phishing site detection
US8544090B1 (en) * 2011-01-21 2013-09-24 Symantec Corporation Systems and methods for detecting a potentially malicious uniform resource locator
CN103428186A (en) * 2012-05-24 2013-12-04 中国移动通信集团公司 Method and device for detecting phishing website
CN103607385A (en) * 2013-11-14 2014-02-26 北京奇虎科技有限公司 Method and apparatus for security detection based on browser
CN106209488A (en) * 2015-04-28 2016-12-07 北京瀚思安信科技有限公司 For detecting the method and apparatus that website is attacked
CN106603490A (en) * 2016-11-10 2017-04-26 上海斐讯数据通信技术有限公司 Phishing website detecting method and system

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109831465A (en) * 2019-04-12 2019-05-31 重庆天蓬网络有限公司 A kind of invasion detection method based on big data log analysis
CN110069693A (en) * 2019-04-29 2019-07-30 百度在线网络技术(北京)有限公司 Method and apparatus for determining target pages
CN110069693B (en) * 2019-04-29 2021-12-24 百度在线网络技术(北京)有限公司 Method and device for determining target page
CN111147490A (en) * 2019-12-26 2020-05-12 中国科学院信息工程研究所 Directional fishing attack event discovery method and device
CN111756728B (en) * 2020-06-23 2021-08-17 深圳前海微众银行股份有限公司 Vulnerability attack detection method and device, computing equipment and storage medium
CN111756728A (en) * 2020-06-23 2020-10-09 深圳前海微众银行股份有限公司 Vulnerability attack detection method and device
CN112256988A (en) * 2020-10-19 2021-01-22 中国互联网金融协会 Method and device for monitoring cross-border house-buying website, electronic equipment and storage medium
CN112417329A (en) * 2020-10-19 2021-02-26 中国互联网金融协会 Method and device for monitoring illegal internet foreign exchange deposit transaction platform
CN112733057A (en) * 2020-11-27 2021-04-30 杭州安恒信息安全技术有限公司 Network content security detection method, electronic device and storage medium
CN112948725A (en) * 2021-03-02 2021-06-11 北京六方云信息技术有限公司 Phishing website URL detection method and system based on machine learning
CN114389854A (en) * 2021-12-22 2022-04-22 杭州美创科技有限公司 Malicious e-mail detection method and system
CN115801455A (en) * 2023-01-31 2023-03-14 北京微步在线科技有限公司 Website fingerprint-based counterfeit website detection method and device
CN115801455B (en) * 2023-01-31 2023-05-26 北京微步在线科技有限公司 Method and device for detecting counterfeit website based on website fingerprint
CN116366338A (en) * 2023-03-30 2023-06-30 北京微步在线科技有限公司 Risk website identification method and device, computer equipment and storage medium
CN116366338B (en) * 2023-03-30 2024-02-06 北京微步在线科技有限公司 Risk website identification method and device, computer equipment and storage medium
CN116846668A (en) * 2023-07-28 2023-10-03 北京中睿天下信息技术有限公司 Harmful URL detection method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN109274632B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN109274632A (en) A kind of recognition methods of website and device
Rao et al. Jail-Phish: An improved search engine based phishing detection system
CN104954372B (en) Evidence collection and verification method and system for phishing websites
US11722520B2 (en) System and method for detecting phishing events
Ding et al. A keyword-based combination approach for detecting phishing webpages
CN104217160B (en) A kind of Chinese detection method for phishing site and system
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
WO2016201938A1 (en) Multi-stage phishing website detection method and system
CN109690547A (en) For detecting the system and method cheated online
CN109831459B (en) Method, device, storage medium and terminal equipment for secure access
CN106330849A (en) Method and device for preventing domain name hijack
CN110099059A (en) A kind of domain name recognition methods, device and storage medium
CN108712426A (en) Crawler recognition method and system based on user-behavior tracking points
Marchal et al. PhishScore: Hacking phishers' minds
CN103067387B (en) A kind of anti-phishing monitoring system and method
CN109768992A (en) Method and device for handling malicious web page scanning, terminal device, and readable storage medium
CN107592305A (en) A kind of anti-brush method and system based on elk and redis
CN107800686A (en) A phishing website recognition method and device
CN110213234A (en) Developer's recognition methods, device, equipment and the storage medium of application file
CN104933069A (en) Method and system for analyzing web browsing statistics of desktop terminal
CN110474889A (en) Phishing website recognition method and device based on web page icons
CN111049837A (en) Malicious website identification and interception technology based on communication operator network transport layer
CN105653941A (en) Heuristic detection method and system for phishing website
CA3100237A1 (en) System and method for digitally fingerprinting phishing actors
Luo et al. Botgraph: Web bot detection based on sitemap

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant