CN104156490A - Method and device for detecting suspicious phishing webpages based on character recognition - Google Patents

Info

Publication number
CN104156490A
Authority
CN
China
Prior art keywords
webpage
web page
word
page picture
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410439878.5A
Other languages
Chinese (zh)
Inventor
李晓波
尹露
杨晶
郭峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201410439878.5A
Publication of CN104156490A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Abstract

The invention discloses a method for detecting suspicious phishing webpages based on character recognition. The method comprises: taking a screenshot of a webpage opened by a browser to obtain a webpage image; performing recognition processing on the webpage image to obtain the text in the webpage image; and using the text in the webpage image as a matching input, matching that input against preset rules, and thereby detecting whether the webpage is a suspicious phishing webpage. Because the method takes a screenshot of the webpage, recognizes the text in the resulting image, and uses the recognized text for detection, it overcomes the defect that text-based detection cannot be applied when the webpage content consists only of pictures, and so it has a wider detection range.

Description

Method and device for detecting suspicious phishing webpages based on character recognition
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for detecting suspicious phishing webpages based on character recognition.
Background
With the development of the Internet, web-based applications have become increasingly widespread. Through a browser, people can check bank accounts, shop online, conduct e-commerce, look up information, acquire knowledge, and enjoy entertainment; the Web provides a convenient and efficient way to interact. However, while browsing the Web, users are often attacked by malicious websites, which can leave their computers infected with viruses, Trojans, and the like.
Malicious websites, such as phishing websites or fraudulent and counterfeit websites, mainly work by imitating the URL or page content of a genuine website and disguising themselves as bank or e-commerce sites, or by exploiting vulnerabilities in the server programs of a genuine website to insert dangerous code into some of its pages, in order to trick users out of private data such as bank or credit card account numbers and passwords. Malicious webpages contain many sensitive features. For example, a financial-fraud page may imitate the text and pictures of an official website, or insert information such as fake ticketing, fake prize notifications, fake online banking, or fake shopping into a genuine page. Most of these features appear in the webpage as text strings.
The main existing means of guarding against malicious websites is as follows: when a user visits a website, the client sends the website's URL to a black-and-white list database on the server for lookup. However, because phishing websites are constantly updated and regenerated, this approach has a low detection rate for malicious websites such as phishing websites and lags behind them. Alternatively, keywords are extracted from the page and uploaded to the server to be matched against a blacklist database. This approach is inefficient, is easily affected by factors such as word order, has a high false-positive rate, and cannot detect anything when the webpage content consists only of pictures.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for detecting suspicious phishing webpages based on character recognition, and a corresponding device, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for detecting suspicious phishing webpages based on character recognition is provided, comprising:
taking a screenshot of a webpage opened by a browser to obtain a webpage image;
performing recognition processing on the webpage image to obtain the text in the webpage image;
using the text in the webpage image as a matching input, and detecting whether the webpage is a suspicious phishing webpage by matching the matching input against preset rules.
According to another aspect of the present invention, a device for detecting suspicious phishing webpages based on character recognition is provided, comprising:
a webpage image acquisition module, adapted to take a screenshot of a webpage opened by a browser to obtain a webpage image;
a character recognition module, adapted to perform recognition processing on the webpage image to obtain the text in the webpage image;
a matching module, adapted to use the text in the webpage image as a matching input and to detect whether the webpage is a suspicious phishing webpage by matching the matching input against preset rules.
According to the scheme provided by the present invention, a screenshot is taken of the webpage opened by the browser to obtain a webpage image; recognition processing is performed on the webpage image to obtain the text in it; and the text in the webpage image is used as a matching input that is matched against preset rules to detect whether the webpage is a suspicious phishing webpage. This scheme takes a screenshot of the webpage, recognizes the text in the webpage image, and uses the recognized text to detect whether the webpage is a suspicious phishing webpage. It thereby avoids the defect that text-based detection cannot be used when the webpage content consists only of pictures, and its detection range is wider.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for detecting suspicious phishing webpages based on character recognition according to one embodiment of the present invention;
Fig. 2 shows a flowchart of a method for detecting suspicious phishing webpages based on character recognition according to another embodiment of the present invention;
Fig. 3 shows a flowchart of a method for detecting suspicious phishing webpages based on character recognition according to yet another embodiment of the present invention;
Fig. 4 shows a flowchart of a method for detecting suspicious phishing webpages based on character recognition according to still another embodiment of the present invention;
Fig. 5 shows a structural block diagram of a device for detecting suspicious phishing webpages based on character recognition according to one embodiment of the present invention;
Fig. 6 shows a structural block diagram of a device for detecting suspicious phishing webpages based on character recognition according to another embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a flowchart of a method for detecting suspicious phishing webpages based on character recognition according to one embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S100, take a screenshot of the webpage opened by the browser to obtain a webpage image.
The webpage opened by the browser is the webpage that a user opens through a browser on a client, where the client may be a computer, a mobile terminal, a tablet device, or the like; the present invention places no limitation on this. Each webpage corresponds to a URL (uniform resource locator). After obtaining the URL of the webpage visited by the user, the server renders the webpage corresponding to that URL and takes a screenshot of it to obtain the webpage image. The content of the webpage image is the same as the content of the webpage the user opened in the browser on the client.
Step S110, perform recognition processing on the webpage image to obtain the text in the webpage image.
A webpage generally contains both text and pictures, so the webpage image obtained may contain the text and pictures of the webpage. After the webpage image is obtained, recognition processing is performed on it to obtain the text it contains.
Step S120, use the text in the webpage image as a matching input, and detect whether the webpage is a suspicious phishing webpage by matching the matching input against preset rules.
The matching input here is the source item used when matching against the preset rules; it is matched against the reference items set in the preset rules. The matching input changes dynamically, depending on the text in the webpage image. The preset rules define the conditions that the matching input must satisfy. If the matching input satisfies the conditions, the webpage can be judged to be a suspicious phishing webpage; if it does not, the webpage can be judged not to be a suspicious phishing webpage.
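The matching of step S120 can be sketched as follows. This is a minimal illustration only, assuming that the matching inputs are held in a dictionary and that each preset rule is a predicate over that dictionary; the names `Inputs`, `Rule`, `SENSITIVE_KEYWORDS`, `contains_sensitive_keyword`, and `is_suspicious` are hypothetical and do not come from the patent, and the keyword list is a placeholder.

```python
from typing import Callable, Dict, List

# Hypothetical representation: matching inputs are a dict, rules are predicates.
Inputs = Dict[str, object]
Rule = Callable[[Inputs], bool]

# Placeholder keyword list, for illustration only.
SENSITIVE_KEYWORDS = ("prize", "part-time job")


def contains_sensitive_keyword(inputs: Inputs) -> bool:
    """Condition: the recognized text contains at least one sensitive keyword."""
    text = str(inputs.get("text", "")).lower()
    return any(keyword in text for keyword in SENSITIVE_KEYWORDS)


def is_suspicious(inputs: Inputs, rules: List[Rule]) -> bool:
    """The page is flagged only if every preset condition is satisfied.

    An empty rule list never flags a page.
    """
    return bool(rules) and all(rule(inputs) for rule in rules)
```

Keeping the rules as separate predicates allows further matching inputs, such as those introduced in the later embodiments, to be added without changing the matching function.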
According to the method provided by the above embodiment of the present invention, a screenshot is taken of the webpage opened by the browser to obtain a webpage image; recognition processing is performed on the webpage image to obtain the text in it; and the text in the webpage image is used as a matching input that is matched against preset rules to detect whether the webpage is a suspicious phishing webpage. This embodiment takes a screenshot of the webpage, recognizes the text in the webpage image, and uses the recognized text to detect whether the webpage is a suspicious phishing webpage. It thereby avoids the defect that text-based detection cannot be used when the webpage content consists only of pictures, and its detection range is wider.
Fig. 2 shows a flowchart of a method for detecting suspicious phishing webpages based on character recognition according to another embodiment of the present invention. As shown in Fig. 2, the method comprises the following steps:
Step S200, classify a plurality of sample images to obtain at least one sample image under each category.
Specifically, screenshots are taken of a plurality of known phishing webpages, and the captured images are used as sample images. The sample images are then classified by website type, for example into categories such as Taobao, online game trading, prize winning, and part-time job recruitment, so that at least one sample image is obtained under each category. Those skilled in the art can choose suitable classification criteria according to the needs of the actual application; no specific limitation is imposed here.
Step S210, for each category, crop the text region blocks from each sample image under that category.
Specifically, a text region block is an image containing text that is cropped from a sample image, and it is smaller than the sample image. Optionally, the text region blocks are cropped manually from the sample images classified as above.
Step S220, extract text from the text region blocks as the training sample set of the category.
After the text region blocks are obtained, the text in them is extracted, and the extracted text is used as the training sample set of each category.
Step S230, train on the training sample set to obtain the character recognition library corresponding to the category.
For subsequent webpages under a given category, webpage images are captured by a method similar to steps S200 to S220 and text is extracted from them. The extracted text is trained character by character on the basis of the training sample set, and the trained characters are added to the sample set, finally yielding the character recognition library corresponding to each category. Taking the extracted text "淘宝网" (Taobao) as an example: based on the training sample set under the Taobao category, the three characters 淘, 宝, and 网 are trained separately; after training, the three characters are added to the training sample set under the Taobao category. After a large amount of such training, the character recognition library corresponding to the category is obtained.
Steps S200 to S230 above are the process of building the character recognition library for each category from known phishing webpages. With the libraries built, webpages opened by the browser can be recognized, as described in the following steps. Note that this embodiment is not limited to the above method of building the character recognition libraries; other methods may also be used, for example training on all samples together without classification to obtain a single character recognition library.
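The bookkeeping behind steps S200 to S230 can be sketched as below. This is a deliberately simplified stand-in: the patent trains a recognizer on character images, whereas this sketch only collects the distinct characters seen per category, which is the part of the process that can be shown without an OCR engine. The function name `build_libraries` and the input format (pairs of category and extracted text) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_libraries(samples: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect, per category, the set of characters extracted from its samples.

    `samples` yields (category, text extracted from a text region block).
    Real training on character images is out of scope for this sketch.
    """
    libraries: Dict[str, Set[str]] = defaultdict(set)
    for category, text in samples:
        for ch in text:
            if not ch.isspace():  # whitespace carries no glyph to recognize
                libraries[category].add(ch)
    return dict(libraries)
```

For example, `build_libraries([("taobao", "淘宝网")])` yields a library for the hypothetical category `"taobao"` containing the three characters 淘, 宝, and 网.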
Step S240, take a screenshot of the webpage opened by the browser to obtain a webpage image.
The webpage opened by the browser is the webpage that a user opens through a browser on a client, where the client may be a computer, a mobile terminal, a tablet device, or the like; the present invention places no limitation on this. Each webpage corresponds to a URL. After obtaining the URL of the webpage visited by the user, the server renders the webpage corresponding to that URL and takes a screenshot of it to obtain the webpage image. The content of the webpage image is the same as the content of the webpage the user opened in the browser on the client.
Step S250, locate the text region blocks in the webpage image.
Optionally, this embodiment uses the swt (Stationary Wavelet Transform) algorithm to locate the text portions of the webpage image and determine the text region blocks. Specifically, the webpage image is input to the swt algorithm library; the coordinate parameters output by the swt algorithm library are obtained; and the text region blocks are cropped from the webpage image according to the coordinate parameters. That is, the obtained webpage image is sent to the swt algorithm library, which analyzes it and computes the coordinates of the text portions in the webpage image; the server obtains the computed coordinate parameters and crops the text region blocks from the webpage image according to them.
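The cropping part of step S250 can be illustrated with a small sketch. It assumes that the coordinate parameters are given as (x, y, width, height) in pixels and that the image is a list of pixel rows; the patent does not specify either format, and the swt library call itself is not modeled. The name `crop_region` is hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x, y, width, height)


def crop_region(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Cut the rectangle described by `box` out of a row-major image."""
    x, y, width, height = box
    return [list(row[x:x + width]) for row in image[y:y + height]]


def crop_regions(image: Sequence[Sequence[int]], boxes: Sequence[Box]) -> List[List[List[int]]]:
    """Crop one text region block per coordinate tuple reported by the detector."""
    return [crop_region(image, box) for box in boxes]
```

Each cropped block is smaller than the full webpage image, which is what narrows the area passed on to character recognition.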
Step S260, based on the character recognition library corresponding to each category, perform optical character recognition on the text region blocks in the webpage image to obtain the text in the webpage image, and use the text in the webpage image as one matching input.
Specifically, optical character recognition is a technique that analyzes an image file to obtain the text information in it. After optical character recognition is performed on the text regions in the webpage image, character images are obtained. These character images are compared against the character recognition library corresponding to each category; when a character image hits a character in a library, the text in the webpage image is determined. Because the character recognition libraries correspond to categories, this step can identify the category to which the webpage belongs at the same time as it recognizes the text in the webpage image.
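How a library hit can also reveal the category of the webpage is sketched below. The sketch works on already recognized characters rather than on character images, and it picks the category whose library is hit most often; both the scoring rule and the name `identify_category` are assumptions made for illustration, since the patent does not state how hits in several libraries are resolved.

```python
from typing import Dict, Optional, Set


def identify_category(recognized: str, libraries: Dict[str, Set[str]]) -> Optional[str]:
    """Return the category whose library matches the most recognized characters.

    Returns None when no library is hit at all. Ties go to the category
    that appears first in `libraries`.
    """
    best_category: Optional[str] = None
    best_hits = 0
    for category, characters in libraries.items():
        hits = sum(1 for ch in recognized if ch in characters)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category
```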
The matching input here is the source item used when matching against the preset rules; it is matched against the reference items set in the preset rules. This embodiment has multiple matching inputs, of which the text in the webpage image is one.
Step S270, obtain the visit count, within a preset time period, of the website to which the webpage belongs, and use the visit count as another matching input.
Specifically, the visit count of the website within the preset time period is obtained mainly to judge whether the website to which the webpage belongs is a new website. The server can learn how often the website has been visited from the records of browsers visiting the webpage. Taking a preset time period of 1 month as an example, the server obtains the visit records of the webpage within 1 month and determines, from the number of visit records, the visit count of the website to which the webpage belongs within that month.
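Counting visits within the preset time period can be sketched as follows, assuming that each visit record carries a timestamp and that one month is approximated as 30 days. The function name `visit_count` and the record format are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable


def visit_count(visit_times: Iterable[datetime], now: datetime,
                period: timedelta = timedelta(days=30)) -> int:
    """Count visit records whose timestamp lies within `period` before `now`."""
    start = now - period
    return sum(1 for t in visit_times if start <= t <= now)
```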
Step S280, judge whether the text in the webpage image contains a keyword, and judge whether the visit count is less than a preset value. If the text in the webpage image contains a keyword and the visit count is less than the preset value, perform step S290; if the text in the webpage image contains no keyword and/or the visit count is not less than the preset value, the method ends.
This step matches the above matching inputs against the preset rules. Here the preset rules are: the text contains a sensitive keyword, and the website to which the webpage belongs is a new website. Sensitive keywords may be, for example, Taobao, online game trading, prize winning, or part-time job recruitment. The preset value of the visit count is set to 100 visits per month, and a website whose visit count is below this preset value is defined as a new website. This preset value is set according to the records of many users' webpage visits; those skilled in the art can set it as required, and no specific limitation is imposed here. Specifically, steps S260 and S270 determined the text in the webpage image and the visit count of the website to which the webpage belongs as the matching inputs; this step judges both of them, that is, whether the text in the webpage image contains a sensitive keyword and whether the visit count is less than the preset value of 100 visits per month.
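The rule of step S280 amounts to a conjunction of two conditions, which the following sketch makes explicit. The threshold of 100 visits per month is the example value from the text; the function name `is_suspicious_new_site`, the case-insensitive substring match, and the sample keywords in the usage note are assumptions.

```python
from typing import Iterable


def is_suspicious_new_site(text: str, visits: int, keywords: Iterable[str],
                           threshold: int = 100) -> bool:
    """Flag the page only if it contains a sensitive keyword AND the site is new.

    A site counts as new when its visit count in the period is below `threshold`.
    """
    lowered = text.lower()
    has_keyword = any(keyword.lower() in lowered for keyword in keywords)
    return has_keyword and visits < threshold
```

For instance, `is_suspicious_new_site("Taobao refund center", 12, ["taobao"])` evaluates to `True`, while the same text with 100 or more visits does not trigger the rule.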
Step S290, determine that the webpage is a suspicious phishing webpage.
Once it has been judged that the webpage contains sensitive keywords such as Taobao, online game trading, prize winning, or part-time job recruitment, and that the visit count of the website to which the webpage belongs is below the preset value of 100 visits per month, so that the website is a new website, the webpage is judged to be a suspicious phishing webpage.
According to the method provided by this embodiment, a plurality of sample images are classified to obtain at least one sample image under each category; for each category, the text region blocks in each sample image under that category are cropped; text is extracted from the text region blocks as the training sample set of the category; training is performed on the training sample set to obtain the character recognition library corresponding to the category; a screenshot is taken of the webpage opened by the browser to obtain a webpage image; the text region blocks in the webpage image are located; based on the character recognition library corresponding to each category, optical character recognition is performed on the text region blocks to obtain the text in the webpage image; the text in the webpage image is matched against the preset rules as a matching input; the visit count, within a preset time period, of the website to which the webpage belongs is obtained and matched against the preset rules as a matching input; and it is judged whether the text contains a keyword and whether the visit count is less than the preset value, and if both hold, the webpage is determined to be a suspicious phishing webpage. In this embodiment, locating the text region blocks in the webpage image narrows the scope of subsequent character recognition and improves recognition accuracy. Recognition quality does not depend solely on the result of optical character recognition: character recognition libraries are built from existing webpages, and the text in the webpage image is recognized according to these libraries, which further improves recognition accuracy. The visit count of the website can be computed easily from users' visit records, a process that is both fast and convenient. Combining the two detection rules, namely whether the visit count of the website is less than the preset value and whether the text in the webpage image contains a keyword, reduces the false-positive rate.
The above embodiment detects whether a webpage is a suspicious phishing webpage by checking whether the text contains a keyword and whether the website to which the webpage belongs is a new website. The present invention is not limited to this; other methods can also be used to detect whether a webpage is a suspicious phishing webpage. In the following embodiments, recognition is based on the character recognition libraries described in the above embodiment, and the process of building them is not repeated.
Fig. 3 shows a flowchart of a method for detecting suspicious phishing webpages based on character recognition according to yet another embodiment of the present invention. As shown in Fig. 3, the method comprises the following steps:
Step S300, take a screenshot of the webpage opened by the browser to obtain a webpage image.
The webpage opened by the browser is the webpage that a user opens through a browser on a client, where the client may be a computer, a mobile terminal, a tablet device, or the like; the present invention places no limitation on this. Each webpage corresponds to a URL. After obtaining the URL of the webpage visited by the user, the server renders the webpage corresponding to that URL and takes a screenshot of it to obtain the webpage image. The content of the webpage image is the same as the content of the webpage the user opened in the browser on the client.
Step S310, locate the text region blocks in the webpage image.
Optionally, this embodiment uses the swt (Stationary Wavelet Transform) algorithm to locate the text portions of the webpage image and determine the text region blocks. Specifically, the webpage image is input to the swt algorithm library; the coordinate parameters output by the swt algorithm library are obtained; and the text region blocks are cropped from the webpage image according to the coordinate parameters. That is, the obtained webpage image is sent to the swt algorithm library, which analyzes it and computes the coordinates of the text portions in the webpage image; the server obtains the computed coordinate parameters and crops the text region blocks from the webpage image according to them.
Step S320, based on the character recognition library corresponding to each category, perform optical character recognition on the text region blocks in the webpage image to obtain the text in the webpage image, and use the text in the webpage image as one matching input.
Specifically, optical character recognition is a technique that analyzes an image file to obtain the text information in it. After optical character recognition is performed on the text regions in the webpage image, character images are obtained. These character images are compared against the character recognition library corresponding to each category; when a character image hits a character in a library, the text in the webpage image is determined. Because the character recognition libraries correspond to categories, this step can identify the category to which the webpage belongs at the same time as it recognizes the text in the webpage image.
The matching input here is the source item used when matching against the preset rules; it is matched against the reference items set in the preset rules. This embodiment has multiple matching inputs, of which the text in the webpage image is one.
Step S330, obtain the information of the website to which the webpage belongs, and use this website information as another matching input.
Specifically, the information of the website to which the webpage belongs is obtained, and whether the website is an enterprise-filed website is judged by whether this information includes enterprise filing information. Website information can be looked up on a website query server.
Step S340, judge whether the text in the webpage image contains a keyword, and judge whether the website to which the webpage belongs has enterprise filing information. If the text in the webpage image contains a keyword and the website has no enterprise filing information, perform step S350; if the text in the webpage image contains no keyword and/or the website has enterprise filing information, the method ends.
This step matches the above matching inputs against the preset rules. Here the preset rules are: the text contains a sensitive keyword, and the website to which the webpage belongs is not an enterprise-filed website. The keywords here are sensitive keywords, such as Taobao, online game trading, prize winning, or part-time job recruitment.
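The rule of step S340 can be sketched in the same way as the visit-count rule. The sketch assumes that the website information returned by the query server is available as a dictionary with an `enterprise_filing` entry; that key, the function name `is_suspicious_unfiled_site`, and the record value in the test data are hypothetical, since the patent does not describe the format of the website information.

```python
from typing import Dict, Iterable, Optional


def is_suspicious_unfiled_site(text: str, site_info: Dict[str, Optional[str]],
                               keywords: Iterable[str]) -> bool:
    """Flag the page if it contains a sensitive keyword AND the site has no
    enterprise filing information (missing or empty `enterprise_filing`)."""
    lowered = text.lower()
    has_keyword = any(keyword.lower() in lowered for keyword in keywords)
    has_filing = bool(site_info.get("enterprise_filing"))
    return has_keyword and not has_filing
```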
Step S350, detecting webpage is suspicious fishing webpage.
Comprise such as Taobao, network game and conclude the business, get the winning number in a bond, recruit the responsive key words such as part-time judging webpage, and under webpage, website does not have enterprise's record information, after website is the website put on record of nonbusiness under webpage, judge that webpage is suspicious fishing webpage.
In addition, also can be by the method for Fig. 2 and combination embodiment illustrated in fig. 3, by judging whether to comprise in word, reprint under key word, webpage the visit capacity of website and whether be less than under preset value and webpage website and whether there is enterprise's record information whether detect webpage be suspicious fishing webpage, do not repeat them here.
The method providing according to the present embodiment, by searching the character area piece in Web page picture, can dwindle follow-up identification literal scope, improved recognition accuracy, make word identification quality not only depend on the result of optical character identification, based on existing webpage, set up word identification storehouse, and according to this word identification storehouse, the word in Web page picture is identified, improved recognition accuracy; By detecting website under webpage, whether there is word in enterprise's record information and Web page picture and whether comprise key word whether detect webpage be suspicious fishing webpage, reduced rate of false alarm.
Fig. 4 shows a flowchart of a method for detecting suspicious phishing webpages based on text recognition according to still another embodiment of the present invention. As shown in Fig. 4, the method comprises the following steps:
Step S400: take a screenshot of the webpage opened by the browser to obtain a webpage picture.
The webpage opened by the browser is the webpage that the user opens through the browser of a client, where the client may be a computer, a mobile terminal, a tablet device, and so on; the present invention does not limit this. Each webpage corresponds to a URL. After obtaining the URL of the webpage the user accesses, the server renders the webpage corresponding to that URL and takes a screenshot to obtain the webpage picture, whose content is identical to the content of the webpage the user opened through the browser on the client.
Step S410: locate the text region blocks in the webpage picture.
Optionally, this embodiment uses the swt (Stationary Wavelet Transform) algorithm to find the text portions of the webpage picture and determine the text region blocks. Specifically: input the webpage picture to the swt algorithm library; obtain the coordinate parameters output by the swt algorithm library; crop the text region blocks from the webpage picture according to the coordinate parameters. That is, the obtained webpage picture is sent to the swt algorithm library, which analyzes it and computes the coordinates of the text portions; the server obtains the computed coordinate parameters and crops the text region blocks from the webpage picture accordingly.
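The cropping part of step S410 can be sketched as follows. The description does not say what form the coordinate parameters take, so the `(x, y, width, height)` box format is an assumption, as is the use of a nested list of pixel values to stand in for the webpage picture. The call into the swt algorithm library itself is not shown.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of pixel values (stand-in for a real picture)
Box = Tuple[int, int, int, int]  # assumed format: (x, y, width, height)


def crop_text_blocks(image: Image, boxes: List[Box]) -> List[Image]:
    """Cut one sub-image per box, clamping each box to the picture bounds."""
    height = len(image)
    width = len(image[0]) if image else 0
    blocks = []
    for x, y, w, h in boxes:
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(width, x + w), min(height, y + h)
        if x0 >= x1 or y0 >= y1:
            continue  # box lies entirely outside the picture
        blocks.append([row[x0:x1] for row in image[y0:y1]])
    return blocks
```

Clamping keeps a partially out-of-range box usable instead of discarding it; whether that is desirable depends on how the detector reports coordinates.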
Step S420: based on the text recognition library corresponding to each category, perform optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture, and use the text in the webpage picture as one matching input item.
Specifically, optical character recognition is a technique for obtaining the text information in an image file by analyzing and processing it. After optical character recognition is performed on the text regions in the webpage picture, character images are obtained; these are compared with the text recognition library corresponding to each category, and when a character image hits text in a recognition library, the text in the webpage picture is determined. Because each text recognition library corresponds to a category, this step can identify the category of the webpage as well as the text in the webpage picture.
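The library lookup in step S420 can be sketched as below. The sketch assumes the optical character recognition stage has already produced candidate strings and that each recognition library is a set of strings keyed by category. Both are simplifications for illustration, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def identify_text(
    candidates: Iterable[str], libraries: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Return (text, category) for every OCR candidate that hits a library entry."""
    hits = []
    for candidate in candidates:
        for category in sorted(libraries):  # fixed order for reproducible output
            if candidate in libraries[category]:
                hits.append((candidate, category))
    return hits
```

A candidate that hits no library is dropped, so the result carries both the accepted text and the category it came from.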
A matching input item here is a source item that is matched against the preset rules; it is compared with the reference items set in the preset rules. In this embodiment there are multiple matching input items, of which the text in the webpage picture is one.
Step S430: identify the webpage using a recognition method based on URL and/or IP and/or webpage content, and use the recognition result as another matching input item.
Step S440: detect whether the webpage is a suspicious phishing webpage by combining the recognition result with the result of judging whether the text in the webpage picture contains a keyword.
The following describes in detail how the result of judging whether the text in the webpage picture contains a keyword is combined with the results of the three webpage recognition methods to determine whether the webpage is a suspicious phishing webpage:
First, URL-based recognition: judge whether the URL of the webpage the user accesses belongs to the local blacklist or whitelist, and use the result as a matching input item. If the URL belongs to the local blacklist, the webpage is a suspicious phishing webpage regardless of whether the text in the webpage picture contains a keyword; if the URL belongs to the local whitelist, the webpage is not a suspicious phishing webpage regardless of whether the text in the webpage picture contains a keyword.
Second, IP-based recognition: judge whether the URL of the webpage the user accesses belongs to a high-risk domain name or high-risk IP, and use the result as a matching input item. If it does, and the text in the webpage picture contains a keyword, the webpage is detected as a suspicious phishing webpage; if it does not, the webpage is not a suspicious phishing webpage regardless of whether the text in the webpage picture contains a keyword.
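The URL-based and IP-based decisions described above can be sketched as two small functions. Matching on the host name instead of the full URL, passing the resolved IP address in as an argument, and the function names are all assumptions made for this illustration. The example addresses use reserved documentation ranges.

```python
from typing import Optional, Set
from urllib.parse import urlsplit


def check_url_lists(url: str, blacklist: Set[str], whitelist: Set[str]) -> Optional[bool]:
    """True: suspicious, False: not suspicious, None: the lists give no verdict.

    The list verdict applies regardless of whether the text contains a keyword.
    """
    host = urlsplit(url).hostname or ""
    if host in blacklist:
        return True
    if host in whitelist:
        return False
    return None


def check_high_risk(
    url: str,
    resolved_ip: str,
    high_risk_hosts: Set[str],
    high_risk_ips: Set[str],
    has_keyword: bool,
) -> bool:
    """Suspicious only if the host or IP is high-risk AND the text contains a keyword."""
    host = urlsplit(url).hostname or ""
    high_risk = host in high_risk_hosts or resolved_ip in high_risk_ips
    return high_risk and has_keyword
```

The asymmetry is deliberate: list membership decides on its own, while a high-risk domain or IP only counts together with a keyword hit.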
Finally, webpage-content-based recognition, which can take either of the following two forms:
<1>: After the accessed webpage has finished loading, extract a plurality of webpage elements from the webpage, where a webpage element may be the binary-encoded data obtained by converting text content on the page with a specific encoding. Obtain the feature ID corresponding to each webpage element, where each webpage element corresponds uniquely to a feature ID. Match the feature IDs of the webpage elements against the feature IDs in a feature library, where the feature library is generated by the server and is the set of feature IDs corresponding to malicious webpage elements found in a large number of phishing webpages. If at least one of the feature IDs matches a feature ID in the feature library, compute an evaluation score for the webpage from the webpage's information. Compare the evaluation score with a set threshold and use the comparison result as a matching input item: if the score is higher than the threshold and the text in the webpage picture contains a keyword, the webpage is detected as a suspicious phishing webpage; if the score is lower than the threshold, the webpage is not a suspicious phishing webpage regardless of whether the text in the webpage picture contains a keyword.
<2>: After the browser's webpage load event has completed, extract a plurality of webpage elements from the webpage, where a webpage element may be the binary-encoded data obtained by converting text content on the page with a specific encoding. Obtain the feature ID corresponding to each webpage element, where each webpage element corresponds uniquely to a feature ID. Match the feature IDs of the webpage elements against the feature IDs in a feature library, where the feature library is generated by the server and is the set of feature IDs corresponding to malicious webpage elements found in a large number of phishing webpages. If at least one of the feature IDs matches a feature ID in the feature library, combine the matched feature IDs and compute at least one template ID. A template ID may be fixed-length binary data generated by a specific hash algorithm, for example a 64-bit binary string generated by the MD5 algorithm; as with feature IDs, when different feature IDs are combined in the same order, the template ID computed by the same algorithm is unique and irreversible. Match the at least one template ID against the template IDs in a phishing webpage recognition library and use the matching result as a matching input item: if at least one template ID matches a template ID in the phishing webpage recognition library and the text in the webpage picture contains a keyword, the webpage is detected as a suspicious phishing webpage; if no template ID matches a template ID in the phishing webpage recognition library, the webpage is not a suspicious phishing webpage regardless of whether the text in the webpage picture contains a keyword.
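Form <2> can be sketched with Python's standard `hashlib`. Several details here are assumptions for illustration: feature IDs are taken to be MD5 hex digests of the element bytes, the "same order" is obtained by sorting the matched feature IDs, and the 64-bit template ID is obtained by keeping the first 8 bytes of the 128-bit MD5 digest. The description states only that a 64-bit value is generated by MD5, not how.

```python
import hashlib
from typing import Iterable, List, Set


def feature_id(element: bytes) -> str:
    """Assumed feature ID: MD5 hex digest of the encoded webpage element."""
    return hashlib.md5(element).hexdigest()


def matched_feature_ids(elements: Iterable[bytes], feature_db: Set[str]) -> List[str]:
    """Feature IDs of the page's elements that also appear in the feature library."""
    return [fid for fid in (feature_id(e) for e in elements) if fid in feature_db]


def template_id(feature_ids: Iterable[str]) -> str:
    """Combine feature IDs in a fixed (sorted) order and hash them to 64 bits."""
    combined = "|".join(sorted(feature_ids)).encode("ascii")
    return hashlib.md5(combined).digest()[:8].hex()  # 8 bytes = 64 bits (assumption)


def is_suspicious_by_template(
    elements: Iterable[bytes],
    feature_db: Set[str],
    template_db: Set[str],
    has_keyword: bool,
) -> bool:
    """Suspicious if the page's template ID is known AND the text contains a keyword."""
    matched = matched_feature_ids(elements, feature_db)
    if not matched:
        return False
    return template_id(matched) in template_db and has_keyword
```

Sorting before hashing makes the template ID independent of the order in which elements were extracted. The sketch computes a single template ID per page, whereas the description allows more than one.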
According to the method provided by the above embodiment of the present invention, locating the text region blocks in the webpage picture narrows the scope of subsequent text recognition and improves recognition accuracy. So that recognition quality does not depend solely on the result of optical character recognition, a text recognition library is built from existing webpages, and the text in the webpage picture is recognized according to this library, which further improves recognition accuracy. Combining the result of judging whether the text in the webpage picture contains a keyword with the result of identifying the webpage by a method based on URL and/or IP and/or webpage content reduces the false-positive rate and speeds up detection.
Fig. 5 shows a structural block diagram of a device for detecting suspicious phishing webpages based on text recognition according to an embodiment of the present invention. As shown in Fig. 5, the device comprises: a webpage picture acquisition module 500, a text recognition module 510, and a matching module 520.
The webpage picture acquisition module 500 is adapted to take a screenshot of the webpage opened by the browser to obtain a webpage picture.
The webpage opened by the browser is the webpage that the user opens through the browser of a client, where the client may be a computer, a mobile terminal, a tablet device, and so on; the present invention does not limit this. Each webpage corresponds to a URL (uniform resource locator). After obtaining the URL of the webpage the user accesses, the server renders the webpage corresponding to that URL and takes a screenshot to obtain the webpage picture, whose content is identical to the content of the webpage the user opened through the browser on the client.
The text recognition module 510 is adapted to perform recognition processing on the webpage picture to obtain the text in the webpage picture.
A webpage generally contains both text and pictures, so the webpage picture obtained contains the text and pictures of the webpage. After the webpage picture is obtained, recognition processing is performed on it to obtain the text in the webpage picture.
The matching module 520 is adapted to use the text in the webpage picture as a matching input item and to detect whether the webpage is a suspicious phishing webpage by matching the matching input item against preset rules.
A matching input item here is a source item that is matched against the preset rules; it is compared with the reference items set in the preset rules. The matching input item changes dynamically, since it depends on the text in the webpage picture. The preset rules define the conditions that the matching input item must satisfy: if the matching input item satisfies the conditions, the webpage can be judged to be a suspicious phishing webpage; if it does not, the webpage can be judged not to be a suspicious phishing webpage.
According to the device provided by the above embodiment of the present invention, a screenshot of the webpage opened by the browser is taken to obtain a webpage picture; recognition processing is performed on the webpage picture to obtain the text in it; and the text in the webpage picture is used as a matching input item, which is matched against preset rules to detect whether the webpage is a suspicious phishing webpage. This embodiment takes a screenshot of the webpage, recognizes the text in the webpage picture, and uses the recognized text to detect whether the webpage is a suspicious phishing webpage. This avoids the defect that, when the webpage content consists only of pictures, text cannot be used to detect whether the webpage is a suspicious phishing webpage, so the detection range is wider.
Fig. 6 shows a structural block diagram of a device for detecting suspicious phishing webpages based on text recognition according to another embodiment of the present invention. As shown in Fig. 6, the device comprises: a classification module 600, a cropping module 610, an extraction module 620, a training module 630, a text recognition library 640, a webpage picture acquisition module 650, a text recognition module 660, and a matching module 670.
The classification module 600 is adapted to classify a plurality of sample pictures to obtain at least one sample picture under each category.
Specifically, screenshots are taken of a plurality of different webpages and used as sample pictures. The sample pictures are classified according to the type of website, for example into categories such as Taobao, online game trading, and part-time recruitment, to obtain at least one sample picture under each category. Those skilled in the art can select suitable classification criteria according to the needs of the practical application; no specific limitation is imposed here.
The cropping module 610 is adapted to crop, for each category, the text region blocks in each sample picture under that category.
Specifically, a text region block is a picture containing text that is cropped from a sample picture, and its size is smaller than that of the sample picture. An operator uses a mouse to crop the text region blocks from the sample pictures classified as above.
The extraction module 620 is adapted to extract text from the text region blocks as the training sample set of the category.
After the text region blocks are obtained, the text in them is extracted, and the extracted text is used as the training sample set of each category.
The training module 630 is adapted to perform training based on the training sample set.
The text recognition library 640 is adapted to store the trained text.
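The data flow through modules 600 to 640 can be sketched as grouping extracted text by category. This description does not specify the training procedure, so the sketch simply collects the distinct non-empty strings per category as a stand-in for training. The function name and input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_recognition_libraries(
    samples: Iterable[Tuple[str, List[str]]]
) -> Dict[str, Set[str]]:
    """Group text extracted from sample pictures into one library per category.

    Each sample is (category, texts), where texts are the strings extracted
    from that sample picture's text region blocks.
    """
    libraries: Dict[str, Set[str]] = defaultdict(set)
    for category, texts in samples:
        for text in texts:
            cleaned = text.strip()
            if cleaned:  # ignore empty extractions
                libraries[category].add(cleaned)
    return dict(libraries)
```

Keeping one library per category is what later lets a library hit reveal the category of the webpage as well as its text.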
The webpage picture acquisition module 650 is adapted to take a screenshot of the webpage opened by the browser to obtain a webpage picture.
The webpage opened by the browser is the webpage that the user opens through the browser of a client, where the client may be a computer, a mobile terminal, a tablet device, and so on; the present invention does not limit this. Each webpage corresponds to a URL (uniform resource locator). After obtaining the URL of the webpage the user accesses, the server renders the webpage corresponding to that URL and takes a screenshot to obtain the webpage picture, whose content is identical to the content of the webpage the user opened through the browser on the client.
The text recognition module 660 is adapted to perform recognition processing on the webpage picture to obtain the text in the webpage picture.
The text recognition module 660 further comprises: a search unit 710, adapted to locate the text region blocks in the webpage picture.
Specifically, the swt algorithm is used to find the text portions of the webpage picture and determine the text region blocks. The search unit 710 further comprises: an input unit 730, adapted to input the webpage picture to the swt algorithm library.
An acquiring unit 740, adapted to obtain the coordinate parameters output by the swt algorithm library.
A cropping unit 750, adapted to crop the text region blocks from the webpage picture according to the coordinate parameters.
A text recognition unit 720, adapted to perform optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture. Specifically, the obtained webpage picture is sent to the swt algorithm library, which analyzes it and computes the coordinates of the text portions; the server obtains the computed coordinate parameters and crops the text region blocks from the webpage picture accordingly.
The text recognition unit 720 is specifically adapted to: based on the text recognition library corresponding to each category, perform optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture.
Specifically, optical character recognition is a technique for obtaining the text information in an image file by analyzing and processing it. After optical character recognition is performed on the text regions in the webpage picture, character images are obtained; these are compared with the text recognition library corresponding to each category, and when a character image hits text in a recognition library, the text in the webpage picture is determined.
The matching module 670 is adapted to use the text in the webpage picture as a matching input item and to detect whether the webpage is a suspicious phishing webpage by matching the matching input item against preset rules.
The device further comprises: a visit count acquisition module 680, adapted to obtain the visit count, within a preset period, of the website to which the webpage belongs, and to use the visit count as a matching input item.
Specifically, the visit count of the website to which the webpage belongs within the preset period is obtained mainly to judge whether that website is a new website. The server can learn how the website has been accessed from the records of browsers accessing the webpage. Taking a preset period of 1 month as an example, the server obtains the access records of the webpage within 1 month and determines the website's visit count for that month from the number of access records.
The matching module 670 is specifically adapted to: judge whether the text in the webpage picture contains a keyword, and judge whether the visit count is less than a preset value; if the text in the webpage picture contains a keyword and the visit count is less than the preset value, detect that the webpage is a suspicious phishing webpage.
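The visit-count rule applied by modules 680 and 670 can be sketched as follows. The 30-day window approximates the 1-month example in the text. The threshold value used in the test, the function names, and the representation of access records as timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable


def visit_count(
    visit_times: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(days=30),
) -> int:
    """Count access records that fall inside the preset period ending at `now`."""
    start = now - window
    return sum(1 for t in visit_times if start <= t <= now)


def is_suspicious_by_traffic(has_keyword: bool, visits: int, threshold: int) -> bool:
    """Suspicious only if the text contains a keyword AND the site has few visits."""
    return has_keyword and visits < threshold
```

A low visit count serves as a proxy for a newly created website, which is the property this rule is meant to detect.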
The device further comprises: a filing information acquisition module 690, adapted to obtain the filing information of the website to which the webpage belongs and to use the filing information as a matching input item.
The matching module 670 is specifically adapted to: judge whether the text in the webpage picture contains a keyword, and judge whether the website to which the webpage belongs has enterprise filing information; if the text in the webpage picture contains a keyword and the website to which the webpage belongs has no enterprise filing information, detect that the webpage is a suspicious phishing webpage.
The device further comprises: a webpage recognition module 700, adapted to identify the webpage using a device that performs recognition based on URL and/or IP and/or webpage content, and to use the recognition result as a matching input item;
The matching module 670 is further adapted to: detect whether the webpage is a suspicious phishing webpage by combining the recognition result with the result of judging whether the text in the webpage picture contains a keyword. For the detection manner of the matching module 670, refer to the description of the method embodiments.
According to the device provided by the above embodiment of the present invention, a plurality of sample pictures are classified to obtain at least one sample picture under each category; for each category, the text region blocks in each sample picture under that category are cropped; text is extracted from the text region blocks as the training sample set of the category; training is performed based on the training sample set to obtain the text recognition library corresponding to the category; a screenshot of the webpage opened by the browser is taken to obtain a webpage picture; the text region blocks in the webpage picture are located; based on the text recognition library corresponding to each category, optical character recognition is performed on the text region blocks in the webpage picture to obtain the text in the webpage picture; the text in the webpage picture is matched against preset rules as a matching input item; the visit count, within a preset period, of the website to which the webpage belongs is obtained and matched against the preset rules as a matching input item; and it is judged whether the text in the webpage picture contains a keyword and whether the visit count is less than a preset value, and if the text in the webpage picture contains a keyword and the visit count is less than the preset value, the webpage is detected as a suspicious phishing webpage. In this embodiment, locating the text region blocks in the webpage picture narrows the scope of subsequent text recognition and improves recognition accuracy. So that recognition quality does not depend solely on the result of optical character recognition, a text recognition library is built from existing webpages, and the text in the webpage picture is recognized according to this library, which further improves recognition accuracy. The visit count of the website to which the webpage belongs can easily be tallied from users' access records, a process that is both fast and convenient. Whether the webpage is a suspicious phishing webpage is detected by combining the two rules of whether the visit count of the website to which the webpage belongs is less than a preset value and whether the text in the webpage picture contains a keyword, or by combining the two rules of whether the website to which the webpage belongs has enterprise filing information and whether the text in the webpage picture contains a keyword, or by combining the result of judging whether the text in the webpage picture contains a keyword with the result of identifying the webpage by a device that performs recognition based on URL and/or IP and/or webpage content. This reduces the false-positive rate and speeds up detection.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description above of a specific language is made in order to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and placed in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for detecting suspicious phishing webpages based on text recognition according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words can be interpreted as names.
The invention discloses: A1. A method for detecting suspicious phishing webpages based on text recognition, comprising:
taking a screenshot of the webpage opened by the browser to obtain a webpage picture;
performing recognition processing on the webpage picture to obtain the text in the webpage picture;
using the text in the webpage picture as a matching input item, and detecting whether the webpage is a suspicious phishing webpage by matching the matching input item against preset rules.
A2. The method according to A1, wherein performing recognition processing on the webpage picture to obtain the text in the webpage picture further comprises:
locating the text region blocks in the webpage picture;
performing optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture.
A3. The method according to A2, wherein locating the text region blocks in the webpage picture further comprises:
inputting the webpage picture to an swt algorithm library;
obtaining the coordinate parameters output by the swt algorithm library;
cropping the text region blocks from the webpage picture according to the coordinate parameters.
A4. The method according to any one of A1-A3, further comprising: obtaining the visit count, within a preset period, of the website to which the webpage belongs, and using the visit count as a matching input item;
wherein detecting whether the webpage is a suspicious phishing webpage by matching the matching input item against preset rules further comprises:
judging whether the text in the webpage picture contains a keyword, and judging whether the visit count is less than a preset value;
if the text in the webpage picture contains a keyword and the visit count is less than the preset value, detecting that the webpage is a suspicious phishing webpage.
A5. The method according to any one of A1-A3, wherein detecting whether the webpage is a suspicious phishing webpage by matching the matching input item against preset rules further comprises:
judging whether the text in the webpage picture contains a keyword, and judging whether the website to which the webpage belongs has enterprise filing information;
if the text in the webpage picture contains a keyword and the website to which the webpage belongs has no enterprise filing information, detecting that the webpage is a suspicious phishing webpage.
A6. The method according to any one of A1-A3, further comprising: identifying the webpage using a recognition method based on URL and/or IP and/or webpage content, and using the recognition result as a matching input item;
wherein detecting whether the webpage is a suspicious phishing webpage by matching the matching input item against preset rules further comprises: detecting whether the webpage is a suspicious phishing webpage by combining the recognition result with the result of judging whether the text in the webpage picture contains a keyword.
A7. The method according to A2 or A3, further comprising, before taking a screenshot of the webpage opened by the browser to obtain a webpage picture:
classifying a plurality of sample pictures to obtain at least one sample picture under each category;
for each category, cropping the text region blocks in each sample picture under that category, extracting text from the text region blocks as the training sample set of the category, and performing training based on the training sample set to obtain the text recognition library corresponding to the category.
A8. The method according to A6, wherein performing optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture is specifically: based on the text recognition library corresponding to each category, performing optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture.
The invention also discloses: B9, a kind of device based on the suspicious fishing webpage of word recognition detection, comprising:
Web page picture acquisition module, the webpage that is suitable for browser to open carries out sectional drawing, obtains Web page picture;
Word identification module, is suitable for described Web page picture to carry out identifying processing, obtains the word in described Web page picture;
Matching module, is suitable for using the word in described Web page picture as coupling input item, and whether by described coupling input item is mated with preset rules, detecting described webpage is suspicious fishing webpage.
B10. The device according to B9, wherein the text recognition module further comprises:
a search unit, adapted to search for the text region blocks in the webpage picture;
a text recognition unit, adapted to perform optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture.
B11. The device according to B10, wherein the search unit further comprises:
an input unit, adapted to input the webpage picture to the SWT algorithm library;
an acquiring unit, adapted to obtain the coordinate parameters output by the SWT algorithm library;
a cropping unit, adapted to crop the text region blocks from the webpage picture according to the coordinate parameters.
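The cropping step of B11 can be illustrated as follows. This is only a sketch: the source does not specify the format of the coordinate parameters, so the `(left, top, right, bottom)` box layout with exclusive right and bottom edges is an assumption, as is the representation of the picture as rows of pixel values. The SWT library call itself is not shown, and `crop_text_blocks` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

# Assumed coordinate format: (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]


def crop_text_blocks(
    picture: Sequence[Sequence[int]], boxes: Sequence[Box]
) -> List[List[List[int]]]:
    """Cut each box out of a picture stored as rows of pixel values."""
    height = len(picture)
    width = len(picture[0]) if height else 0
    blocks = []
    for left, top, right, bottom in boxes:
        # Clamp the box to the picture so out-of-range coordinates are safe.
        left, right = max(0, left), min(width, right)
        top, bottom = max(0, top), min(height, bottom)
        # Skip boxes that are empty after clamping.
        if left < right and top < bottom:
            blocks.append([list(row[left:right]) for row in picture[top:bottom]])
    return blocks
```

Each returned block would then be handed to the optical character recognition step.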
B12. The device according to any one of B9-B11, further comprising: a visit volume acquisition module, adapted to obtain the visit volume, within a preset time period, of the website to which the webpage belongs, and to use the visit volume as a matching input item;
the matching module is specifically adapted to: judge whether the text in the webpage picture contains a keyword, and judge whether the visit volume is less than a preset value; if the text in the webpage picture contains a keyword and the visit volume is less than the preset value, detect the webpage as a suspicious phishing webpage.
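The rule in B12 is a conjunction of two conditions. A minimal sketch, assuming plain substring matching for keywords; the function name `is_suspicious_by_visits` and the example keyword and threshold values in the usage note are hypothetical, since the source gives no concrete values.

```python
from typing import Iterable


def is_suspicious_by_visits(
    text: str, keywords: Iterable[str], visit_volume: int, preset_value: int
) -> bool:
    """Suspicious only if a keyword is present AND visit volume is low."""
    contains_keyword = any(keyword in text for keyword in keywords)
    return contains_keyword and visit_volume < preset_value
```

For example, a page whose screenshot contains "password" and whose site had 12 visits against a preset value of 1000 would be flagged, while the same text on a site with 50000 visits would not.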
B13. The device according to any one of B9-B11, further comprising: a record information acquisition module, adapted to obtain the record information of the website to which the webpage belongs, and to use the record information as a matching input item;
the matching module is specifically adapted to: judge whether the text in the webpage picture contains a keyword, and judge whether the website to which the webpage belongs has enterprise record information; if the text in the webpage picture contains a keyword and the website to which the webpage belongs does not have enterprise record information, detect the webpage as a suspicious phishing webpage.
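The record-information rule in B13 has the same shape, with the second condition negated. In this sketch the record information is modelled as an optional dictionary with a `"type"` field; that layout and the value `"enterprise"` are assumptions made for illustration, as the source does not describe how record information is stored.

```python
from typing import Dict, Iterable, Optional


def is_suspicious_by_record(
    text: str, keywords: Iterable[str], record_info: Optional[Dict[str, str]]
) -> bool:
    """Suspicious if a keyword is present AND no enterprise record exists."""
    contains_keyword = any(keyword in text for keyword in keywords)
    # Assumed layout: a record marks itself as an enterprise record via "type".
    has_enterprise_record = (
        record_info is not None and record_info.get("type") == "enterprise"
    )
    return contains_keyword and not has_enterprise_record
```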
B14. The device according to any one of B9-B11, further comprising: a webpage identification module, adapted to identify the webpage using a device that performs identification based on URL and/or IP and/or webpage content, and to use the recognition result as a matching input item;
the matching module is further adapted to: detect whether the webpage is a suspicious phishing webpage by combining the judgment result of whether the text in the webpage picture contains a keyword with the recognition result.
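The source does not state how the keyword judgment and the URL/IP/content recognition result are combined. The sketch below therefore shows one possible policy, chosen purely for illustration: a definite verdict from the other recognizer takes precedence, and the keyword judgment decides only when that recognizer has no opinion. The `Verdict` enum and `combine_results` function are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    """Assumed outcomes of the URL/IP/content-based recognizer."""
    SAFE = "safe"
    UNKNOWN = "unknown"
    MALICIOUS = "malicious"


def combine_results(contains_keyword: bool, verdict: Verdict) -> bool:
    """Return True if the page should be treated as suspicious."""
    if verdict is Verdict.MALICIOUS:
        return True
    if verdict is Verdict.SAFE:
        return False
    # No definite verdict: fall back to the screenshot keyword judgment.
    return contains_keyword
```

Other policies, such as requiring both signals to agree, fit the same function signature.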
B15. The device according to B10 or B11, further comprising:
a classification module, adapted to classify a plurality of sample pictures to obtain at least one sample picture under each class;
a cropping module, adapted to, for each class, crop the text region blocks from each sample picture under that class;
an extraction module, adapted to extract text from the text region blocks as the training sample set of that class;
a training module, adapted to train on the training sample set to obtain the text recognition library corresponding to that class.
B16. The device according to B14, wherein the text recognition unit is specifically adapted to: based on the text recognition library corresponding to each class, perform optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture.
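The per-class flow of B15 and B16 can be sketched in two helpers: one groups extracted sample text into a training set per class, and one runs a text region block through the recognition library of every class. Actual OCR training is not shown. The function names are hypothetical, each library is modelled as a plain callable, and collecting the non-empty results from all libraries is an assumption, since the source does not say how per-class results are merged.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def build_training_sets(
    samples: Iterable[Tuple[str, List[str]]]
) -> Dict[str, List[str]]:
    """Group text extracted from sample pictures by class label."""
    training_sets: Dict[str, List[str]] = defaultdict(list)
    for label, extracted_text in samples:
        training_sets[label].extend(extracted_text)
    return dict(training_sets)


def recognize_with_all_libraries(
    block: object, libraries: Dict[str, Callable[[object], str]]
) -> List[str]:
    """Run one text region block through every class's library."""
    results = []
    for label in sorted(libraries):  # fixed order keeps output deterministic
        text = libraries[label](block)
        if text:  # keep only libraries that recognized something
            results.append(text)
    return results
```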

Claims (10)

1. A method for detecting suspicious phishing webpages based on text recognition, comprising:
taking a screenshot of a webpage opened by a browser to obtain a webpage picture;
performing recognition processing on the webpage picture to obtain the text in the webpage picture;
using the text in the webpage picture as a matching input item, and detecting whether the webpage is a suspicious phishing webpage by matching the matching input item against preset rules.
2. The method according to claim 1, wherein performing recognition processing on the webpage picture to obtain the text in the webpage picture further comprises:
searching for the text region blocks in the webpage picture;
performing optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture.
3. The method according to claim 2, wherein searching for the text region blocks in the webpage picture further comprises:
inputting the webpage picture to the SWT algorithm library;
obtaining the coordinate parameters output by the SWT algorithm library;
cropping the text region blocks from the webpage picture according to the coordinate parameters.
4. The method according to any one of claims 1-3, further comprising: obtaining the visit volume, within a preset time period, of the website to which the webpage belongs, and using the visit volume as a matching input item;
wherein detecting whether the webpage is a phishing webpage by matching the matching input items against the preset rules further comprises:
judging whether the text in the webpage picture contains a keyword, and judging whether the visit volume is less than a preset value;
if the text in the webpage picture contains a keyword and the visit volume is less than the preset value, detecting the webpage as a suspicious phishing webpage.
5. The method according to any one of claims 1-3, wherein detecting whether the webpage is a phishing webpage by matching the matching input items against the preset rules further comprises:
judging whether the text in the webpage picture contains a keyword, and judging whether the website to which the webpage belongs has enterprise record information;
if the text in the webpage picture contains a keyword and the website to which the webpage belongs does not have enterprise record information, detecting the webpage as a suspicious phishing webpage.
6. The method according to any one of claims 1-3, further comprising: identifying the webpage using an identification method based on URL and/or IP and/or webpage content, and using the recognition result as a matching input item;
wherein detecting whether the webpage is a phishing webpage by matching the matching input items against the preset rules further comprises: detecting whether the webpage is a suspicious phishing webpage by combining the judgment result of whether the text in the webpage picture contains a keyword with the recognition result.
7. The method according to claim 2 or 3, further comprising, before taking a screenshot of the webpage opened by the browser to obtain the webpage picture:
classifying a plurality of sample pictures to obtain at least one sample picture under each class;
for each class, cropping the text region blocks from each sample picture under that class, extracting text from the text region blocks as the training sample set of that class, and training on the training sample set to obtain the text recognition library corresponding to that class.
8. The method according to claim 6, wherein performing optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture is specifically: based on the text recognition library corresponding to each class, performing optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture.
9. A device for detecting suspicious phishing webpages based on text recognition, comprising:
a webpage picture acquisition module, adapted to take a screenshot of a webpage opened by a browser to obtain a webpage picture;
a text recognition module, adapted to perform recognition processing on the webpage picture to obtain the text in the webpage picture;
a matching module, adapted to use the text in the webpage picture as a matching input item and to detect whether the webpage is a suspicious phishing webpage by matching the matching input item against preset rules.
10. The device according to claim 9, wherein the text recognition module further comprises:
a search unit, adapted to search for the text region blocks in the webpage picture;
a text recognition unit, adapted to perform optical character recognition on the text region blocks in the webpage picture to obtain the text in the webpage picture.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410439878.5A CN104156490A (en) 2014-09-01 2014-09-01 Method and device for detecting suspicious fishing webpage based on character recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410439878.5A CN104156490A (en) 2014-09-01 2014-09-01 Method and device for detecting suspicious fishing webpage based on character recognition

Publications (1)

Publication Number Publication Date
CN104156490A true CN104156490A (en) 2014-11-19

Family

ID=51881988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410439878.5A Pending CN104156490A (en) 2014-09-01 2014-09-01 Method and device for detecting suspicious fishing webpage based on character recognition

Country Status (1)

Country Link
CN (1) CN104156490A (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104571812A (en) * 2014-12-10 2015-04-29 联想(北京)有限公司 Information processing method and electronic equipment
CN104598902A (en) * 2015-01-29 2015-05-06 百度在线网络技术(北京)有限公司 Method and device for identifying screenshot and browser
CN105574724A (en) * 2015-12-24 2016-05-11 北京奇虎科技有限公司 Safety payment protection method and system, safety application client, and safety server
CN105681907A (en) * 2015-12-30 2016-06-15 中电长城网际系统应用有限公司 Information verification system and method thereof
CN106020662A (en) * 2016-04-29 2016-10-12 广东小天才科技有限公司 Method and device for rapid content searching of mobile terminal
CN106293365A (en) * 2015-05-20 2017-01-04 广州爱九游信息技术有限公司 A kind of method and device obtaining content of pages
CN106357682A (en) * 2016-10-26 2017-01-25 华中科技大学 Phishing website detecting method
CN106599001A (en) * 2015-10-20 2017-04-26 中国电信股份有限公司 Webpage content acquisition method and system
CN106845323A (en) * 2015-12-03 2017-06-13 阿里巴巴集团控股有限公司 A kind of collection method of marking data, device and certificate recognition system
CN107111539A (en) * 2016-07-11 2017-08-29 深圳前海达闼云端智能科技有限公司 The monitoring method and device of terminal device
CN107135220A (en) * 2017-05-08 2017-09-05 北京智能管家科技有限公司 Cheat page detection method, computer equipment and computer-readable recording medium
CN107707561A (en) * 2017-11-01 2018-02-16 北京知道创宇信息技术有限公司 penetration testing method and device
CN108052523A (en) * 2017-11-03 2018-05-18 中国互联网络信息中心 Gambling site recognition methods and system based on convolutional neural networks
CN108563963A (en) * 2018-04-16 2018-09-21 深信服科技股份有限公司 Webpage tamper detection method, device, equipment and computer readable storage medium
CN108564087A (en) * 2018-04-13 2018-09-21 腾讯科技(深圳)有限公司 Risk Identification Method, device, terminal and the storage medium of small advertisement
CN109657088A (en) * 2018-09-30 2019-04-19 阿里巴巴集团控股有限公司 A kind of picture risk checking method, device, equipment and medium
CN110059468A (en) * 2019-04-02 2019-07-26 阿里巴巴集团控股有限公司 A kind of small routine Risk Identification Method and device
CN110309453A (en) * 2018-03-07 2019-10-08 中移(苏州)软件技术有限公司 A kind of webpage jump method and device
CN110619103A (en) * 2019-09-18 2019-12-27 珠海格力电器股份有限公司 Webpage image-text detection method and device and storage medium
CN111027065A (en) * 2019-10-28 2020-04-17 哈尔滨安天科技集团股份有限公司 Lesovirus identification method and device, electronic equipment and storage medium
CN111125584A (en) * 2019-12-23 2020-05-08 杭州安恒信息技术股份有限公司 Public opinion analysis-based consumption rebate platform identification method
CN111814643A (en) * 2020-06-30 2020-10-23 杭州科度科技有限公司 Black and gray URL (Uniform resource locator) identification method and device, electronic equipment and medium
US10984274B2 (en) 2018-08-24 2021-04-20 Seagate Technology Llc Detecting hidden encoding using optical character recognition
CN114095278A (en) * 2022-01-19 2022-02-25 南京明博互联网安全创新研究院有限公司 Phishing website detection method based on mixed feature selection frame

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937438A (en) * 2009-06-30 2011-01-05 富士通株式会社 Method and device for extracting webpage content
WO2012089005A1 (en) * 2010-12-31 2012-07-05 成都市华为赛门铁克科技有限公司 Method and apparatus for phishing web page detection
CN102902703A (en) * 2012-07-19 2013-01-30 中国人民解放军国防科学技术大学 Network sensitive information-oriented screenshot discovery and locking callback method
CN103810425A (en) * 2012-11-13 2014-05-21 腾讯科技(深圳)有限公司 Method and device for detecting malicious website
CN103986731A (en) * 2014-05-30 2014-08-13 北京奇虎科技有限公司 Method and device for detecting phishing web pages through picture matching

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104571812A (en) * 2014-12-10 2015-04-29 联想(北京)有限公司 Information processing method and electronic equipment
CN104571812B (en) * 2014-12-10 2020-04-24 联想(北京)有限公司 Information processing method and electronic equipment
CN104598902A (en) * 2015-01-29 2015-05-06 百度在线网络技术(北京)有限公司 Method and device for identifying screenshot and browser
CN106293365A (en) * 2015-05-20 2017-01-04 广州爱九游信息技术有限公司 A kind of method and device obtaining content of pages
CN106293365B (en) * 2015-05-20 2019-11-26 广州爱九游信息技术有限公司 A kind of method and device obtaining content of pages
CN106599001A (en) * 2015-10-20 2017-04-26 中国电信股份有限公司 Webpage content acquisition method and system
CN106845323B (en) * 2015-12-03 2020-04-28 阿里巴巴集团控股有限公司 Marking data collection method and device and certificate identification system
CN106845323A (en) * 2015-12-03 2017-06-13 阿里巴巴集团控股有限公司 A kind of collection method of marking data, device and certificate recognition system
CN105574724A (en) * 2015-12-24 2016-05-11 北京奇虎科技有限公司 Safety payment protection method and system, safety application client, and safety server
CN105681907A (en) * 2015-12-30 2016-06-15 中电长城网际系统应用有限公司 Information verification system and method thereof
CN106020662B (en) * 2016-04-29 2019-04-30 广东小天才科技有限公司 Content method for fast searching and device applied to mobile terminal
CN106020662A (en) * 2016-04-29 2016-10-12 广东小天才科技有限公司 Method and device for rapid content searching of mobile terminal
CN107111539A (en) * 2016-07-11 2017-08-29 深圳前海达闼云端智能科技有限公司 The monitoring method and device of terminal device
CN106357682A (en) * 2016-10-26 2017-01-25 华中科技大学 Phishing website detecting method
CN107135220B (en) * 2017-05-08 2020-06-16 北京儒博科技有限公司 Method for detecting fraudulent web page, computer device and computer readable storage medium
CN107135220A (en) * 2017-05-08 2017-09-05 北京智能管家科技有限公司 Cheat page detection method, computer equipment and computer-readable recording medium
CN107707561B (en) * 2017-11-01 2020-05-19 北京知道创宇信息技术股份有限公司 Penetration testing method and device
CN107707561A (en) * 2017-11-01 2018-02-16 北京知道创宇信息技术有限公司 penetration testing method and device
CN108052523A (en) * 2017-11-03 2018-05-18 中国互联网络信息中心 Gambling site recognition methods and system based on convolutional neural networks
CN110309453A (en) * 2018-03-07 2019-10-08 中移(苏州)软件技术有限公司 A kind of webpage jump method and device
CN108564087A (en) * 2018-04-13 2018-09-21 腾讯科技(深圳)有限公司 Risk Identification Method, device, terminal and the storage medium of small advertisement
CN108564087B (en) * 2018-04-13 2023-05-12 腾讯科技(深圳)有限公司 Risk identification method, device, terminal and storage medium for small advertisements
CN108563963A (en) * 2018-04-16 2018-09-21 深信服科技股份有限公司 Webpage tamper detection method, device, equipment and computer readable storage medium
US10984274B2 (en) 2018-08-24 2021-04-20 Seagate Technology Llc Detecting hidden encoding using optical character recognition
CN109657088A (en) * 2018-09-30 2019-04-19 阿里巴巴集团控股有限公司 A kind of picture risk checking method, device, equipment and medium
CN110059468A (en) * 2019-04-02 2019-07-26 阿里巴巴集团控股有限公司 A kind of small routine Risk Identification Method and device
CN110059468B (en) * 2019-04-02 2023-09-26 创新先进技术有限公司 Applet risk identification method and device
CN110619103A (en) * 2019-09-18 2019-12-27 珠海格力电器股份有限公司 Webpage image-text detection method and device and storage medium
CN111027065A (en) * 2019-10-28 2020-04-17 哈尔滨安天科技集团股份有限公司 Lesovirus identification method and device, electronic equipment and storage medium
CN111027065B (en) * 2019-10-28 2023-09-08 安天科技集团股份有限公司 Leucavirus identification method and device, electronic equipment and storage medium
CN111125584A (en) * 2019-12-23 2020-05-08 杭州安恒信息技术股份有限公司 Public opinion analysis-based consumption rebate platform identification method
CN111125584B (en) * 2019-12-23 2023-06-06 杭州安恒信息技术股份有限公司 Consumer rebate platform identification method based on public opinion analysis
CN111814643A (en) * 2020-06-30 2020-10-23 杭州科度科技有限公司 Black and gray URL (Uniform resource locator) identification method and device, electronic equipment and medium
CN114095278A (en) * 2022-01-19 2022-02-25 南京明博互联网安全创新研究院有限公司 Phishing website detection method based on mixed feature selection frame
CN114095278B (en) * 2022-01-19 2022-05-24 南京明博互联网安全创新研究院有限公司 Phishing website detection method based on mixed feature selection frame

Similar Documents

Publication Publication Date Title
CN104156490A (en) Method and device for detecting suspicious fishing webpage based on character recognition
CN107204960B (en) Webpage identification method and device and server
CN103544436B (en) System and method for distinguishing phishing websites
CN106295333B (en) method and system for detecting malicious code
CN103685307A (en) Method, system, client and server for detecting phishing fraud webpage based on feature library
Buber et al. NLP based phishing attack detection from URLs
CN104899508B (en) A kind of multistage detection method for phishing site and system
CN107659570A (en) Webshell detection methods and system based on machine learning and static and dynamic analysis
Ceschin et al. The need for speed: An analysis of brazilian malware classifiers
Zhao et al. Classifying malicious URLs using gated recurrent neural networks
CN104158828B (en) The method and system of suspicious fishing webpage are identified based on cloud content rule base
CN104143008B (en) The method and device of fishing webpage is detected based on picture match
CN102446255B (en) Method and device for detecting page tamper
CN108038173B (en) Webpage classification method and system and webpage classification equipment
CN103685308A (en) Detection method and system of phishing web pages, client and server
CN104168293A (en) Method and system for recognizing suspicious phishing web page in combination with local content rule base
CN109104421B (en) Website content tampering detection method, device, equipment and readable storage medium
CN103678692A (en) Safety scanning method and device of downloaded file
Liu et al. An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment
CN103986731A (en) Method and device for detecting phishing web pages through picture matching
CN113098887A (en) Phishing website detection method based on website joint characteristics
CN104239582A (en) Method and device for identifying phishing webpage based on feature vector model
Deshpande et al. Detection of phishing websites using Machine Learning
CN112532624B (en) Black chain detection method and device, electronic equipment and readable storage medium
CN112328936A (en) Website identification method, device and equipment and computer readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20141119