CN116366338B - Risk website identification method and device, computer equipment and storage medium

Publication number
CN116366338B
Authority
CN
China
Prior art keywords
domain name
website
name information
target
risk
Legal status
Active
Application number
CN202310334071.4A
Other languages
Chinese (zh)
Other versions
CN116366338A (en)
Inventor
郎宸
鲁玮克
樊兴华
童兆丰
薛锋
Current Assignee
Beijing ThreatBook Technology Co Ltd
Original Assignee
Beijing ThreatBook Technology Co Ltd
Application filed by Beijing ThreatBook Technology Co Ltd
Priority to CN202310334071.4A
Publication of CN116366338A
Application granted
Publication of CN116366338B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/418 Document matching, e.g. of document images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure provides a risk website identification method and apparatus, a computer device, and a storage medium. The method includes: acquiring a webpage screenshot of a target website, and performing trademark matching on the webpage screenshot to determine a target trademark matching the webpage screenshot; when the webpage screenshot does not match a target trademark, identifying a text to be detected from image elements contained in the target website; determining a risk level of the target website based on the text to be detected; and when the risk level is a first risk level, determining, based on domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website.

Description

Risk website identification method and device, computer equipment and storage medium
Technical Field
The disclosure relates to the technical field of computers, and in particular relates to a risk website identification method, a risk website identification device, computer equipment and a storage medium.
Background
Among the many types of risk websites are fake websites that disguise themselves as trusted websites in order to trick users into handing over key information; such websites are also called "phishing websites". The pages of a phishing website look very similar to those of the real trusted website, so a user may mistake the phishing website for the trusted one and submit key information such as accounts and passwords, which is then stolen. Accurate identification of phishing websites is therefore very important in network security scenarios.
When phishing websites are detected, trademark recognition is usually performed on a screenshot of the website, and a website in which a trademark is recognized is determined to be a phishing website. This approach cannot identify phishing pages that contain no trademark, which leaves a detection gap.
Disclosure of Invention
The embodiment of the disclosure at least provides a risk website identification method, a risk website identification device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a risk website identification method, including:
acquiring a webpage screenshot of a target website, and performing trademark matching on the webpage screenshot to determine a target trademark matching the webpage screenshot;
when the webpage screenshot does not match a target trademark, identifying a text to be detected from image elements contained in the target website;
determining a risk level of the target website based on the text to be detected;
and when the risk level is a first risk level, determining, based on domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website.
In an optional embodiment, the determining of the risk level of the target website based on the text to be detected includes:
performing risk keyword detection on the text to be detected based on a plurality of first risk keywords;
and when a first risk keyword matching the text to be detected is detected, determining the risk level of the target website to be the first risk level.
In an optional embodiment, the determining of whether the target website is a risk website disguised as a trusted website based on the domain name information of the target website and a preset domain name information base includes:
performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base;
when the domain name information is successfully matched to blacklist domain name information, performing whitelist matching on the domain name information of the target website based on a preset whitelist domain name information base;
and when the domain name information is successfully matched to any whitelist domain name information in the whitelist domain name information base, determining that the target website is a risk website disguised as a trusted website.
In an optional embodiment, the blacklist matching of the domain name information of the target website based on the preset blacklist domain name information base includes:
searching a domain name suffix library in the blacklist domain name information base for a target domain name suffix matching the domain name information;
searching an Internet Protocol (IP) address library in the blacklist domain name information base for a target IP address matching the IP address corresponding to the domain name information;
and when the target domain name suffix or the target IP address is found, determining that the domain name information is successfully matched to blacklist domain name information.
In an optional embodiment, the method further includes:
when neither the target domain name suffix nor the target IP address is found, acquiring a digital certificate corresponding to the domain name information;
searching a whitelist issuing authority library for a target issuing authority matching the issuing authority of the digital certificate;
and when the target issuing authority is not found, determining that the domain name information is successfully matched to blacklist domain name information.
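The certificate fallback described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the whitelist contents, the function names, and the choice to compare on the issuer's organizationName are all assumptions. The certificate dictionary layout is the one returned by Python's ssl.SSLSocket.getpeercert(); fetching the certificate itself is left out.

```python
from typing import Any, Mapping, Optional, Set

# Hypothetical whitelist of trusted issuing authorities (illustrative values only).
TRUSTED_ISSUERS: Set[str] = {"Example Trusted CA", "Another Trusted CA"}


def issuer_organization(cert: Mapping[str, Any]) -> Optional[str]:
    """Return the issuer's organizationName from a certificate dictionary.

    The "issuer" entry is a tuple of relative distinguished names, each of
    which is a tuple of (key, value) pairs, as in ssl.SSLSocket.getpeercert().
    """
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None


def blacklisted_by_certificate(cert: Mapping[str, Any],
                               trusted_issuers: Set[str] = TRUSTED_ISSUERS) -> bool:
    """True when the issuing authority is not in the whitelist.

    Per the text, failing to find the issuer in the whitelist counts as a
    successful blacklist match. Treating a missing issuer the same way is an
    assumption of this sketch.
    """
    issuer = issuer_organization(cert)
    return issuer is None or issuer not in trusted_issuers
```

In this sketch the check only runs after the suffix and IP lookups have both failed, so it acts as a last-resort signal rather than a primary one.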
In an optional embodiment, the whitelist matching of the domain name information of the target website based on the preset whitelist domain name information base includes:
determining the similarity between the domain name information and each piece of whitelist domain name information in the whitelist domain name information base;
and determining that whitelist domain name information whose similarity is higher than a preset threshold matches the domain name information.
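As a rough illustration of similarity-based whitelist matching, the sketch below compares a domain name with each whitelisted domain name and keeps those above a threshold. The text does not name a similarity measure; the use of difflib.SequenceMatcher, the 0.8 threshold, and the function name are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def similar_whitelist_domains(domain: str,
                              whitelist: Iterable[str],
                              threshold: float = 0.8) -> List[str]:
    """Return the whitelisted domain names whose similarity exceeds the threshold."""
    domain = domain.lower()
    matches = []
    for trusted in whitelist:
        # ratio() is a value in [0, 1]; 1.0 means the two strings are identical.
        similarity = SequenceMatcher(None, domain, trusted.lower()).ratio()
        if similarity > threshold:
            matches.append(trusted)
    return matches
```

A lookalike such as "examp1e-bank.com" (digit 1 in place of the letter l) would match a whitelisted "example-bank.com" under this measure, while an unrelated domain name would not.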
In an optional embodiment, the method further includes:
when the webpage screenshot matches the target trademark, determining the risk level of the target website to be a second risk level, wherein the second risk level is higher than the first risk level;
when the risk level is the second risk level, performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base;
and when the domain name information is successfully matched to blacklist domain name information, determining that the target website is a risk website disguised as a trusted website.
In an optional embodiment, before trademark matching is performed on the webpage screenshot, the method further includes:
acquiring the source code of a website to be detected;
and when a second risk keyword and/or form input function code is identified in the source code, taking the website to be detected as the target website.
In a second aspect, an embodiment of the present disclosure further provides a risk website identification apparatus, including:
an acquisition module, configured to acquire a webpage screenshot of a target website, and perform trademark matching on the webpage screenshot to determine a target trademark matching the webpage screenshot;
an identification module, configured to identify a text to be detected from image elements contained in the target website when the webpage screenshot does not match a target trademark;
a first determining module, configured to determine a risk level of the target website based on the text to be detected;
and a second determining module, configured to determine, when the risk level is a first risk level, whether the target website is a risk website disguised as a trusted website based on domain name information of the target website and a preset domain name information base.
In an alternative embodiment, the first determining module is specifically configured to:
performing risk keyword detection on the text to be detected based on a plurality of first risk keywords;
and under the condition that the first risk keyword matched with the text to be detected is detected, determining the risk level of the target website as a first risk level.
In an optional embodiment, when determining whether the target website is a risk website disguised as a trusted website based on the domain name information of the target website and a preset domain name information base, the second determining module is configured to:
perform blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base;
when the domain name information is successfully matched to blacklist domain name information, perform whitelist matching on the domain name information of the target website based on a preset whitelist domain name information base;
and when the domain name information is successfully matched to any whitelist domain name information in the whitelist domain name information base, determine that the target website is a risk website disguised as a trusted website.
In an optional embodiment, when performing blacklist matching on the domain name information of the target website based on the preset blacklist domain name information base, the second determining module is configured to:
search a domain name suffix library in the blacklist domain name information base for a target domain name suffix matching the domain name information;
search an Internet Protocol (IP) address library in the blacklist domain name information base for a target IP address matching the IP address corresponding to the domain name information;
and when the target domain name suffix or the target IP address is found, determine that the domain name information is successfully matched to blacklist domain name information.
In an optional embodiment, the second determining module is further configured to:
when neither the target domain name suffix nor the target IP address is found, acquire a digital certificate corresponding to the domain name information;
search a whitelist issuing authority library for a target issuing authority matching the issuing authority of the digital certificate;
and when the target issuing authority is not found, determine that the domain name information is successfully matched to blacklist domain name information.
In an optional embodiment, when performing whitelist matching on the domain name information of the target website based on the preset whitelist domain name information base, the second determining module is configured to:
determine the similarity between the domain name information and each piece of whitelist domain name information in the whitelist domain name information base;
and determine that whitelist domain name information whose similarity is higher than a preset threshold matches the domain name information.
In an optional embodiment, the second determining module is further configured to:
when the webpage screenshot matches the target trademark, determine the risk level of the target website to be a second risk level, wherein the second risk level is higher than the first risk level;
when the risk level is the second risk level, perform blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base;
and when the domain name information is successfully matched to blacklist domain name information, determine that the target website is a risk website disguised as a trusted website.
In an optional embodiment, before trademark matching is performed on the webpage screenshot, the acquisition module is further configured to:
acquire the source code of a website to be detected;
and when a second risk keyword and/or form input function code is identified in the source code, take the website to be detected as the target website.
In a third aspect, an optional embodiment of the disclosure further provides a computer device including a processor and a memory, where the memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the machine-readable instructions stored in the memory. When executed by the processor, the machine-readable instructions perform the steps of the first aspect or of any possible embodiment of the first aspect.
In a fourth aspect, an optional embodiment of the present disclosure further provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the steps of the first aspect or of any possible embodiment of the first aspect.
For a description of the effects of the risk website identification apparatus, the computer device, and the computer-readable storage medium, refer to the description of the risk website identification method; the details are not repeated here.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the aspects of the disclosure.
According to the risk website identification method and apparatus, the computer device, and the storage medium provided by the embodiments of the present disclosure, trademark matching is first performed on a webpage screenshot of a target website; when the webpage screenshot does not match a target trademark, a text to be detected is identified from image elements contained in the target website; the risk level of the target website is determined based on the text to be detected; and when the risk level is the first risk level, whether the target website is a risk website disguised as a trusted website is determined based on the domain name information of the target website and a preset domain name information base. Because the text to be detected is extracted from the image elements of a target website that does not match a target trademark, the target website can be assigned a risk level based on that text, and a website at the first risk level can be further examined using its domain name information and the preset domain name information base. Risk websites that contain no trademark can therefore be detected, which effectively improves the detection rate of risk websites.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments are briefly described below. These drawings are incorporated in and constitute a part of the specification; they show embodiments consistent with the present disclosure and, together with the description, serve to illustrate its technical solutions. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may derive other related drawings from them without inventive effort.
FIG. 1 illustrates a flow chart of a risk website identification method provided by some embodiments of the present disclosure;
FIG. 2 illustrates a flow chart of another risk website identification method provided by some embodiments of the present disclosure;
FIG. 3 illustrates a schematic diagram of a risk website identification apparatus provided by some embodiments of the present disclosure;
FIG. 4 illustrates a schematic diagram of a computer device provided by some embodiments of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the disclosed embodiments generally described and illustrated herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
Research shows that existing methods for identifying risk websites disguised as trusted websites (that is, phishing websites) usually only perform trademark matching on a webpage screenshot to judge whether the website contains a trademark. A website in which a trademark is recognized is directly determined to be a risk website disguised as the trusted website corresponding to that trademark; if no trademark is recognized, the website is not treated as a phishing website. However, some phishing websites use no trademark and instead mislead visitors through text and pictographic icons. Such phishing websites cannot be identified in this way, which exposes users to the risk of data leakage.
Based on the above research, the disclosure provides a risk website identification method. A text to be detected is extracted from the image elements of a target website that does not match a target trademark, and the target website is assigned a risk level based on that text. When the risk level is the first risk level, the target website is further examined based on its domain name information and a preset domain name information base. Risk websites that contain no trademark can thus be detected, which effectively improves the detection rate of risk websites.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
For the sake of understanding the present embodiment, first, a detailed description will be given of a risk website identification method disclosed in the present embodiment, where an execution subject of the risk website identification method provided in the present embodiment is generally a computer device with a certain computing capability, where the computer device includes, for example: a terminal device or server or other processing device. In some possible implementations, the risk website identification method may be implemented by way of a processor invoking computer readable instructions stored in a memory.
The risk website identification method provided by the embodiments of the present disclosure is described below, taking a terminal device as the execution subject.
Referring to fig. 1, a flowchart of a risk website identification method according to an embodiment of the present disclosure is shown, where the method includes steps S101 to S104, where:
S101, acquiring a webpage screenshot of a target website, and performing trademark matching on the webpage screenshot to determine a target trademark matching the webpage screenshot.
In this step, the target website may be a website that has been pre-screened. When risk detection needs to be performed on a website, some simple checks may first be run on it. For example, the domain name of the website may be checked against a whitelist: if the domain name is found in the whitelist, the website may be considered a trusted website; if it is not found, further detection may be performed.
To identify phishing websites, more specialized detection can also be performed. A phishing website usually disguises itself as a trusted website to defraud users of key information such as account numbers and passwords. To collect this information, it usually provides a form input function accompanied by guiding text, so that the user enters and submits the key information through the form. It is therefore possible to detect whether a website provides a form input function and whether it contains keywords related to the key information.
For example, when a risk website needs to be identified, the source code of the website to be detected may first be obtained and inspected to determine whether it contains a preset second risk keyword and/or form input function code. When a second risk keyword or form input function code is identified in the source code, the website to be detected may be taken as the target website, that is, a suspected phishing website.
To improve the accuracy of this preliminary screening, the website to be detected may instead be taken as the target website only when both a second risk keyword and form input function code are identified.
The second risk keywords may include keywords associated with the theft of key information, such as "password", "login", "payment", and "account".
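The pre-screening of source code described above might look like the following sketch. It assumes the source code is HTML and treats the presence of a form or input tag as "form input function code"; the keyword tuple, the require_both switch, and all names are hypothetical and only illustrate the two screening variants (either signal, or both).

```python
from html.parser import HTMLParser
from typing import Iterable

# Hypothetical second risk keywords, taken from the examples in the text.
SECOND_RISK_KEYWORDS = ("password", "login", "payment", "account")


class _FormInputDetector(HTMLParser):
    """Records whether the page contains a <form> or <input> element."""

    def __init__(self) -> None:
        super().__init__()
        self.has_form_input = False

    def handle_starttag(self, tag, attrs):
        if tag in ("form", "input"):
            self.has_form_input = True


def is_target_website(source_code: str,
                      keywords: Iterable[str] = SECOND_RISK_KEYWORDS,
                      require_both: bool = False) -> bool:
    """Decide whether a website should go on to trademark matching."""
    detector = _FormInputDetector()
    detector.feed(source_code)
    lowered = source_code.lower()
    has_keyword = any(keyword in lowered for keyword in keywords)
    if require_both:
        # Stricter variant: keyword and form input code must both be present.
        return has_keyword and detector.has_form_input
    return has_keyword or detector.has_form_input
```

Setting require_both=True trades recall for precision: fewer ordinary pages reach the more expensive screenshot and OCR stages.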
The webpage screenshot may be a screenshot of the first screen of the target website. When trademark matching is performed on the webpage screenshot, key points can be located in the screenshot, their features computed, and trademark matching performed according to those features. So that a trademark can still be matched after the picture has been scaled or rotated, the key point features may be extracted using the scale-invariant feature transform (SIFT) or a similar method; the extracted features may be the orientations of the key points. After the features are extracted, they can be matched one by one against the key points of the trademark pictures in a whitelist trademark library. If the matching degree reaches a certain threshold, the webpage screenshot is determined to match that trademark picture, and the matched trademark picture is taken as the target trademark.
The trademark may be a mark, emblem, or logo of a trusted entity collected in advance, and may also be referred to as a logo.
S102, when the webpage screenshot does not match a target trademark, identifying a text to be detected from image elements contained in the target website.
In general, when the webpage screenshot does not match a target trademark, the target website could simply be regarded as a trusted website. However, some phishing websites use no target trademark and instead disguise themselves as trusted websites by means of text, pictograms, and the like, so this would cause risk websites to be missed. Further detection is therefore required for a target website that does not match a target trademark.
In this step, the image elements of the target website may be acquired, and the text they contain may be recognized and used as the text to be detected. Any method of recognizing text from an image may be used, such as optical character recognition (OCR). The text may be recognized directly from the webpage screenshot of the target website, or image assets may be extracted from the source code or resource files of the target website and recognized individually.
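The threshold step of trademark matching can be illustrated with the sketch below. It does not implement SIFT: it assumes key point descriptors have already been extracted from the screenshot and from each trademark picture (for example by a SIFT implementation), and represents them as plain float vectors. The nearest-neighbour rule, the distance limit, the 0.6 threshold, and all names are assumptions for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence

Descriptor = Sequence[float]


def match_degree(logo_desc: List[Descriptor],
                 shot_desc: List[Descriptor],
                 max_distance: float = 0.5) -> float:
    """Fraction of trademark key points with a close counterpart in the screenshot."""
    if not logo_desc or not shot_desc:
        return 0.0
    matched = 0
    for descriptor in logo_desc:
        # Compare this trademark key point with every screenshot key point.
        nearest = min(math.dist(descriptor, other) for other in shot_desc)
        if nearest <= max_distance:
            matched += 1
    return matched / len(logo_desc)


def find_target_trademark(shot_desc: List[Descriptor],
                          logo_library: Dict[str, List[Descriptor]],
                          threshold: float = 0.6) -> Optional[str]:
    """Return the best-matching trademark name, or None if none reaches the threshold."""
    best_name, best_degree = None, 0.0
    for name, logo_desc in logo_library.items():
        degree = match_degree(logo_desc, shot_desc)
        if degree >= threshold and degree > best_degree:
            best_name, best_degree = name, degree
    return best_name
```

A return value of None corresponds to the case where the webpage screenshot does not match a target trademark, which is what triggers the text-based checks in the following steps.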
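One way to obtain the text to be detected from image assets is sketched below. It assumes the source code is HTML and collects the src of each img tag; the OCR step is passed in as a callable because the text does not prescribe an OCR engine. The ocr parameter and the function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable, List


class _ImageSourceCollector(HTMLParser):
    """Collects the src attribute of every <img> element in the page."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(source_code: str) -> List[str]:
    """Return the image asset locations referenced by the page source."""
    collector = _ImageSourceCollector()
    collector.feed(source_code)
    return collector.sources


def text_to_be_detected(source_code: str, ocr: Callable[[str], str]) -> str:
    """Run the supplied OCR callable on each image asset and join the results.

    `ocr` is a placeholder: it takes an image location and returns the text
    recognized in that image.
    """
    return " ".join(ocr(src) for src in image_sources(source_code)).strip()
```

In practice the callable would download each asset and hand it to whichever OCR engine is in use; recognizing text directly from the webpage screenshot is the alternative mentioned in the text.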
S103, determining the risk level of the target website based on the text to be detected.
After the text to be detected is obtained, risk keyword detection can be performed on it to judge whether it contains a first risk keyword.
The first risk keywords may cover a broader range than the second risk keywords, and may include keywords associated with phishing websites such as "finance", "securities", "social security", "identity card", "account number", and "password", as well as well-known brand names.
When a first risk keyword matching the text to be detected is detected, the risk level of the target website is determined to be the first risk level. For example, the first risk level may be a medium risk level. Above the first risk level there may be a second risk level, and below the second risk level there may be a third risk level, which may indicate low risk or no risk.
S104, when the risk level is the first risk level, determining, based on the domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website.
When the risk level of the target website is detected to be the first risk level, it can be preliminarily judged that the target website carries a certain degree of risk and may be a phishing website. Further detection can then be performed to determine whether it is in fact a phishing website.
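The keyword-based rating can be sketched as follows. The enum values, the keyword tuple, and the function name are hypothetical; in particular, the text only states that a keyword hit yields the first risk level, so returning the third level when nothing matches is an assumption of this sketch.

```python
from enum import IntEnum
from typing import Iterable


class RiskLevel(IntEnum):
    THIRD = 0   # low risk or no risk
    FIRST = 1   # medium risk: a first risk keyword was found in image text
    SECOND = 2  # higher risk: the screenshot matched a target trademark


# Hypothetical first risk keywords, taken from the examples in the text.
FIRST_RISK_KEYWORDS = ("finance", "securities", "social security",
                       "identity card", "account number", "password")


def risk_level_from_text(text: str,
                         keywords: Iterable[str] = FIRST_RISK_KEYWORDS) -> RiskLevel:
    """Rate a target website from the text recognized in its image elements."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in keywords):
        return RiskLevel.FIRST
    # Assumption: no keyword hit is treated as the third (low or no risk) level.
    return RiskLevel.THIRD
```

Using an ordered enum keeps the relation "second risk level is higher than first risk level" explicit in code.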
Specifically, further risk detection can be performed on the domain name information of the target website, so as to determine whether the domain name information of the target website is matched with the information in the domain name information base.
The domain name information base can be divided into a whitelist domain name information base and a blacklist domain name information base. If the domain name information matches data in the blacklist domain name information base, the target website can be directly treated as a risk website. If matching against the blacklist domain name information base fails, matching against the whitelist domain name information base can be performed: if the domain name information is highly similar to domain name information in the whitelist domain name information base, the domain name of the target website is imitating a whitelisted domain name, the target website is more likely to be a phishing website, and the result that the target website is a phishing website can be output.
For example, the blacklist matching may be performed on the domain name information of the target website based on a preset blacklist domain name information base, if the domain name information is successfully matched to the blacklist domain name information, the whitelist matching may be performed on the domain name information of the target website based on a preset whitelist domain name information base, and if the domain name information is successfully matched to any one of the whitelist domain name information in the whitelist domain name information base, the target website is determined to be a risk website disguised as a trusted website.
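The two-stage example above (blacklist matching first, then white list matching) can be sketched as below. The function name and the callable parameters are hypothetical, and returning `False` when blacklist matching fails is an assumption, because the example only states what happens when both stages succeed.

```python
from typing import Callable


def is_disguised_risk_site(
    domain: str,
    blacklist_match: Callable[[str], bool],
    whitelist_match: Callable[[str], bool],
) -> bool:
    """Flag the site only if it matches the blacklist and then imitates a white list entry."""
    if not blacklist_match(domain):
        # Assumption: without a blacklist match the site is not flagged here.
        return False
    # White list matching runs only after a successful blacklist match.
    return whitelist_match(domain)
```

Passing the two matchers in as callables keeps the decision logic separate from how each information base is stored or queried.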
In one possible implementation, the blacklist domain name information base may include a domain name suffix library and an Internet Protocol (IP) address library. The domain name suffix library may include domain name suffixes commonly used by phishing websites, such as .xyz, .tk, .ga, and .ml. Unlike common trusted domain name suffixes such as .com and .cn, the domain name suffixes in the domain name suffix library are often easier to obtain and use. The IP address library contains confirmed risk IP addresses, each of which is associated with a domain name. When a website is accessed, the IP address corresponding to the website domain name is generally queried through the Domain Name System (DNS), and the website is then accessed using that IP address.
Specifically, the top-level domain of the target website, namely the domain name suffix, can be extracted from the domain name information, and then the target domain name suffix matched with the domain name information is searched from the domain name suffix library in the blacklist domain name information library; searching an IP address corresponding to the domain name of the target website from a DNS database, and searching a target IP address matched with the IP address corresponding to the domain name information from an IP address library in a blacklist domain name information library.
And under the condition that the target domain name suffix or the target IP address is found, determining that the domain name information is successfully matched with the blacklist domain name information.
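The suffix and IP checks above can be sketched as follows. The suffix set reuses the examples named in the text; the IP address is a hypothetical placeholder from a documentation-only range, and the resolved addresses are passed in as an argument so that the sketch does not depend on a live DNS query.

```python
from typing import Iterable

# Suffixes taken from the examples in the text.
BLACKLIST_SUFFIXES = frozenset({"xyz", "tk", "ga", "ml"})
# Hypothetical risk IP address (documentation range), for illustration only.
BLACKLIST_IPS = frozenset({"203.0.113.7"})


def domain_suffix(domain: str) -> str:
    """Return the top-level domain (the last label) in lower case."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def blacklist_match(
    domain: str,
    resolved_ips: Iterable[str],
    suffixes=BLACKLIST_SUFFIXES,
    ips=BLACKLIST_IPS,
) -> bool:
    """Succeed if either the suffix or any resolved IP address is blacklisted."""
    if domain_suffix(domain) in suffixes:
        return True
    return any(ip in ips for ip in resolved_ips)
```

In practice the `resolved_ips` argument could be filled from a DNS lookup, for example the address list returned by Python's `socket.gethostbyname_ex`.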
Furthermore, the digital certificate of the target website can be authenticated. The digital certificate carries the digital signature of its issuing authority, and whether the issuing authority is a trusted authority can be judged by verifying the digital signature.
Specifically, a target issuing authority matching the issuing authority of the digital certificate can be searched for in a white list issuing authority library. If the target issuing authority is not found, the target website is at risk, and it can likewise be determined that the domain name information is successfully matched to blacklist domain name information.
The matching of the issuing authority of the digital certificate can be performed at the same time as the matching of the domain name suffix and the IP address, or after it. As long as the target website meets any one of the three conditions, it can be determined that the domain name information is successfully matched to blacklist domain name information, and white list domain name information matching is then performed.
In a possible implementation, the digital certificate corresponding to the domain name information is obtained and matched only in the case where neither the target domain name suffix nor the target IP address is found. In this way, the issuing authority of the digital certificate need not be matched when the target domain name suffix or the target IP address has already been matched.
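The issuing authority lookup can be sketched as below. The sketch assumes the certificate has already been retrieved and decoded into the dictionary layout that Python's `ssl.SSLSocket.getpeercert()` returns, and it compares only the issuer's organization name; it does not verify the digital signature itself. The trusted issuer name is hypothetical.

```python
from typing import Optional

# Hypothetical white list of trusted issuing authorities.
TRUSTED_ISSUERS = frozenset({"Example Trusted CA"})


def issuer_organization(cert: dict) -> Optional[str]:
    """Extract organizationName from a getpeercert()-style issuer field."""
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None


def issuer_is_untrusted(cert: dict, trusted=TRUSTED_ISSUERS) -> bool:
    """Treat a missing or unlisted issuing authority as a blacklist match."""
    organization = issuer_organization(cert)
    return organization is None or organization not in trusted
```

Treating a certificate with no readable issuer as untrusted is a design choice of this sketch, not something the text specifies.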
When the white list matching is carried out on the domain name information of the target website, the similarity between the domain name information and each white list domain name information in a white list domain name information base can be determined; and determining that the white list domain name information with the similarity higher than a preset threshold value is matched with the domain name information.
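The similarity-based white list matching can be sketched as follows. The text does not name a similarity measure, so the use of `difflib.SequenceMatcher` and the threshold of 0.8 are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two domain names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def whitelist_match(
    domain: str, whitelist: Iterable[str], threshold: float = 0.8
) -> List[str]:
    """Return the white list domains whose similarity exceeds the threshold."""
    return [trusted for trusted in whitelist if similarity(domain, trusted) > threshold]
```

With this measure, a look-alike such as `paypa1.com` scores 0.9 against `paypal.com`, so it is reported as imitating that white list entry. An edit-distance metric would be an equally plausible choice.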
In this way, by matching the first risk keyword, the blacklist domain name information, and the white list domain name information, accurate risk identification can be performed on a target website with medium risk, and whether the target website is a risk website disguised as a trusted website can be determined.
In the case where the page screenshot is successfully matched with the target trademark, the risk level of the target website can be determined to be a second risk level, such as a high risk level. In this case, the confidence that the target website is a phishing website is higher, so blacklist matching can be performed directly on the domain name information of the target website using the blacklist domain name information base, and, in the case where the domain name information is successfully matched to blacklist domain name information, the target website is directly determined to be a risk website disguised as a trusted website, which reduces the amount of computation required for risk identification.
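The difference between the two risk levels can be sketched as a single decision function. All parameter names are hypothetical, the inputs are reduced to precomputed booleans for brevity, and returning `False` when neither a trademark nor a first risk keyword is matched is an assumption.

```python
def identify_risk_site(
    trademark_matched: bool,
    first_keyword_matched: bool,
    blacklist_hit: bool,
    whitelist_hit: bool,
) -> bool:
    """Decide whether the site is a risk website disguised as a trusted website."""
    if trademark_matched:
        # Second (high) risk level: a blacklist match alone is sufficient.
        return blacklist_hit
    if first_keyword_matched:
        # First (medium) risk level: blacklist match, then white list match.
        return blacklist_hit and whitelist_hit
    # Assumption: lower risk levels are not examined further.
    return False
```

Skipping the white list similarity comparison on the high-risk branch is what saves computation, since that comparison runs against every entry in the white list domain name information base.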
Referring to fig. 2, a flowchart of another risk website identification method according to an embodiment of the disclosure is shown. First, webpage source code detection is performed on a website to be detected. In the case where a form function and a second keyword are identified, trademark logo image detection is performed on a screenshot of the website to be detected using a white list logo library. When a matching logo is detected, webpage image recognition is performed to recognize the characters in the webpage, and the recognized characters are matched against the related second keywords. When a second keyword is matched, risk detection is performed on the domain name, and whether the website to be detected is a phishing website is judged according to the risk detection result of the domain name.
According to the risk website identification method provided by the embodiment of the disclosure, trademark matching is first performed on a page screenshot of a target website. In the case where the page screenshot does not match the target trademark, a text to be detected is identified from image elements contained in the target website, and the risk level of the target website is determined based on the text to be detected. In the case where the risk level is the first risk level, whether the target website is a risk website disguised as a trusted website is determined based on the domain name information of the target website and a preset domain name information base.
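The source code detection step at the start of this flow can be sketched with Python's standard `html.parser` module. The keyword list is hypothetical, treating any `form` or `input` tag as form input function code is an assumption, and the two conditions are combined with "or", which is one reading of the "and/or" wording used elsewhere in this document.

```python
from html.parser import HTMLParser

# Hypothetical second risk keywords; the actual list is not given in the text.
SECOND_RISK_KEYWORDS = ("password", "login")


class FormInputFinder(HTMLParser):
    """Record whether the page source contains a form or input element."""

    def __init__(self) -> None:
        super().__init__()
        self.has_form_input = False

    def handle_starttag(self, tag, attrs):
        if tag in ("form", "input"):
            self.has_form_input = True


def is_target_website(source: str, keywords=SECOND_RISK_KEYWORDS) -> bool:
    """Select the page for further detection if it has form input or a risk keyword."""
    finder = FormInputFinder()
    finder.feed(source)
    lowered = source.lower()
    has_keyword = any(keyword in lowered for keyword in keywords)
    return finder.has_form_input or has_keyword
```

Pages that pass this inexpensive check would then proceed to the more costly screenshot and logo detection steps.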
According to the method and the device for detecting the risk websites, the text to be detected is extracted from the image elements of the target websites which are not matched with the target trademark, so that the risk grades of the target websites are rated based on the text to be detected in the image elements, and when the risk grades are the first risk grades, the target websites are further identified based on the domain name information of the target websites and a preset domain name information base, so that the risk websites which do not contain the trademark can be detected, and the detection rate of the risk websites is effectively improved.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
Based on the same inventive concept, the embodiments of the present disclosure further provide a risk website identification device corresponding to the risk website identification method, and since the principle of solving the problem by the device in the embodiments of the present disclosure is similar to that of the risk website identification method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and the repetition is omitted.
Referring to fig. 3, a schematic diagram of a risk website identification apparatus provided in an embodiment of the present disclosure is shown, where the apparatus includes:
the acquiring module 310 is configured to acquire a screenshot of a target website, perform trademark matching on the screenshot, and determine a target trademark matched with the screenshot;
the identifying module 320 is configured to identify a text to be detected from image elements included in the target website in the case where the page screenshot does not match the target trademark;
a first determining module 330, configured to determine a risk level of the target website based on the text to be detected;
the second determining module 340 is configured to determine, based on domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website, if the risk level is the first risk level.
In an alternative embodiment, the first determining module 330 is specifically configured to:
performing risk keyword detection on the text to be detected based on a plurality of first risk keywords;
and under the condition that the first risk keyword matched with the text to be detected is detected, determining the risk level of the target website as a first risk level.
In an alternative embodiment, the second determining module 340 is configured to, when determining, based on the domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website:
performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base;
performing white list matching on the domain name information of the target website based on a preset white list domain name information base under the condition that the domain name information is successfully matched with the black list domain name information;
and under the condition that the domain name information is successfully matched with any one of the white list domain name information in the white list domain name information library, determining that the target website is a risk website disguised as a trusted website.
In an alternative embodiment, the second determining module 340 is configured to, when performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base:
Searching a target domain name suffix matched with the domain name information from a domain name suffix library in the blacklist domain name information library;
searching a target IP address matched with the IP address corresponding to the domain name information from a network protocol IP address library in the blacklist domain name information library;
and under the condition that the target domain name suffix or the target IP address is found, determining that the domain name information is successfully matched with the blacklist domain name information.
In an alternative embodiment, the second determining module 340 is further configured to:
acquiring a digital certificate corresponding to the domain name information under the condition that the target domain name suffix or the target IP address is not found;
searching a white list issuing authority library for a target issuing authority matching the issuing authority of the digital certificate;
and in the case where the target issuing authority is not found, determining that the domain name information is successfully matched with the blacklist domain name information.
In an alternative embodiment, the second determining module 340 is configured to, when performing white list matching on the domain name information of the target website based on a preset white list domain name information base:
determining the similarity between the domain name information and each white list domain name information in the white list domain name information library;
And determining that the white list domain name information with similarity higher than a preset threshold value is matched with the domain name information.
In an alternative embodiment, the second determining module 340 is further configured to:
under the condition that the screenshot of the page is matched with the target trademark, determining the risk level of the target website as a second risk level; wherein the second risk level is higher than the first risk level;
performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base under the condition that the risk level is the second risk level;
and under the condition that the domain name information is successfully matched with the blacklist domain name information, determining that the target website is a risk website disguised as a trusted website.
In an alternative embodiment, before trademark identification is performed on the screenshot, the obtaining module 310 is further configured to:
acquiring a source code of a website to be detected;
and under the condition that the second risk keywords and/or form input function codes are identified in the source codes, the website to be detected is used as the target website.
The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.
The embodiment of the disclosure further provides a computer device, as shown in fig. 4, which is a schematic structural diagram of the computer device provided by the embodiment of the disclosure, including:
a processor 41 and a memory 42; the memory 42 stores machine readable instructions executable by the processor 41, the processor 41 being configured to execute the machine readable instructions stored in the memory 42, the machine readable instructions when executed by the processor 41, the processor 41 performing the steps of:
acquiring a webpage screenshot of a target website, performing trademark matching on the webpage screenshot, and determining a target trademark matched with the webpage screenshot;
identifying a text to be detected from image elements contained in the target website under the condition that the screenshot of the page is not matched with the target trademark;
determining the risk level of the target website based on the text to be detected;
and under the condition that the risk grade is the first risk grade, determining whether the target website is a risk website disguised as a trusted website or not based on the domain name information of the target website and a preset domain name information base.
In an alternative embodiment, in the step executed by the processor 41, the determining, based on the text to be detected, a risk level of the target website includes:
Performing risk keyword detection on the text to be detected based on a plurality of first risk keywords;
and under the condition that the first risk keyword matched with the text to be detected is detected, determining the risk level of the target website as a first risk level.
In an alternative embodiment, in the step executed by the processor 41, the determining, based on the domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website includes:
performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base;
performing white list matching on the domain name information of the target website based on a preset white list domain name information base under the condition that the domain name information is successfully matched with the black list domain name information;
and under the condition that the domain name information is successfully matched with any one of the white list domain name information in the white list domain name information library, determining that the target website is a risk website disguised as a trusted website.
In an optional embodiment, in the step executed by the processor 41, the performing blacklist matching on the domain name information of the target website based on the preset blacklist domain name information base includes:
Searching a target domain name suffix matched with the domain name information from a domain name suffix library in the blacklist domain name information library;
searching a target IP address matched with the IP address corresponding to the domain name information from a network protocol IP address library in the blacklist domain name information library;
and under the condition that the target domain name suffix or the target IP address is found, determining that the domain name information is successfully matched with the blacklist domain name information.
In an alternative embodiment, the steps executed by the processor 41 further include:
acquiring a digital certificate corresponding to the domain name information under the condition that the target domain name suffix or the target IP address is not found;
searching a white list issuing authority library for a target issuing authority matching the issuing authority of the digital certificate;
and in the case where the target issuing authority is not found, determining that the domain name information is successfully matched with the blacklist domain name information.
In an optional embodiment, in the step executed by the processor 41, the performing white list matching on the domain name information of the target website based on the preset white list domain name information base includes:
determining the similarity between the domain name information and each white list domain name information in the white list domain name information library;
And determining that the white list domain name information with similarity higher than a preset threshold value is matched with the domain name information.
In an alternative embodiment, the steps executed by the processor 41 further include:
under the condition that the screenshot of the page is matched with the target trademark, determining the risk level of the target website as a second risk level; wherein the second risk level is higher than the first risk level;
performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base under the condition that the risk level is the second risk level;
and under the condition that the domain name information is successfully matched with the blacklist domain name information, determining that the target website is a risk website disguised as a trusted website.
In an alternative embodiment, before trademark identification is performed on the screenshot, the steps executed by the processor 41 further include:
acquiring a source code of a website to be detected;
and under the condition that the second risk keywords and/or form input function codes are identified in the source codes, the website to be detected is used as the target website.
The memory 42 includes an internal memory 421 and an external memory 422. The internal memory 421 temporarily stores operation data of the processor 41 and data exchanged with the external memory 422, such as a hard disk, and the processor 41 exchanges data with the external memory 422 via the internal memory 421.
The specific execution process of the above instruction may refer to the steps of the risk website identification method described in the embodiments of the present disclosure, which is not described herein.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the risk website identification method described in the method embodiments above. Wherein the storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiments of the present disclosure further provide a computer program product, where the computer program product carries program code, where instructions included in the program code may be used to perform the steps of the risk website identification method described in the foregoing method embodiments, and specifically reference may be made to the foregoing method embodiments, which are not described herein.
Wherein the above-mentioned computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the above-described system and apparatus may refer to the corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present disclosure. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present disclosure, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and shall be included within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (8)

1. A risk website identification method, comprising:
acquiring a webpage screenshot of a target website, performing trademark matching on the webpage screenshot, and determining a target trademark matched with the webpage screenshot;
identifying a text to be detected from image elements contained in the target website under the condition that the screenshot of the page is not matched with the target trademark; wherein, when the text to be detected is identified from the image elements, the identification is performed on the page screenshot, or on an image asset extracted from a source code or a resource file of the target website;
performing risk keyword detection on the text to be detected based on a plurality of first risk keywords;
under the condition that a first risk keyword matched with the text to be detected is detected, determining the risk level of the target website as a first risk level;
determining whether the target website is a risk website disguised as a trusted website or not based on the domain name information of the target website and a preset domain name information base under the condition that the risk level is the first risk level;
before trademark identification is performed on the screenshot, the method further comprises:
Acquiring a source code of a website to be detected;
under the condition that a second risk keyword and/or a form input function code is identified in the source code, trademark logo image detection is carried out on the screenshot of the website to be detected by using a white list logo library;
when the matched logo is detected, carrying out webpage image recognition, recognizing characters in a webpage, carrying out matching by using related second keywords, carrying out risk detection of a domain name when the second keywords are matched, and judging whether a website to be detected is a target website according to a risk detection result of the domain name;
the determining whether the target website is a risk website disguised as a trusted website based on the domain name information of the target website and a preset domain name information base comprises the following steps:
performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base;
performing white list matching on the domain name information of the target website based on a preset white list domain name information base under the condition that the domain name information is successfully matched with the black list domain name information;
and under the condition that the domain name information is successfully matched with any one of the white list domain name information in the white list domain name information library, determining that the target website is a risk website disguised as a trusted website.
2. The method according to claim 1, wherein performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base includes:
searching a target domain name suffix matched with the domain name information from a domain name suffix library in the blacklist domain name information library;
searching a target IP address matched with the IP address corresponding to the domain name information from a network protocol IP address library in the blacklist domain name information library;
and under the condition that the target domain name suffix or the target IP address is found, determining that the domain name information is successfully matched with the blacklist domain name information.
3. The method according to claim 2, wherein the method further comprises:
acquiring a digital certificate corresponding to the domain name information under the condition that the target domain name suffix or the target IP address is not found;
searching a white list issuing authority library for a target issuing authority matching the issuing authority of the digital certificate;
and in the case where the target issuing authority is not found, determining that the domain name information is successfully matched with the blacklist domain name information.
4. The method according to claim 1, wherein performing white list matching on the domain name information of the target website based on a preset white list domain name information base includes:
Determining the similarity between the domain name information and each white list domain name information in the white list domain name information library;
and determining that the white list domain name information with similarity higher than a preset threshold value is matched with the domain name information.
5. The method according to claim 1, wherein the method further comprises:
under the condition that the screenshot of the page is matched with the target trademark, determining the risk level of the target website as a second risk level; wherein the second risk level is higher than the first risk level;
performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base under the condition that the risk level is the second risk level;
and under the condition that the domain name information is successfully matched with the blacklist domain name information, determining that the target website is a risk website disguised as a trusted website.
6. A risk website identification apparatus, comprising:
the acquisition module is used for acquiring a webpage screenshot of a target website, carrying out trademark matching on the webpage screenshot, and determining a target trademark matched with the webpage screenshot;
the identification module is used for identifying a text to be detected from image elements contained in the target website under the condition that the screenshot of the page is not matched with the target trademark; wherein, when the text to be detected is identified from the image elements, the identification is performed on the page screenshot, or on an image asset extracted from a source code or a resource file of the target website;
The first determining module is used for detecting the risk keywords of the text to be detected based on a plurality of first risk keywords; under the condition that a first risk keyword matched with the text to be detected is detected, determining the risk level of the target website as a first risk level;
the second determining module is used for determining whether the target website is a risk website disguised as a trusted website or not based on the domain name information of the target website and a preset domain name information base under the condition that the risk level is the first risk level;
before trademark identification is performed on the screenshot, the obtaining module is further configured to:
acquiring a source code of a website to be detected;
under the condition that a second risk keyword and/or a form input function code is identified in the source code, trademark logo image detection is carried out on the screenshot of the website to be detected by using a white list logo library;
when the matched logo is detected, carrying out webpage image recognition, recognizing characters in a webpage, carrying out matching by using related second keywords, carrying out risk detection of a domain name when the second keywords are matched, and judging whether a website to be detected is a target website according to a risk detection result of the domain name;
The second determining module is specifically configured to:
performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base;
performing white list matching on the domain name information of the target website based on a preset white list domain name information base under the condition that the domain name information is successfully matched with the black list domain name information;
and under the condition that the domain name information is successfully matched with any one of the white list domain name information in the white list domain name information library, determining that the target website is a risk website disguised as a trusted website.
7. A computer device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, the processor being configured to execute the machine-readable instructions stored in the memory, wherein the machine-readable instructions, when executed by the processor, perform the steps of the risk website identification method of any one of claims 1 to 5.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when run by a computer device, performs the steps of the risk website identification method according to any one of claims 1 to 5.
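As an informal illustration of the building blocks recited in the claims above, the following Python sketch shows one possible way to pre-screen source code, detect first risk keywords in recognized text, and match a domain name against a preset domain name information base. It is only a sketch under stated assumptions and is not the claimed implementation: the names `RiskLevel`, `needs_logo_detection`, `detect_risk_level` and `matches_domain_base` are hypothetical, keyword matching is assumed to be a case-insensitive substring test, form input function code is assumed to be recognizable by `<form>` or `<input>` tags, and domain matching is assumed to mean an exact or subdomain match. The claims do not specify any of these details.

```python
import re
from enum import Enum


class RiskLevel(Enum):
    """Hypothetical labels; the claims only name a 'first risk level'."""
    NONE = 0
    FIRST = 1


def needs_logo_detection(source_code, second_risk_keywords):
    """Pre-screen: True if the source contains a second risk keyword
    and/or what looks like form input code (assumed: <form>/<input> tags)."""
    lowered = source_code.lower()
    has_keyword = any(k.lower() in lowered for k in second_risk_keywords)
    has_form = re.search(r"<\s*(form|input)\b", source_code, re.IGNORECASE)
    return has_keyword or has_form is not None


def detect_risk_level(text, first_risk_keywords):
    """Return RiskLevel.FIRST if any first risk keyword occurs in the text
    (assumed: case-insensitive substring match)."""
    lowered = text.lower()
    for keyword in first_risk_keywords:
        if keyword.lower() in lowered:
            return RiskLevel.FIRST
    return RiskLevel.NONE


def matches_domain_base(domain, domain_base):
    """True if the domain equals, or is a subdomain of, any entry in the
    given blacklist or whitelist (assumed matching rule)."""
    domain = domain.lower().rstrip(".")
    for entry in domain_base:
        entry = entry.lower().rstrip(".")
        if domain == entry or domain.endswith("." + entry):
            return True
    return False


if __name__ == "__main__":
    page_source = '<html><form action="/login"><input name="pw"></form></html>'
    print(needs_logo_detection(page_source, ["bank"]))
    print(detect_risk_level("Please verify your account PASSWORD", ["password"]))
    print(matches_domain_base("login.bad-example.test", {"bad-example.test"}))
```

How the blacklist and whitelist match results are combined into the final determination is governed by the claim language itself; the sketch deliberately stops at the individual matching primitives.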
CN202310334071.4A 2023-03-30 2023-03-30 Risk website identification method and device, computer equipment and storage medium Active CN116366338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310334071.4A CN116366338B (en) 2023-03-30 2023-03-30 Risk website identification method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116366338A (en) 2023-06-30
CN116366338B (en) 2024-02-06

Family

ID=86919297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310334071.4A Active CN116366338B (en) 2023-03-30 2023-03-30 Risk website identification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116366338B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116723051B (en) * 2023-08-07 2023-10-27 北京安天网络安全技术有限公司 Domain name information generation method, device and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103152355A (en) * 2013-03-19 2013-06-12 北京奇虎科技有限公司 Method and system for promoting dangerous website and client device
CN104077396A (en) * 2014-07-01 2014-10-01 清华大学深圳研究生院 Method and device for detecting phishing website
CN106453351A (en) * 2016-10-31 2017-02-22 重庆邮电大学 Financial phishing webpage detection method based on Web page characteristics
CN108737423A (en) * 2018-05-24 2018-11-02 国家计算机网络与信息安全管理中心 Phishing website discovery method and system based on webpage key content similarity analysis
CN109274632A (en) * 2017-07-12 2019-01-25 中国移动通信集团广东有限公司 Website recognition method and device
CN110677384A (en) * 2019-08-26 2020-01-10 奇安信科技集团股份有限公司 Phishing website detection method and device, storage medium and electronic device
CN114650176A (en) * 2022-03-22 2022-06-21 深圳壹账通智能科技有限公司 Phishing website detection method and device, computer equipment and storage medium
CN115051817A (en) * 2022-01-05 2022-09-13 中国互联网络信息中心 Phishing detection method and system based on multi-mode fusion features

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10728250B2 (en) * 2017-07-31 2020-07-28 International Business Machines Corporation Managing a whitelist of internet domains
US11637863B2 (en) * 2020-04-03 2023-04-25 Paypal, Inc. Detection of user interface imitation

Also Published As

Publication number Publication date
CN116366338A (en) 2023-06-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant