CN116366338B - Risk website identification method and device, computer equipment and storage medium

Info

Publication number: CN116366338B
Application number: CN202310334071.4A
Authority: CN
Inventors: 郎宸; 鲁玮克; 樊兴华; 童兆丰; 薛锋
Original assignee: Beijing ThreatBook Technology Co Ltd
Current assignee: Beijing ThreatBook Technology Co Ltd
Priority date: 2023-03-30
Filing date: 2023-03-30
Publication date: 2024-02-06
Anticipated expiration: 2043-03-30
Also published as: CN116366338A

Abstract

The disclosure provides a risk website identification method, a risk website identification device, computer equipment and a storage medium, wherein the risk website identification method comprises the following steps: acquiring a webpage screenshot of a target website, performing trademark matching on the webpage screenshot, and determining a target trademark matched with the webpage screenshot; identifying a text to be detected from image elements contained in the target website under the condition that the screenshot of the page is not matched with the target trademark; determining the risk level of the target website based on the text to be detected; and under the condition that the risk grade is the first risk grade, determining whether the target website is a risk website disguised as a trusted website or not based on the domain name information of the target website and a preset domain name information base.

Description

Risk website identification method and device, computer equipment and storage medium

Technical Field

The disclosure relates to the technical field of computers, and in particular relates to a risk website identification method, a risk website identification device, computer equipment and a storage medium.

Background

Among the many types of risk websites are fake websites that disguise themselves as trusted websites in order to defraud users of key information; such websites are also called "phishing websites". The pages of a phishing website are visually very similar to those of the real trusted website, so users may mistake the phishing website for the trusted website and submit key information such as accounts and passwords, with the result that their private information is stolen. Accurate identification of phishing websites is therefore very important in network security scenarios.

When phishing websites are identified and detected, trademark identification is usually performed on a screenshot of the website, and a website in which a matching trademark is identified is determined to be a phishing website. However, this method cannot identify phishing pages that do not contain a trademark, so a detection gap exists.

Disclosure of Invention

The embodiment of the disclosure at least provides a risk website identification method, a risk website identification device, computer equipment and a storage medium.

In a first aspect, an embodiment of the present disclosure provides a risk website identification method, including:

acquiring a webpage screenshot of a target website, performing trademark matching on the webpage screenshot, and determining a target trademark matched with the webpage screenshot;

identifying a text to be detected from image elements contained in the target website under the condition that the screenshot of the page is not matched with the target trademark;

determining the risk level of the target website based on the text to be detected;

and under the condition that the risk grade is the first risk grade, determining whether the target website is a risk website disguised as a trusted website or not based on the domain name information of the target website and a preset domain name information base.

In an optional embodiment, the determining, based on the text to be detected, a risk level of the target website includes:

performing risk keyword detection on the text to be detected based on a plurality of first risk keywords;

and under the condition that the first risk keyword matched with the text to be detected is detected, determining the risk level of the target website as a first risk level.

In an optional embodiment, the determining whether the target website is a risk website disguised as a trusted website based on the domain name information of the target website and a preset domain name information base includes:

performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base;

performing white list matching on the domain name information of the target website based on a preset white list domain name information base under the condition that the domain name information is successfully matched with the black list domain name information;

and under the condition that the domain name information is successfully matched with any one of the white list domain name information in the white list domain name information library, determining that the target website is a risk website disguised as a trusted website.

In an optional implementation manner, the blacklist matching of the domain name information of the target website based on the preset blacklist domain name information base includes:

searching a target domain name suffix matched with the domain name information from a domain name suffix library in the blacklist domain name information library;

searching a target IP address matched with the IP address corresponding to the domain name information from a network protocol IP address library in the blacklist domain name information library;

and under the condition that the target domain name suffix or the target IP address is found, determining that the domain name information is successfully matched with the blacklist domain name information.

In an alternative embodiment, the method further comprises:

acquiring a digital certificate corresponding to the domain name information under the condition that the target domain name suffix or the target IP address is not found;

searching a target issuing mechanism matched with the issuing mechanism of the digital certificate from a white list issuing mechanism library;

and under the condition that the target issuing mechanism is not found, determining that the domain name information is successfully matched with the blacklist domain name information.

In an optional implementation manner, the performing white list matching on the domain name information of the target website based on the preset white list domain name information base includes:

determining the similarity between the domain name information and each white list domain name information in the white list domain name information library;

and determining that the white list domain name information with similarity higher than a preset threshold value is matched with the domain name information.

In an alternative embodiment, the method further comprises:

under the condition that the screenshot of the page is matched with the target trademark, determining the risk level of the target website as a second risk level; wherein the second risk level is higher than the first risk level;

performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base under the condition that the risk level is the second risk level;

and under the condition that the domain name information is successfully matched with the blacklist domain name information, determining that the target website is a risk website disguised as a trusted website.

In an alternative embodiment, before trademark identification is performed on the screenshot, the method further includes:

acquiring a source code of a website to be detected;

and under the condition that the second risk keywords and/or form input function codes are identified in the source codes, the website to be detected is used as the target website.

In a second aspect, an embodiment of the present disclosure further provides a risk website identification apparatus, including:

the acquisition module is used for acquiring a webpage screenshot of a target website, carrying out trademark matching on the webpage screenshot, and determining a target trademark matched with the webpage screenshot;

the identification module is used for identifying a text to be detected from image elements contained in the target website under the condition that the screenshot of the page is not matched with the target trademark;

the first determining module is used for determining the risk level of the target website based on the text to be detected;

and the second determining module is used for determining whether the target website is a risk website disguised as a trusted website or not based on the domain name information of the target website and a preset domain name information base under the condition that the risk level is the first risk level.

In an alternative embodiment, the first determining module is specifically configured to:

In an optional implementation manner, the second determining module is configured to, when determining, based on the domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website:

In an optional implementation manner, the second determining module is configured to, when performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base:

In an alternative embodiment, the second determining module is further configured to:

In an optional implementation manner, the second determining module is configured to, when performing white list matching on the domain name information of the target website based on a preset white list domain name information base:

In an alternative embodiment, before trademark identification is performed on the screenshot, the obtaining module is further configured to:

acquiring a source code of a website to be detected;

In a third aspect, an optional implementation manner of the disclosure further provides a computer device including a processor and a memory, where the memory stores machine-readable instructions executable by the processor, the processor is configured to execute the machine-readable instructions stored in the memory, and the machine-readable instructions, when executed by the processor, perform the steps in the first aspect or in any possible implementation manner of the first aspect.

In a fourth aspect, an alternative implementation of the present disclosure further provides a computer readable storage medium having stored thereon a computer program which when executed performs the steps of the first aspect, or any of the possible implementation manners of the first aspect.

The description of the effect of the risk website identification apparatus, the computer device, and the computer-readable storage medium is referred to the description of the risk website identification method, and is not repeated here.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the aspects of the disclosure.

According to the risk website identification method, the risk website identification device, the computer equipment and the storage medium, trademark matching is firstly carried out on a webpage screenshot of a target website, a text to be detected is identified from image elements contained in the target website under the condition that the webpage screenshot is not matched with the target trademark, the risk grade of the target website is determined based on the text to be detected, and whether the target website is a risk website camouflaged into a trusted website or not is determined based on domain name information of the target website and a preset domain name information library under the condition that the risk grade is the first risk grade. According to the method and the device for detecting the risk websites, the text to be detected is extracted from the image elements of the target websites which are not matched with the target trademark, so that the risk grades of the target websites are rated based on the text to be detected in the image elements, and when the risk grades are the first risk grades, the target websites are further identified based on the domain name information of the target websites and a preset domain name information base, so that the risk websites which do not contain the trademark can be detected, and the detection rate of the risk websites is effectively improved.

The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments are briefly described below. These drawings are incorporated in and constitute a part of the specification; they show embodiments consistent with the present disclosure and, together with the description, serve to illustrate the technical solutions of the present disclosure. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may derive other related drawings from them without inventive effort.

FIG. 1 illustrates a flow chart of a risk website identification method provided by some embodiments of the present disclosure;

FIG. 2 illustrates a flow chart of another risk website identification method provided by some embodiments of the present disclosure;

FIG. 3 illustrates a schematic diagram of a risk website identification apparatus provided by some embodiments of the present disclosure;

FIG. 4 illustrates a schematic diagram of a computer device provided by some embodiments of the present disclosure.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the disclosed embodiments generally described and illustrated herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.

According to research, in the existing identification method for the risk website (namely, phishing website) disguised as the trusted website, only the webpage screenshot of the website is usually subjected to trademark matching, whether the website contains a trademark or not is judged, the website with the recognized trademark can be directly determined to be the risk website disguised as the trusted website corresponding to the trademark, and if the trademark is not recognized, the website is not taken as the phishing website. However, some phishing websites may not use trademark, but confuse visitors through text and pictographic icons, in which case the phishing websites cannot be identified, resulting in the risk of data leakage for users.

Based on the above study, the disclosure provides a risk website identification method, by extracting a text to be detected from an image element of a target website which is not matched with a target trademark, so that a risk grade of the target website is graded based on the text to be detected in the image element, and when the risk grade is a first risk grade, the target website is further identified based on domain name information of the target website and a preset domain name information base, so that a risk website which does not contain the trademark can be detected, and the detection rate of the risk website is effectively improved.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

For the sake of understanding the present embodiment, first, a detailed description will be given of a risk website identification method disclosed in the present embodiment, where an execution subject of the risk website identification method provided in the present embodiment is generally a computer device with a certain computing capability, where the computer device includes, for example: a terminal device or server or other processing device. In some possible implementations, the risk website identification method may be implemented by way of a processor invoking computer readable instructions stored in a memory.

The risk website identification method provided by the embodiment of the present disclosure is described below by taking an execution subject as a terminal device.

Referring to fig. 1, a flowchart of a risk website identification method according to an embodiment of the present disclosure is shown, where the method includes steps S101 to S104, where:

S101, acquiring a webpage screenshot of a target website, performing trademark matching on the webpage screenshot, and determining a target trademark matched with the webpage screenshot.

In this step, the target website may be a pre-screened website, and when risk detection needs to be performed on a website, some simple tests may be performed on the website, for example, the domain name of the website may be detected, whether the domain name of the website is a domain name in a white list may be determined, if the domain name of the website is found in the white list, the website may be considered as a trusted website, and if the domain name cannot be found in the white list, further detection may be performed on the domain name.

Some specialized detection can be performed for the identification of phishing websites. For example, a phishing website usually disguises itself as a trusted website to defraud users of key information such as account numbers and passwords. To collect this key information, a phishing website usually provides a form input function accompanied by some guiding text, so that the user enters and submits key information such as the account number and password through the form input function. It is therefore possible to detect whether a website provides a form input function and whether keywords matching the key information are present.

For example, when a risk website needs to be identified, the source code of the website to be detected may be obtained first and checked for a preset keyword (the second risk keyword) and/or form input function code. When the second risk keyword or the form input function code is identified in the source code, the website to be detected may be taken as the target website, that is, a suspected phishing website.

To improve the accuracy of this preliminary screening, the website to be detected may be taken as the target website only when both the second risk keyword and the form input function code are identified.

The second risk keywords may include keywords related to spoofing key information such as passwords, logins, payments, accounts, and the like.
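The preliminary screening described above can be sketched as follows. This is a minimal illustration only: the keyword list, the function name `is_target_website`, and the choice to treat any `<form>` or `<input>` tag as "form input function code" are assumptions made for the example and are not specified by the disclosure.

```python
from html.parser import HTMLParser

# Hypothetical second risk keywords; the disclosure only names the categories.
SECOND_RISK_KEYWORDS = ("password", "login", "payment", "account")


class _FormDetector(HTMLParser):
    """Records whether the page source contains form input elements."""

    def __init__(self):
        super().__init__()
        self.has_form_input = False

    def handle_starttag(self, tag, attrs):
        # Assumption: a <form> or <input> tag counts as form input function code.
        if tag in ("form", "input"):
            self.has_form_input = True


def is_target_website(source_code, require_both=True):
    """Return True if the site should be treated as a suspected phishing target.

    require_both=True mirrors the stricter screening (keyword AND form code);
    require_both=False mirrors the looser screening (keyword OR form code).
    """
    detector = _FormDetector()
    detector.feed(source_code)
    # The raw source is searched, so keywords in attributes are also found.
    has_keyword = any(k in source_code.lower() for k in SECOND_RISK_KEYWORDS)
    if require_both:
        return has_keyword and detector.has_form_input
    return has_keyword or detector.has_form_input
```

The `require_both` flag corresponds to the two screening variants described above; the stricter variant trades some recall for fewer false candidates.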

The webpage screenshot may be a screenshot of the first screen of the target website. When trademark matching is performed on the webpage screenshot, key points may be located in the screenshot, features of the key points calculated, and trademark matching performed according to those features. So that a trademark can still be matched after the picture has been scaled or rotated, the features of the key points may be extracted using the Scale-Invariant Feature Transform (SIFT) or a similar method, and the extracted features may include the orientations of the key points. After the features are extracted, they can be matched one by one against the trademark pictures in the white list trademark library; if the matching degree for a trademark picture reaches a certain threshold, the webpage screenshot is determined to match that trademark picture, and the matched trademark picture is taken as the target trademark.

The trademark may be a mark, emblem or logo of a trusted object collected in advance, and may also be referred to as a logo.
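As a rough illustration of the threshold-based matching step, the sketch below assumes that key point descriptors have already been extracted (for example by SIFT) and are available as plain numeric vectors. It uses nearest-neighbour matching with a ratio test, which is one common way to match such descriptors; the disclosure does not state which matching rule is used, and the names `match_fraction`, `find_target_trademark`, and both threshold values are hypothetical.

```python
import math


def _distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_fraction(page_descriptors, logo_descriptors, ratio=0.75):
    """Fraction of logo descriptors with a distinctive nearest neighbour in the page."""
    if not logo_descriptors or len(page_descriptors) < 2:
        return 0.0
    good = 0
    for d in logo_descriptors:
        dists = sorted(_distance(d, p) for p in page_descriptors)
        # Ratio test: the best match must be clearly better than the runner-up.
        if dists[0] < ratio * dists[1]:
            good += 1
    return good / len(logo_descriptors)


def find_target_trademark(page_descriptors, logo_library, threshold=0.5):
    """Return the best-matching trademark name, or None if none reaches the threshold."""
    best_name, best_score = None, 0.0
    for name, descriptors in logo_library.items():
        score = match_fraction(page_descriptors, descriptors)
        if score >= threshold and score > best_score:
            best_name, best_score = name, score
    return best_name
```

A return value of `None` corresponds to the case in which the webpage screenshot is not matched with any target trademark.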

S102, identifying a text to be detected from image elements contained in the target website under the condition that the screenshot of the page is not matched with the target trademark.

In general, when the screenshot does not match the target trademark, the target website may be directly used as a trusted website, but some phishing websites may not use the target trademark but be disguised as trusted websites by means of text, pictograms, and the like, which may cause missed detection of the risk website, so that further detection is required for the target website which does not match the target trademark.

In this step, the image elements in the target website may be acquired, and the text contained in the image elements may be recognized and used as the text to be detected. Any method of recognizing text from an image may be used, such as Optical Character Recognition (OCR). When recognizing the text to be detected, recognition may be performed directly on the webpage screenshot of the website, or image assets may be extracted from the source code or resource files of the target website and recognition performed on those image assets.
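One way to collect image assets from the source code is sketched below. It only gathers the addresses of `<img>` elements so that a separate OCR step could process them; the OCR step itself is omitted, and the function name `extract_image_urls` and the restriction to `<img src=...>` are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _ImageCollector(HTMLParser):
    """Collects absolute URLs of <img> elements found in the page source."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page address.
                self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(source_code, base_url):
    """Return the image URLs referenced by the page, in document order."""
    collector = _ImageCollector(base_url)
    collector.feed(source_code)
    return collector.image_urls
```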

S103, determining the risk level of the target website based on the text to be detected.

After the text to be detected is obtained, risk keywords of the text to be detected can be detected, and whether the text to be detected contains the first risk keywords is judged.

The range of the first risk keywords may be greater than the range of the second risk keywords, and may include keywords related to phishing websites such as "finance," "securities," "social security," "identity cards," "account numbers," "passwords," and well-known brand names.

When a first risk keyword matching the text to be detected is detected, the risk level of the target website is determined to be the first risk level. For example, the first risk level may be a medium risk level. A second risk level may be provided above the first risk level, and a third risk level, which may be a low risk level or no risk, may be provided below them.
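The keyword-based grading in step S103 can be sketched as below. The keyword list is taken from the examples above, while the enum name `RiskLevel`, the function name, and the choice to return the third level when no keyword matches are assumptions; the disclosure does not state what level is assigned in that case.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    THIRD = 0   # low risk or no risk
    FIRST = 1   # medium risk
    SECOND = 2  # high risk (assigned elsewhere, when a trademark is matched)


# Example first risk keywords; a real list would also include well-known brand names.
FIRST_RISK_KEYWORDS = (
    "finance", "securities", "social security",
    "identity card", "account number", "password",
)


def risk_level_from_text(text_to_detect):
    """Grade the site from text recognized in its image elements."""
    lowered = text_to_detect.lower()
    if any(keyword in lowered for keyword in FIRST_RISK_KEYWORDS):
        return RiskLevel.FIRST
    # Assumption: no keyword hit is treated as the third (low or no risk) level.
    return RiskLevel.THIRD
```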

S104, determining whether the target website is a risk website disguised as a trusted website or not based on the domain name information of the target website and a preset domain name information base under the condition that the risk level is the first risk level.

When the risk level of the target website is determined to be the first risk level, it can be preliminarily judged that the target website carries a certain degree of risk and may be a phishing website, and further detection can be performed to determine whether it is in fact a phishing website.

Specifically, further risk detection can be performed on the domain name information of the target website, so as to determine whether the domain name information of the target website is matched with the information in the domain name information base.

The domain name information base can be divided into a white list domain name information base and a black list domain name information base. If the domain name information matches data in the black list domain name information base, the target website can be directly taken as a risk website; if matching against the black list domain name information base fails, matching against the white list domain name information base can be performed. If the similarity between the domain name information and domain name information in the white list domain name information base is high, the domain name information of the target website imitates the domain name information in the white list domain name information base, the probability that the target website is a phishing website is high, and the target website can be output as a phishing website.

For example, the blacklist matching may be performed on the domain name information of the target website based on a preset blacklist domain name information base, if the domain name information is successfully matched to the blacklist domain name information, the whitelist matching may be performed on the domain name information of the target website based on a preset whitelist domain name information base, and if the domain name information is successfully matched to any one of the whitelist domain name information in the whitelist domain name information base, the target website is determined to be a risk website disguised as a trusted website.

In one possible implementation, the blacklist domain name information repository may include a domain name suffix repository and a network protocol (Internet Protocol, IP) address repository. Some domain name suffixes commonly used by phishing websites, such as .xyz, .tk, .ga and .ml, may be included in the domain name suffix library. Compared with common trusted domain name suffixes such as .com and .cn, the domain name suffixes in the domain name suffix library are often easier to obtain and use. The IP address library contains confirmed risk IP addresses, each of which is associated with a domain name. When a website is accessed, the IP address corresponding to the website domain name is generally queried through the Domain Name System (DNS), and the website is accessed using that IP address.

Specifically, the top-level domain of the target website, namely the domain name suffix, can be extracted from the domain name information, and then the target domain name suffix matched with the domain name information is searched from the domain name suffix library in the blacklist domain name information library; searching an IP address corresponding to the domain name of the target website from a DNS database, and searching a target IP address matched with the IP address corresponding to the domain name information from an IP address library in a blacklist domain name information library.
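A minimal sketch of the suffix and IP address lookup is given below. The suffix set reuses the examples above; the IP address is a placeholder from a documentation address range, and the function names are hypothetical. The DNS query is left to the caller (for instance via the standard `socket.gethostbyname_ex` function), so the resolved addresses are passed in as an argument.

```python
# Example blacklist libraries; real libraries would be maintained separately.
BLACKLIST_SUFFIXES = {"xyz", "tk", "ga", "ml"}
BLACKLIST_IPS = {"203.0.113.7"}  # placeholder address, not a real risk IP


def domain_suffix(domain):
    """Return the top-level domain (the last label) of a domain name."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def matches_blacklist(domain, resolved_ips):
    """True if the suffix or any resolved IP address is found in the blacklist."""
    if domain_suffix(domain) in BLACKLIST_SUFFIXES:
        return True
    return any(ip in BLACKLIST_IPS for ip in resolved_ips)
```

Either hit alone is sufficient, which matches the "suffix or IP address" condition described for blacklist matching.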

Furthermore, the digital certificate of the target website can be authenticated, the digital certificate carries the digital signature of the issuing mechanism, and whether the issuing mechanism is a trusted mechanism can be judged by verifying the digital signature.

Specifically, a target issuing mechanism matched with the issuing mechanism of the digital certificate can be searched from a white list issuing mechanism library, if the target issuing mechanism is not searched, the risk of a target website is indicated, and the fact that the domain name information is successfully matched with the black list domain name information can also be determined.

The matching of the issuing mechanism of the digital certificate can be performed simultaneously with the step of matching the domain name suffix and the IP address, or can be performed after the step of matching the domain name suffix and the IP address, so long as the target website meets one of the three conditions, the successful matching of the domain name information to the blacklist domain name information can be determined, and the matching of the whitelist domain name information is performed.

In a possible implementation manner, the digital certificate corresponding to the domain name information is obtained and matched only when neither the target domain name suffix nor the target IP address is found. In this way, the issuing authority of the digital certificate need not be matched when the target domain name suffix or the target IP address has already been matched.
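The fallback order described here can be illustrated with a lazily evaluated certificate lookup. In this sketch `get_issuer` is a hypothetical callable supplied by the caller that fetches the certificate and returns the name of its issuing authority; the whitelist entry is a made-up example, and verification of the digital signature itself is not shown.

```python
# Hypothetical white list of trusted issuing authorities.
WHITELIST_ISSUERS = {"Example Trusted CA"}


def blacklist_match_with_certificate(suffix_or_ip_hit, get_issuer):
    """Combine the suffix/IP result with the certificate issuer check.

    get_issuer is only called when the suffix/IP lookup found nothing,
    so the certificate is not fetched unnecessarily.
    """
    if suffix_or_ip_hit:
        return True
    issuer = get_issuer()
    # An issuer missing from the white list counts as a blacklist match.
    return issuer not in WHITELIST_ISSUERS
```

Passing a callable rather than the issuer value keeps the certificate retrieval off the path whenever the cheaper suffix or IP address lookup has already succeeded.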

When the white list matching is carried out on the domain name information of the target website, the similarity between the domain name information and each white list domain name information in a white list domain name information base can be determined; and determining that the white list domain name information with the similarity higher than a preset threshold value is matched with the domain name information.
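The disclosure does not specify how similarity is computed. As one possible illustration, the sketch below uses the ratio from Python's standard `difflib.SequenceMatcher`; the whitelist entries, the threshold of 0.8, and the function name are assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical white list of trusted domain names.
WHITELIST_DOMAINS = ("example.com", "example-bank.com")


def similar_whitelist_domains(domain, threshold=0.8):
    """Return the white list domains whose similarity to `domain` exceeds the threshold."""
    domain = domain.lower()
    matches = []
    for trusted in WHITELIST_DOMAINS:
        score = SequenceMatcher(None, domain, trusted).ratio()
        if score > threshold:
            matches.append(trusted)
    return matches
```

For instance, a look-alike such as `examp1e.com` differs from `example.com` by a single character and would be reported as imitating it, whereas an unrelated domain would return an empty list.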

In this way, by matching the first risk keywords, the blacklist domain name information and the whitelist domain name information, accurate risk identification can be performed on a target website with medium risk, and it can be determined whether the target website is a risk website disguised as a trusted website.

When the webpage screenshot is successfully matched with the target trademark, the risk level of the target website can be determined to be the second risk level, such as a high risk level. In this case the confidence that the target website is a phishing website is relatively high, so the blacklist domain name information base can be used directly to perform blacklist matching on the domain name information of the target website, and when the domain name information is successfully matched with the blacklist domain name information, the target website is directly determined to be a risk website disguised as a trusted website, which reduces the amount of computation needed for risk identification.
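The two branches can be summarized in a small decision function. This sketch takes the results of the earlier checks as boolean inputs and follows the ordering in which whitelist matching is performed after a successful blacklist match; the function and parameter names are hypothetical, and treating a site with no first risk keyword as not flagged is an assumption.

```python
def is_disguised_risk_website(trademark_matched, has_first_risk_keyword,
                              blacklist_hit, whitelist_similar):
    """Combine the individual checks into a final phishing decision."""
    if trademark_matched:
        # Second (high) risk level: blacklist matching alone decides.
        return blacklist_hit
    if not has_first_risk_keyword:
        # Assumption: without a first risk keyword the site is not flagged.
        return False
    # First (medium) risk level: blacklist match, then whitelist similarity.
    return blacklist_hit and whitelist_similar
```

The high-risk branch skips the similarity computation entirely, which is where the reduction in computation comes from.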

Referring to fig. 2, a flowchart of another risk website identification method according to an embodiment of the disclosure is shown. In this method, webpage source code detection is first performed on the website to be detected. When a form function and a second keyword are identified, trademark logo image detection is performed on a screenshot of the website to be detected using a white list logo library. When matched logos are detected, webpage image recognition is performed to identify the characters in the webpage, and matching is performed using the related second keywords. When the second keyword is matched, risk detection is performed on the domain name, and whether the website to be detected is a phishing website is judged according to the risk detection result of the domain name.

According to the risk website identification method provided by the embodiment of the disclosure, trademark matching is firstly carried out on a webpage screenshot of a target website, a text to be detected is identified from image elements contained in the target website under the condition that the webpage screenshot is not matched with the target trademark, the risk grade of the target website is determined based on the text to be detected, and whether the target website is a risk website camouflaged into a trusted website or not is determined based on domain name information of the target website and a preset domain name information base under the condition that the risk grade is the first risk grade.

In the embodiments of the disclosure, the text to be detected is extracted from the image elements of a target website that is not matched with any target trademark, and the risk level of the target website is rated based on that text. When the risk level is the first risk level, the target website is further identified based on its domain name information and a preset domain name information base. In this way, risk websites that do not contain a trademark can also be detected, which effectively improves the detection rate of risk websites.
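The staged decision summarized above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the disclosure: trademark matching and text recognition are assumed to have been done elsewhere and are passed in as `matched_trademark` and `image_text`, the keyword and domain lists are hypothetical, and the order in which the blacklist and white list are consulted is an assumption.

```python
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class RiskLevel(Enum):
    NONE = 0
    FIRST = 1   # text recognized in image elements matched a first risk keyword
    SECOND = 2  # the page screenshot matched a target trademark


# Hypothetical data (assumptions); the source does not enumerate these.
FIRST_RISK_KEYWORDS = ("sign in", "password", "verification code")
BLACKLIST = frozenset({"phish.example"})
WHITELIST = frozenset({"example.com"})


def rate_risk(matched_trademark: Optional[str], image_text: str) -> RiskLevel:
    """Rate the site from the trademark match result and the recognized text."""
    if matched_trademark is not None:
        return RiskLevel.SECOND
    lowered = image_text.lower()
    if any(keyword in lowered for keyword in FIRST_RISK_KEYWORDS):
        return RiskLevel.FIRST
    return RiskLevel.NONE


def is_disguised_risk_site(
    url: str, matched_trademark: Optional[str], image_text: str
) -> Optional[bool]:
    """Return True (risk website), False (not flagged) or None (undecided here)."""
    level = rate_risk(matched_trademark, image_text)
    if level is RiskLevel.NONE:
        return False
    host = (urlsplit(url).hostname or "").lower()
    if host in BLACKLIST:
        return True
    # Consulting the white list only at the first risk level is an assumption.
    if level is RiskLevel.FIRST and host in WHITELIST:
        return False
    return None
```

Returning `None` for domains found in neither list keeps the sketch honest about what it does not cover: any further handling of such domains is outside what this example assumes.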

It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.

Based on the same inventive concept, the embodiments of the present disclosure further provide a risk website identification device corresponding to the risk website identification method. Since the principle by which the device solves the problem is similar to that of the risk website identification method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated description is omitted.

Referring to fig. 3, a schematic diagram of a risk website identification apparatus provided in an embodiment of the present disclosure is shown, where the apparatus includes:

the acquiring module 310 is configured to acquire a screenshot of a target website, perform trademark matching on the screenshot, and determine a target trademark matched with the screenshot;

the identifying module 320 is configured to identify a text to be detected from image elements contained in the target website in the case where the page screenshot does not match the target trademark;

a first determining module 330, configured to determine a risk level of the target website based on the text to be detected;

the second determining module 340 is configured to determine, based on domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website, if the risk level is the first risk level.

In an alternative embodiment, the first determining module 330 is specifically configured to:

In an alternative embodiment, the second determining module 340 is configured to, when determining, based on the domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website:

In an alternative embodiment, the second determining module 340 is configured to, when performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base:

In an alternative embodiment, the second determining module 340 is further configured to:

In an alternative embodiment, the second determining module 340 is configured to, when performing white list matching on the domain name information of the target website based on a preset white list domain name information base:

In an alternative embodiment, before trademark identification is performed on the screenshot, the obtaining module 310 is further configured to:

acquiring a source code of a website to be detected;

The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.

The embodiment of the disclosure further provides a computer device, as shown in fig. 4, which is a schematic structural diagram of the computer device provided by the embodiment of the disclosure, including:

a processor 41 and a memory 42; the memory 42 stores machine readable instructions executable by the processor 41, and the processor 41 is configured to execute the machine readable instructions stored in the memory 42; when the machine readable instructions are executed by the processor 41, the processor 41 performs the following steps:

In an alternative embodiment, in the step executed by the processor 41, the determining, based on the text to be detected, a risk level of the target website includes:

In an alternative embodiment, in the step executed by the processor 41, the determining, based on the domain name information of the target website and a preset domain name information base, whether the target website is a risk website disguised as a trusted website includes:

In an optional embodiment, in the step executed by the processor 41, the performing blacklist matching on the domain name information of the target website based on the preset blacklist domain name information base includes:

In an alternative embodiment, the steps executed by the processor 41 further include:

In an optional embodiment, in the step executed by the processor 41, the performing white list matching on the domain name information of the target website based on the preset white list domain name information base includes:

In an alternative embodiment, before trademark identification is performed on the screenshot, the steps executed by the processor 41 further include:

acquiring a source code of a website to be detected;

The memory 42 includes an internal memory 421 and an external memory 422. The internal memory 421 temporarily stores operation data of the processor 41 and data exchanged with the external memory 422, such as a hard disk; the processor 41 exchanges data with the external memory 422 via the internal memory 421.

The specific execution process of the above instruction may refer to the steps of the risk website identification method described in the embodiments of the present disclosure, which is not described herein.

The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the risk website identification method described in the method embodiments above. The storage medium may be a volatile or nonvolatile computer readable storage medium.

The embodiments of the present disclosure further provide a computer program product, where the computer program product carries program code, where instructions included in the program code may be used to perform the steps of the risk website identification method described in the foregoing method embodiments, and specifically reference may be made to the foregoing method embodiments, which are not described herein.

The above-mentioned computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).

It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the above-described system and apparatus may refer to the corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interface, device or unit, and may be in electrical, mechanical or other form.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer readable storage medium. Based on such understanding, the technical solution of the present disclosure in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method described in the embodiments of the present disclosure. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.

Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present disclosure, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and shall be included within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims

1. A risk website identification method, comprising:

identifying a text to be detected from image elements contained in the target website under the condition that the page screenshot is not matched with the target trademark; wherein, when the text to be detected is identified from the image elements, the identification is performed on the page screenshot, or on an image asset extracted from a source code or a resource file of the target website;

under the condition that a first risk keyword matched with the text to be detected is detected, determining the risk level of the target website as a first risk level;

determining whether the target website is a risk website disguised as a trusted website or not based on the domain name information of the target website and a preset domain name information base under the condition that the risk level is the first risk level;

before trademark identification is performed on the screenshot, the method further comprises:

acquiring a source code of a website to be detected;

under the condition that a second risk keyword and/or a form input function code is identified in the source code, trademark logo image detection is carried out on the screenshot of the website to be detected by using a white list logo library;

when the matched logo is detected, carrying out webpage image recognition, recognizing characters in a webpage, carrying out matching by using related second keywords, carrying out risk detection of a domain name when the second keywords are matched, and judging whether a website to be detected is a target website according to a risk detection result of the domain name;

the determining whether the target website is a risk website disguised as a trusted website based on the domain name information of the target website and a preset domain name information base comprises the following steps:

2. The method according to claim 1, wherein performing blacklist matching on the domain name information of the target website based on a preset blacklist domain name information base includes:

3. The method according to claim 2, wherein the method further comprises:

4. The method according to claim 1, wherein performing white list matching on the domain name information of the target website based on a preset white list domain name information base includes:

5. The method according to claim 1, wherein the method further comprises:

6. A risk website identification apparatus, comprising:

the identification module is used for identifying a text to be detected from image elements contained in the target website under the condition that the page screenshot is not matched with the target trademark; wherein, when the text to be detected is identified from the image elements, the identification is performed on the page screenshot, or on an image asset extracted from a source code or a resource file of the target website;

the first determining module is used for performing risk keyword detection on the text to be detected based on a plurality of first risk keywords; under the condition that a first risk keyword matched with the text to be detected is detected, determining the risk level of the target website as a first risk level;

the second determining module is used for determining whether the target website is a risk website disguised as a trusted website or not based on the domain name information of the target website and a preset domain name information base under the condition that the risk level is the first risk level;

before trademark identification is performed on the screenshot, the obtaining module is further configured to:

acquiring a source code of a website to be detected;

the second determining module is specifically configured to:

7. A computer device, comprising: a processor, a memory storing machine-readable instructions executable by the processor for executing the machine-readable instructions stored in the memory, which when executed by the processor, perform the steps of the risk website identification method of any one of claims 1 to 5.

8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when run by a computer device, performs the steps of the risk website identification method according to any one of claims 1 to 5.