CN114124564A - Counterfeit website detection method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114124564A
Authority
CN
China
Prior art keywords
website
counterfeit
page
screenshot
counterfeited
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111464708.9A
Other languages
Chinese (zh)
Other versions
CN114124564B (en
Inventor
江军
王炜
陈世武
杨渝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202111464708.9A priority Critical patent/CN114124564B/en
Publication of CN114124564A publication Critical patent/CN114124564A/en
Application granted granted Critical
Publication of CN114124564B publication Critical patent/CN114124564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the application provide a counterfeit website detection method and device, electronic equipment and a storage medium, and relate to the technical field of artificial intelligence. The method comprises: identifying key feature areas of counterfeited websites by using a preset counterfeited website image database to generate a mask and key images; constructing a counterfeit website fingerprint library by using the key images and a preset counterfeit website detection model; and detecting a website to be detected by using the mask, the counterfeit website fingerprint library and the counterfeit website detection model to determine whether the website to be detected is a counterfeit website. Detection is performed automatically by combining a page image key feature area identification technique with a deep learning algorithm, which improves detection accuracy and stability and addresses the reliance on manual detection and the low accuracy of existing methods.

Description

Counterfeit website detection method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a counterfeit website detection method and device, electronic equipment and a storage medium.
Background
Most traditional counterfeit website detection technologies rely on means such as manual detection, blacklist detection and domain name feature detection; their detection accuracy is not high, and they require a large investment of human resources. Other methods compare the similarity of two web page icons based on image color and image texture, but they extract picture features in an overly simple way or by purely numerical calculation, so the extracted features are low-level and simple, and the accuracy of the detection result is correspondingly low.
Disclosure of Invention
The embodiments of the application aim to provide a counterfeit website detection method, a counterfeit website detection device, an electronic device and a storage medium, in which a page image key feature region identification technique is combined with a deep learning algorithm to perform detection automatically, improving detection accuracy and stability and addressing the need for manual detection and the low accuracy of existing methods.
The embodiment of the application provides a detection method for counterfeit websites, which comprises the following steps:
identifying a key feature area of the counterfeited website by using a preset counterfeited website image database to generate a mask and a key image;
constructing a counterfeit website fingerprint library by using the key image and a preset counterfeit website detection model;
and detecting the website to be detected by using the mask, the counterfeit website fingerprint library and the counterfeit website detection model so as to determine whether the website to be detected is a counterfeit website.
In the implementation process, establishing the counterfeited website image database greatly reduces the influence on the detection model of special conditions such as page updates and abnormal errors of the counterfeited website; adopting the page image key feature area identification technique reduces the interference of frequently changing dynamic page content with website analysis, lets the website detection model focus more on key features, and improves the robustness and accuracy of the model; and introducing deep learning into the counterfeit website detection scenario addresses the low detection accuracy of traditional counterfeit website detection technologies.
Further, before the step of identifying key feature regions of the counterfeited website by using the preset counterfeited website image database, the method further comprises the following steps of:
acquiring website domain names of a plurality of websites and removing duplication to generate a domain name list;
screening page addresses corresponding to the website domain names;
acquiring a website page corresponding to the page address, and performing screenshot on the website page to obtain a page screenshot;
and constructing a counterfeited website image database by using the domain name list, the page address and the page screenshot, and regularly updating the page screenshot.
In the implementation process, the domain name, the page address and the page image of the website which is possibly counterfeited form a counterfeited website image database, and a powerful support is provided for detecting the counterfeited website by establishing the counterfeited website database with wide coverage, authority and accuracy.
Further, the identifying key feature regions of the counterfeited website by using a preset counterfeited website image database to generate a mask and a key image comprises:
acquiring a preset number of page screenshots from the counterfeited website image database;
acquiring the color value at each pixel point of each page screenshot;
when the number of occurrences of the color value mode (the most frequent color value at a pixel position across the screenshots) is greater than a first preset threshold value, recording a set formed by the corresponding pixel coordinates and the color value mode, wherein the set is represented as:
A={(x,y,clr)|0≤x<W,0≤y<H};
wherein A represents the set, (x, y) represents offset coordinates of any pixel point relative to the lower left corner of the page screenshot, clr represents the color value mode, W represents the width of the page screenshot, and H represents the height of the page screenshot;
calculating the distance between the offset coordinates of any two pixel points in the set;
calculating the number of neighbor points of each pixel point according to the distance, and if the distance between the two pixel points is smaller than a second preset threshold, the two pixel points are the neighbor points;
if the number of the neighbor points is less than a third preset threshold value, deleting the pixel points from the set;
forming a mask by the elements in the set which are not deleted;
and generating a key image corresponding to each page address by using the mask.
In the implementation process, a statistical method is adopted, and an image mask technology is provided so as to generate a key image by using a mask.
Further, the generating a key image corresponding to each page address by using the mask includes:
and filling color values corresponding to the offset coordinates at offset coordinate positions in the mask on a blank image to generate a key image, wherein the size of the blank image is the same as that of the page screenshot.
In the implementation process, the mask is used for generating the key image of the counterfeited website, so that the interference on the detection of the counterfeited website caused by the conditions of dynamic webpage content, website page updating, website page fault and the like is reduced, and the detection accuracy of the counterfeited website is improved.
Further, before the step of constructing a counterfeit website fingerprint library by using the key image and a preset counterfeit website detection model, the method further comprises the steps of:
acquiring, by using preset counterfeit website blacklist data, a first website page screenshot of a counterfeit website and a second website page screenshot of the corresponding counterfeited website, to generate a training data set;
inputting the training data set into a ResNeXt-101 model for model training;
optimizing the model, wherein the optimization target is expressed as:
Figure BDA0003390862330000031
wherein 0 < i ≤ |T|; |T| is the number of pairs of first website page screenshots and second website page screenshots contained in the training data set; c denotes the parameters of the ResNeXt-101 model; c* is the optimal solution of c; and FR_i and FF_i are respectively the first output data and the second output data corresponding to the i-th pair of the first website page screenshot and the second website page screenshot.
In the implementation process, the model is trained and optimized, the counterfeit website detection model is constructed, automatic detection is realized, and the accuracy of the detection result is improved.
Further, the establishing of the counterfeit website fingerprint library by using the key image and a preset counterfeit website detection model includes:
inputting each key image into the counterfeit website detection model to obtain a data output;
and forming the counterfeit website fingerprint library from the data outputs.
In the implementation process, a counterfeit website fingerprint library is constructed by using the key images and the counterfeit website detection model, and data support for website detection is provided.
Further, the detecting the website to be detected by using the mask, the counterfeit website fingerprint library and the counterfeit website detection model to determine whether the website to be detected is a counterfeit website includes:
searching whether the website domain name of the website to be detected exists in a domain name list of the counterfeited website image database or not;
if not, acquiring a first website screenshot of the website to be detected;
based on the mask, retaining in the first website screenshot only the color values of the pixels at the offset coordinates contained in the mask, to generate a second website screenshot;
inputting the second website screenshot into the counterfeit website detection model to obtain an output result;
respectively calculating the Euclidean distance between the output result and each data output in the counterfeit website fingerprint database;
if the Euclidean distance is smaller than or equal to the maximum Euclidean distance, judging that the website to be detected is a counterfeit website; the maximum Euclidean distance is the maximum value of the Euclidean distances between the first output data and the second output data.
In the implementation process, the key image and the counterfeit website detection model are combined, so that the website detection model focuses on key features more, and the robustness and the accuracy of the model are improved.
The embodiment of the present application further provides a counterfeit website detection device, the device includes:
the key image generation module is used for identifying a key feature area of the counterfeited website by utilizing a preset counterfeited website image database so as to generate a mask and a key image;
the fingerprint library construction module is used for constructing a counterfeit website fingerprint library by using the key image and a preset counterfeit website detection model;
and the detection module is used for detecting the website to be detected by utilizing the mask, the counterfeit website fingerprint library and the counterfeit website detection model so as to determine whether the website to be detected is a counterfeit website.
In the implementation process, establishing the counterfeited website image database greatly reduces the influence on the detection model of special conditions such as page updates and abnormal errors of the counterfeited website; adopting the page image key feature area identification technique reduces the interference of frequently changing dynamic page content with website analysis, lets the website detection model focus more on key features, and improves the robustness and accuracy of the model; and introducing deep learning into the counterfeit website detection scenario addresses the low detection accuracy of traditional counterfeit website detection technologies.
An embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to enable the electronic device to execute any one of the foregoing counterfeit website detection methods.
An embodiment of the present application further provides a readable storage medium, where computer program instructions are stored, and when the computer program instructions are read and executed by a processor, the method for detecting a counterfeit website is performed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of a counterfeit website detection method according to an embodiment of the present application;
Fig. 2 is a block diagram of a specific implementation of the counterfeit website detection method according to an embodiment of the present application;
Fig. 3 is a flowchart of the construction of the counterfeited website image database according to an embodiment of the present application;
Fig. 4 is a flowchart of mask and key image generation according to an embodiment of the present application;
Fig. 5 is a flowchart of the construction of the counterfeit website detection model according to an embodiment of the present application;
Fig. 6 is a flowchart of the construction of the counterfeit website fingerprint library according to an embodiment of the present application;
Fig. 7 is a flowchart of detecting a website to be detected according to an embodiment of the present application;
Fig. 8 is a block diagram of a counterfeit website detection device according to an embodiment of the present application;
Fig. 9 is a block diagram of another counterfeit website detection device according to an embodiment of the present application.
Reference numerals:
100-a key image generation module; 110-a database building module; 200-fingerprint database construction module; 210-a model building module; 300-detection module.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for detecting a counterfeit website according to an embodiment of the present disclosure. The method specifically comprises the following steps:
step S100: identifying a key feature area of the counterfeited website by using a preset counterfeited website image database to generate a mask and a key image;
Fig. 2 shows a block diagram of a specific implementation of the counterfeit website detection method. Before step S100, the counterfeited website image database needs to be constructed; Fig. 3 shows the construction flowchart, which specifically includes the following steps:
step S111: acquiring website domain names of a plurality of websites and removing duplication to generate a domain name list;
step S112: screening page addresses corresponding to the website domain names;
step S113: acquiring a website page corresponding to the page address, and performing screenshot on the website page to obtain a page screenshot;
step S114: and constructing a counterfeited website image database by using the domain name list, the page address and the page screenshot, and regularly updating the page screenshot.
Specifically, the domain names of websites that are likely to be counterfeited are acquired: the top 2000 domain names on the ranking list are obtained by visiting the www.alexa.cn website, and the top 500 domain names under Global on the Top sites page are obtained by visiting the www.alexa.com website. These 2500 domain names are deduplicated, and the deduplicated domain names together form the domain name list L of websites that are likely to be counterfeited, which is stored in the counterfeited website image database. The number of such websites is not limited here, but enough of them need to be acquired to provide sufficient data support, so that a counterfeited website database with wide coverage and high accuracy can be established to support the detection of counterfeit websites.
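The merge-and-deduplicate step can be sketched as follows. This is a minimal illustration only: the function name `build_domain_list`, the case normalisation and the sample domains are assumptions, not part of the application.

```python
def build_domain_list(*sources):
    """Merge several ranked domain lists and drop duplicates, keeping first-seen order."""
    merged = []
    for source in sources:
        # Normalise case and whitespace so "Baidu.com " and "baidu.com" compare equal
        # (an assumption; the text does not say how duplicates are recognised).
        merged.extend(d.strip().lower() for d in source if d.strip())
    # dict.fromkeys keeps insertion order and discards repeated keys.
    return list(dict.fromkeys(merged))


# Hypothetical sample data standing in for the two ranking lists.
cn_top = ["taobao.com", "baidu.com", "qq.com"]
global_top = ["google.com", "Baidu.com", "youtube.com"]
L = build_domain_list(cn_top, global_top)
```

Keeping first-seen order means the merged list still reflects the ranking of whichever source is passed first.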
The page addresses of the websites that are likely to be counterfeited are then screened: for each website domain name D in the domain name list L, the website home page URL, the user login page URL and the transaction payment page URL of D are manually accessed, screened and confirmed, and any of these URLs that exist are stored in the counterfeited website image database. For example, the Taobao website home page URL is https://www.taobao.com/, the user login page URL is https:// login.
Page images of the websites that are likely to be counterfeited are captured to obtain page screenshots: a web crawler accesses each URL address in the counterfeited website image database once a day, takes a screenshot of the website page that the URL address points to, and stores the page screenshot in the counterfeited website image database. For the page corresponding to each URL, the database keeps only the page screenshots of the latest 100 days: on a first-in, first-out basis, the newest page screenshot of the current day is added and the page screenshot from 100 days ago is deleted. The update period of the page screenshots is not limited here.
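The first-in, first-out retention rule can be illustrated with a bounded queue per URL. This is a sketch under assumptions: the class `ScreenshotStore` and the string placeholders for screenshots are hypothetical, and the crawling and screenshot capture themselves are not shown.

```python
from collections import deque

RETENTION_DAYS = 100  # value taken from the example in the text


class ScreenshotStore:
    """Keeps only the most recent `retention` screenshots for each URL."""

    def __init__(self, retention=RETENTION_DAYS):
        self.retention = retention
        self._by_url = {}

    def add(self, url, screenshot):
        # A deque with maxlen discards its oldest entry when a new one
        # is appended to a full queue, which gives FIFO behaviour.
        queue = self._by_url.setdefault(url, deque(maxlen=self.retention))
        queue.append(screenshot)

    def screenshots(self, url):
        return list(self._by_url.get(url, ()))


store = ScreenshotStore()
for day in range(120):
    store.add("https://www.taobao.com/", f"shot-day-{day}")
kept = store.screenshots("https://www.taobao.com/")
```

After 120 daily additions only the last 100 entries remain, so the oldest 20 have been dropped without any explicit deletion step.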
The counterfeited website image database is established by a statistical method, which greatly reduces the influence on the counterfeit website detection model of special conditions such as page updates and abnormal errors of the counterfeited website, and thereby improves the accuracy of the model's detection results.
Fig. 4 shows the flowchart of mask and key image generation. On the basis of the established counterfeited website image database, the key feature areas of the counterfeited website are identified in three stages: image comparison (steps S101-S103), discrete point deletion (steps S104-S107) and key image generation (step S108), specifically:
step S101: acquiring a preset number of page screenshots from the counterfeited website image database;
step S102: acquiring the color value at each pixel point of each page screenshot;
step S103: when the number of occurrences of the color value mode (the most frequent color value at a pixel position across the screenshots) is greater than a first preset threshold value, recording a set formed by the corresponding pixel coordinates and the color value mode, wherein the set is represented as:
A={(x,y,clr)|0≤x<W,0≤y<H};
wherein A represents the set, (x, y) represents offset coordinates of any pixel point relative to the lower left corner of the page screenshot, clr represents the color value mode, W represents the width of the page screenshot, and H represents the height of the page screenshot;
Illustratively, for each URL in the counterfeited website image database, the 100 page screenshots of that URL are read. Let (x, y) denote the offset coordinates of any pixel relative to the lower left corner of the page screenshot, W the width of the page screenshot and H the height of the page screenshot, where 0 ≤ x < W, 0 ≤ y < H, and x and y are both non-negative integers.
Over the whole page screenshot range, the color values of the 100 screenshots at pixel (x, y) are counted pixel by pixel. The coordinates (x, y) and the color value mode clr are recorded if and only if the number of occurrences of the mode among the 100 color values is greater than a first preset threshold S, where 50 < S ≤ 100. The recorded offset coordinates and color value modes together form the set A = {(x, y, clr) | 0 ≤ x < W, 0 ≤ y < H}.
For example, with W = 1920, H = 1080 and S = 90, the RGB color values of the 100 screenshots at pixel (x, y) are counted pixel by pixel over the whole page screenshot range; the coordinates (x, y) and the RGB color value mode clr (the color value mode is the pixel value that occurs most often) are recorded if and only if the mode occurs more than 90 times among the 100 RGB color values, and the recorded coordinates and RGB color value modes together form the set A = {(x, y, clr) | 0 ≤ x < W, 0 ≤ y < H}.
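The image comparison stage (steps S101-S103) can be sketched in plain Python as follows. The function name `stable_pixels` and the nested-list image representation are assumptions made for illustration; row 0 of each image is simply treated as y = 0, and a real implementation would more likely operate on image arrays.

```python
from collections import Counter


def stable_pixels(screenshots, threshold):
    """Return A = {(x, y, clr)} for pixels whose modal colour occurs more
    than `threshold` times across the screenshots.

    Each screenshot is a list of rows, and rows[y][x] is a hashable colour
    value such as an (R, G, B) tuple. All screenshots share one size.
    """
    height = len(screenshots[0])
    width = len(screenshots[0][0])
    result = set()
    for y in range(height):
        for x in range(width):
            counts = Counter(shot[y][x] for shot in screenshots)
            clr, freq = counts.most_common(1)[0]  # the mode and its count
            if freq > threshold:
                result.add((x, y, clr))
    return result


# Tiny 2x2 example with three screenshots instead of 100.
WHITE, BLACK, RED = (255, 255, 255), (0, 0, 0), (255, 0, 0)
shots = [
    [[WHITE, BLACK], [WHITE, WHITE]],
    [[WHITE, RED], [WHITE, WHITE]],
    [[WHITE, BLACK], [WHITE, BLACK]],
]
A = stable_pixels(shots, threshold=2)
```

With a threshold of 2 out of 3 screenshots, only the two pixels that are white in every screenshot are kept; the pixels whose content varies are excluded.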
Step S104: calculating the distance between the offset coordinates of any two pixel points in the set;
step S105: calculating the number of neighbor points of each pixel point according to the distance, and if the distance between the two pixel points is smaller than a second preset threshold, the two pixel points are the neighbor points;
step S106: if the number of the neighbor points is less than a third preset threshold value, deleting the pixel points from the set;
step S107: forming a mask by the elements in the set which are not deleted;
Illustratively, for each URL, the distance DL_ij between any two coordinates (x_i, y_i) and (x_j, y_j) in its set A is calculated; if DL_ij is less than the second preset threshold DLS, then (x_i, y_i) and (x_j, y_j) are neighbor points of each other. The number of neighbor points of each coordinate (x_i, y_i) is calculated; coordinates whose number of neighbor points is less than the third preset threshold NS are regarded as discrete points, the data items corresponding to these coordinates are deleted from the set A, and the data items in the set A that are not deleted together form the mask M. If the number of data items in the mask M is less than the preset threshold MS = α × W × H, the page that the URL points to may have changed significantly, and the page addresses of the website that is likely to be counterfeited need to be screened again.
With this periodic automatic monitoring of the counterfeited website pages, when a website page changes (that is, when the number of data items in the mask M falls below the preset threshold MS = α × W × H), technicians can be reminded automatically to update the URL address of that page specifically, which avoids the huge workload of updating a large-scale image database of counterfeited website pages in real time.
For example, with NS = 200, coordinates (x_i, y_i) with fewer than 200 neighbor points are regarded as discrete points, their data items are deleted from the set A, and the remaining data items form the mask M. With α = 0.3, if the number of data items in the mask M is less than MS = 0.3 × W × H, the page that the URL points to may have changed significantly, and the process returns to the step of screening the page addresses of the website that is likely to be counterfeited.
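The discrete point deletion stage (steps S104-S107) and the MS check can be sketched as below. The names `build_mask` and `page_probably_changed` are hypothetical, Euclidean distance is assumed for DL_ij (the text does not name the metric), and the pairwise loop is quadratic in the size of A, so a full-size screenshot would need a spatial index or grid in practice.

```python
import math


def build_mask(points, dls, ns):
    """Remove discrete points from A and return the mask M.

    points: iterable of (x, y, clr); dls: neighbour distance threshold (DLS);
    ns: minimum number of neighbours a point needs to be kept (NS).
    """
    pts = list(points)
    neighbour_count = [0] * len(pts)
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            # Assumption: DL_ij is the Euclidean distance between coordinates.
            if math.dist(pts[i][:2], pts[j][:2]) < dls:
                neighbour_count[i] += 1
                neighbour_count[j] += 1
    return {p for p, n in zip(pts, neighbour_count) if n >= ns}


def page_probably_changed(mask, width, height, alpha):
    """True when the mask holds fewer than MS = alpha * W * H data items."""
    return len(mask) < alpha * width * height


# A 2x2 cluster plus one isolated point far away from it.
A = {(0, 0, "w"), (1, 0, "w"), (0, 1, "w"), (1, 1, "w"), (10, 10, "w")}
M = build_mask(A, dls=1.5, ns=2)
```

In this toy input every point of the cluster has three neighbours within distance 1.5, while the isolated point has none and is removed.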
Step S108: and generating a key image corresponding to each page address by using the mask.
Specifically, filling color values corresponding to offset coordinates in a mask at positions corresponding to the offset coordinates on a blank image to generate a key image, wherein the size of the blank image is the same as that of the page screenshot.
For the mask M corresponding to each URL, on a blank image of width W and height H (the blank area is filled with no color), the color value given by the third component of each data item in the mask M is filled at the position given by its first two components (the coordinates), thereby generating the key image PM corresponding to the URL.
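Key image generation (step S108) amounts to painting the mask onto an empty canvas. In this sketch the function name `make_key_image` is hypothetical, and `None` stands in for the colorless fill, which is an assumption about how a blank pixel is represented.

```python
def make_key_image(mask, width, height, blank=None):
    """Build the key image PM: a width x height canvas that carries colour
    only at the coordinates listed in the mask."""
    image = [[blank] * width for _ in range(height)]
    for x, y, clr in mask:
        image[y][x] = clr  # third component filled at the first two
    return image


mask = {(0, 0, (255, 255, 255)), (1, 1, (0, 0, 0))}
PM = make_key_image(mask, width=2, height=2)
```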
Step S200: constructing a counterfeit website fingerprint library by using the key image and a preset counterfeit website detection model;
before this step, a counterfeit website detection model needs to be constructed, as shown in fig. 5, which is a construction flow chart of the counterfeit website detection model:
step S211: acquiring, by using preset counterfeit website blacklist data, a first website page screenshot of a counterfeit website and a second website page screenshot of the corresponding counterfeited website, to generate a training data set;
step S212: inputting the training data set into a ResNeXt-101 model for model training;
step S213: optimizing the model, wherein the optimization target is expressed as:
Figure BDA0003390862330000101
wherein 0 < i ≤ |T|; |T| is the number of pairs of first website page screenshots and second website page screenshots contained in the training data set, e.g. |T| = 10000; c denotes the parameters of the ResNeXt-101 model; c* is the optimal solution of c; FR_i and FF_i are respectively the first output data and the second output data corresponding to the i-th pair of the first website page screenshot and the second website page screenshot; and ||FR_i − FF_i||_2 denotes the Euclidean distance between the two multidimensional vectors.
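The optimization target itself appears in this text only as an image reference. One form that is consistent with the symbols defined here, given as an assumption rather than as the exact expression of the application, minimizes the total distance between the paired outputs over the model parameters:

```latex
c^{*} = \arg\min_{c} \sum_{i=1}^{|T|} \left\| FR_i - FF_i \right\|_2
```

Under this reading, training pulls the output for a counterfeit page toward the output for the page it imitates, so that a small distance between two outputs indicates visual imitation.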
First, a training data set needs to be constructed: using existing counterfeit website blacklist data, the blacklisted counterfeit websites and the corresponding counterfeited websites are visited and captured to obtain a first page screenshot FP of each counterfeit website and a second page screenshot RP of the corresponding counterfeited website, and these paired screenshots together form the training data set T.
The model is then trained: the ResNeXt-101 model is taken as the initial model, and the paired FP and RP in the training data set T are used as the data input of the model.
The trained model is optimized, and the maximum value of the Euclidean distance ||FR_i − FF_i||_2 over the training pairs is denoted as MP.
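Once the model outputs for all training pairs are available, MP is simply the largest pairwise distance. The sketch below assumes the outputs are plain sequences of floats; the function name `max_pair_distance` and the sample vectors are illustrative only.

```python
import math


def max_pair_distance(pairs):
    """Return MP, the largest Euclidean distance between paired outputs.

    pairs: iterable of (FR_i, FF_i), each an equal-length sequence of floats.
    """
    return max(math.dist(fr, ff) for fr, ff in pairs)


# Two hypothetical output pairs from an already trained model.
pairs = [([0.0, 0.0], [3.0, 4.0]), ([1.0, 1.0], [1.0, 2.0])]
MP = max_pair_distance(pairs)
```

Using the maximum as the decision threshold means every training pair would itself be classified as a match; a stricter threshold would trade missed detections for fewer false alarms.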
Fig. 6 shows the flowchart of constructing the counterfeit website fingerprint library; step S200 may specifically include:
step S201: inputting each key image into the counterfeit website detection model to obtain data output;
step S202: forming the counterfeit website fingerprint library from the data outputs.
Specifically, the key image PM corresponding to each URL in the counterfeited website image database is used as the data input of the counterfeit website detection model, and the corresponding data output of the model is recorded as FM. The FMs corresponding to all URLs together form the fingerprint library FPS, where |FPS| is the total number of FMs in the fingerprint library.
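The fingerprint library can be pictured as a mapping from URL to model output. In this sketch `build_fingerprint_library` is a hypothetical helper, and `toy_model` is a deliberately trivial stand-in (mean color of the filled pixels) for the trained ResNeXt-101 network, used only so that the example runs without a deep learning framework.

```python
def build_fingerprint_library(key_images, model):
    """Map each URL to FM, the model output for its key image PM."""
    return {url: model(pm) for url, pm in key_images.items()}


def toy_model(image):
    """Stand-in for the trained network: mean of each colour channel over
    the non-blank pixels. Not the model described in the text."""
    pixels = [p for row in image for p in row if p is not None]
    if not pixels:
        return [0.0, 0.0, 0.0]
    return [sum(p[ch] for p in pixels) / len(pixels) for ch in range(3)]


key_images = {
    "https://www.taobao.com/": [[(255, 0, 0), None], [None, (255, 0, 0)]],
}
FPS = build_fingerprint_library(key_images, toy_model)
```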
Step S300: detecting the website to be detected by using the mask, the counterfeit website fingerprint library and the counterfeit website detection model, so as to determine whether the website to be detected is a counterfeit website.
Fig. 7 is a flowchart of detecting the website to be detected, which specifically includes the following steps:
step S301: searching whether the website domain name of the website to be detected exists in a domain name list of the counterfeited website image database or not;
step S302: if not, acquiring a first website screenshot of the website to be detected;
step S303: based on the mask, retaining the color values of the pixels of the first website screenshot at the offset coordinates specified by the mask, to generate a second website screenshot;
step S304: inputting the second website screenshot into the counterfeit website detection model to obtain an output result;
step S305: respectively calculating the Euclidean distance between the output result and each data output in the counterfeit website fingerprint database;
step S306: if the Euclidean distance is smaller than or equal to the maximum Euclidean distance, judging that the website to be detected is a counterfeit website; the maximum Euclidean distance is the maximum value of the Euclidean distances between the first output data and the second output data.
Specifically, for the website address U to be detected, the domain name list L is automatically searched to check whether the website domain name of U already exists in it. If the domain name of U does not exist in the domain name list L, the web page pointed to by U is visited and a first website screenshot UP is generated.
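The domain name lookup can be sketched as below. The text does not say how domain names are compared, so this sketch assumes an exact, case-insensitive match on the host name; the function names and example hosts are hypothetical.

```python
from typing import Iterable
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Lower-cased host name of a URL, without port or credentials."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host name in URL: {url!r}")
    return host


def needs_detection(url: str, domain_list: Iterable[str]) -> bool:
    """True when the URL's domain is absent from the domain name list L."""
    known = {d.lower() for d in domain_list}
    return domain_of(url) not in known
```

For example, `needs_detection("http://bank-example.test/", ["bank.example"])` returns `True`, so that address would proceed to the screenshot step, while an address on `bank.example` itself would be skipped.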
For each data item (x_i, y_i, clr_i) in the mask M corresponding to each URL in the counterfeited website image database, only the color value of UP at coordinate (x_i, y_i) is kept; the color values of all other pixels of UP are replaced with a no-color value, and the newly generated image is recorded as the second website screenshot UP'. UP' is used as the data input of the model, and the output result of the model is recorded as FU. FU and each FM_i in the counterfeit website fingerprint library are then taken as data input, and ||FU - FM_i||_2 is calculated in turn, where 0 < i ≤ |FPS|. If ||FU - FM_i||_2 ≤ MP, the website address U is judged to be a suspected counterfeit website address, the website it is suspected of imitating is the one at the URL corresponding to i, and U is handed over to a security specialist for manual verification.
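The masking and comparison just described can be sketched as follows. This is a simplified illustration under stated assumptions: a single mask is applied, images are nested lists indexed as image[y][x] with `None` standing for "no color", and FU is supplied directly instead of being computed by the trained model. All names and toy values are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Color = Tuple[int, int, int]
Image = List[List[Optional[Color]]]        # image[y][x]; None marks "no color"
Mask = Sequence[Tuple[int, int, Color]]    # data items (x_i, y_i, clr_i)


def apply_mask(up: Image, mask: Mask) -> Image:
    """Build UP': keep UP's pixels at the mask coordinates, blank the rest."""
    height, width = len(up), len(up[0])
    masked: Image = [[None] * width for _ in range(height)]
    for x, y, _clr in mask:
        if 0 <= x < width and 0 <= y < height:
            masked[y][x] = up[y][x]
    return masked


def match_fingerprints(
    fu: Sequence[float], fps: Dict[str, Sequence[float]], mp: float
) -> List[str]:
    """URLs whose fingerprint FM_i satisfies ||FU - FM_i||_2 <= MP."""
    return [url for url, fm in fps.items() if math.dist(fu, fm) <= mp]


up = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (9, 9, 9)]]
mask = [(0, 0, (255, 0, 0)), (1, 1, (9, 9, 9))]
up_prime = apply_mask(up, mask)

fps = {
    "https://bank.example/login": [1.0, 5.0],
    "https://shop.example/": [10.0, 10.0],
}
suspects = match_fingerprints([1.0, 4.0], fps, mp=2.0)
```

In this toy run only the first fingerprint lies within MP of FU, so only that URL is reported as the page the address is suspected of imitating. A real implementation would also need to convert between this row indexing and whatever origin the mask offsets are measured from.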
Based on the principle of key image recognition for counterfeited websites, a statistical method is used to generate the image mask and the key image of each counterfeited website. This reduces the interference of dynamic web page content, website page updates, website page failures and similar conditions on counterfeit website detection, and improves detection accuracy.
The detection method uses key feature region identification on page images, which reduces the interference of frequently changing dynamic page content on website analysis, lets the website detection model focus on key features, and improves the robustness and accuracy of the model.
In addition, counterfeit website detection is performed with a deep learning model: the training process is improved with a training objective function designed for counterfeit website detection, and detection speed is increased with a cache-and-compare mechanism for page image detection results. This addresses the low detection accuracy of traditional counterfeit website detection techniques, and detecting counterfeit websites with a machine learning model offers higher efficiency and a higher level of automation than those techniques.
Because page image detection results are cached, the web page image to be detected only needs to be compared with the cached detection results of the counterfeited web page images, which speeds up web page image retrieval.
An embodiment of the present application further provides a counterfeit website detection apparatus, which applies the counterfeit website detection method of the foregoing embodiment. Fig. 8 is a block diagram of the structure of the apparatus, which specifically includes but is not limited to:
a key image generation module 100, configured to identify a key feature region of a counterfeited website by using a preset counterfeited website image database, so as to generate a mask and a key image;
the fingerprint library construction module 200 is used for constructing a counterfeit website fingerprint library by using the key image and a preset counterfeit website detection model;
the detection module 300 is configured to detect the website to be detected by using the mask, the counterfeit website fingerprint library, and the counterfeit website detection model, so as to determine whether the website to be detected is a counterfeit website.
Fig. 9 is a block diagram of another counterfeit website detection apparatus. The apparatus further includes a database construction module 110, configured to:
acquiring website domain names of a plurality of websites and removing duplication to generate a domain name list;
screening page addresses corresponding to the website domain names;
acquiring a website page corresponding to the page address, and performing screenshot on the website page to obtain a page screenshot;
and constructing a counterfeited website image database by using the domain name list, the page address and the page screenshot, and regularly updating the page screenshot.
Also included is a model building module 210 for:
acquiring a first website page screenshot of a counterfeit website and a corresponding second website page screenshot of the counterfeit website by using preset counterfeit website blacklist data to generate a training data set;
inputting the training data set into a ResNeXt-101 model for model training;
optimizing the model, wherein the optimization target is expressed as:
Figure BDA0003390862330000141
wherein 0 < i ≤ |T|; |T| is the number of pairs of first website page screenshots and second website page screenshots contained in the training data set; c denotes the parameters of the ResNeXt-101 model and c* is the optimal solution for c; and FR_i and FF_i are, respectively, the first output data and the second output data corresponding to the i-th pair of first and second website page screenshots.
It should be noted that specific implementation processes of the key image generation module 100, the fingerprint library construction module 200, and the detection module 300 have been described in detail in the method embodiment, and are not described herein again.
The embodiment of the application further provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic equipment to execute the counterfeit website detection method.
An embodiment of the present application further provides a readable storage medium, where computer program instructions are stored in the readable storage medium, and when the computer program instructions are read and executed by a processor, the method for detecting a counterfeit website is executed.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A counterfeit website detection method, comprising:
identifying a key feature area of the counterfeited website by using a preset counterfeited website image database to generate a mask and a key image;
constructing a counterfeit website fingerprint library by using the key image and a preset counterfeit website detection model;
and detecting the website to be detected by using the mask, the counterfeit website fingerprint library and the counterfeit website detection model so as to determine whether the website to be detected is a counterfeit website.
2. The method of claim 1, wherein before the step of identifying key feature regions of the counterfeited website by using a preset database of images of the counterfeited website, the method further comprises constructing a database of images of the counterfeited website:
acquiring website domain names of a plurality of websites and removing duplication to generate a domain name list;
screening page addresses corresponding to the website domain names;
acquiring a website page corresponding to the page address, and performing screenshot on the website page to obtain a page screenshot;
and constructing a counterfeited website image database by using the domain name list, the page address and the page screenshot, and regularly updating the page screenshot.
3. The method for detecting the counterfeit website according to claim 2, wherein the identifying the key feature area of the counterfeit website by using a preset counterfeit website image database to generate the mask and the key image comprises:
acquiring a preset number of arbitrary page screenshots from the counterfeited website image database;
acquiring a color value at any pixel point of each page screenshot;
when the occurrence frequency of the color value mode is greater than a first preset threshold value, recording a set formed by corresponding pixel coordinates and the color value mode, wherein the set is represented as:
A={(x,y,clr)|0≤x<W,0≤y<H};
wherein A represents the set, (x, y) represents offset coordinates of any pixel point relative to the lower left corner of the page screenshot, clr represents the color value mode, W represents the width of the page screenshot, and H represents the height of the page screenshot;
calculating the distance between the offset coordinates of any two pixel points in the set;
calculating the number of neighbor points of each pixel point according to the distance, and if the distance between the two pixel points is smaller than a second preset threshold, the two pixel points are the neighbor points;
if the number of the neighbor points is less than a third preset threshold value, deleting the pixel points from the set;
forming a mask by the elements in the set which are not deleted;
and generating a key image corresponding to each page address by using the mask.
4. The counterfeit website detection method of claim 3, wherein the generating the key image corresponding to each page address by using the mask comprises:
and filling color values corresponding to the offset coordinates at the offset coordinate position of the mask on a blank image to generate a key image, wherein the size of the blank image is the same as that of the screenshot of the page.
5. The counterfeit website detection method of claim 1, wherein prior to the step of constructing a counterfeit website fingerprint library using the key image and a preset counterfeit website detection model, the method further comprises constructing a counterfeit website detection model:
acquiring a first website page screenshot of a counterfeit website and a corresponding second website page screenshot of the counterfeit website by using preset counterfeit website blacklist data to generate a training data set;
inputting the training data set into a ResNeXt-101 model for model training;
optimizing the model, wherein the optimization target is expressed as:
Figure FDA0003390862320000021
wherein 0 < i ≤ |T|; |T| is the number of pairs of first website page screenshots and second website page screenshots contained in the training data set; c denotes the parameters of the ResNeXt-101 model and c* is the optimal solution for c; and FR_i and FF_i are, respectively, the first output data and the second output data corresponding to the i-th pair of first and second website page screenshots.
6. The counterfeit website detection method according to claim 5, wherein the constructing a counterfeit website fingerprint library by using the key image and a preset counterfeit website detection model comprises:
inputting each key image into the counterfeit website detection model to obtain data output;
and forming a counterfeit website fingerprint library from the data outputs.
7. The method for detecting the counterfeit website according to claim 6, wherein the detecting the website to be detected by using the mask, the fingerprint library of the counterfeit website and the detection model of the counterfeit website to determine whether the website to be detected is the counterfeit website comprises:
searching whether the website domain name of the website to be detected exists in a domain name list of the counterfeited website image database or not;
if not, acquiring a first website screenshot of the website to be detected;
based on the mask, retaining the color values of the pixels of the first website screenshot at the offset coordinates specified by the mask, to generate a second website screenshot;
inputting the second website screenshot into the counterfeit website detection model to obtain an output result;
respectively calculating the Euclidean distance between the output result and each data output in the counterfeit website fingerprint database;
if the Euclidean distance is smaller than or equal to the maximum Euclidean distance, judging that the website to be detected is a counterfeit website; the maximum Euclidean distance is the maximum value of the Euclidean distances between the first output data and the second output data.
8. A counterfeit website detection apparatus, the apparatus comprising:
the key image generation module is used for identifying a key feature area of the counterfeited website by utilizing a preset counterfeited website image database so as to generate a mask and a key image;
the fingerprint library construction module is used for constructing a counterfeit website fingerprint library by using the key image and a preset counterfeit website detection model;
and the detection module is used for detecting the website to be detected by utilizing the mask, the counterfeit website fingerprint library and the counterfeit website detection model so as to determine whether the website to be detected is a counterfeit website.
9. An electronic device, comprising a memory for storing a computer program and a processor for executing the computer program to cause the electronic device to perform the counterfeit website detection method according to any one of claims 1 to 7.
10. A readable storage medium having stored thereon computer program instructions which, when read and executed by a processor, perform the counterfeit website detection method of any one of claims 1 to 7.
CN202111464708.9A 2021-12-03 2021-12-03 Method and device for detecting counterfeit website, electronic equipment and storage medium Active CN114124564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111464708.9A CN114124564B (en) 2021-12-03 2021-12-03 Method and device for detecting counterfeit website, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111464708.9A CN114124564B (en) 2021-12-03 2021-12-03 Method and device for detecting counterfeit website, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114124564A true CN114124564A (en) 2022-03-01
CN114124564B CN114124564B (en) 2023-11-28

Family

ID=80365797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111464708.9A Active CN114124564B (en) 2021-12-03 2021-12-03 Method and device for detecting counterfeit website, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114124564B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499156A (en) * 2022-07-29 2022-12-20 天翼云科技有限公司 Website background information leakage detection method, electronic device and storage medium

Also Published As

Publication number Publication date
CN114124564B (en) 2023-11-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant