WO2021060973A1 - A system and method to prevent bot detection

Publication number
WO2021060973A1
WO2021060973A1 PCT/MY2020/050090 MY2020050090W WO2021060973A1 WO 2021060973 A1 WO2021060973 A1 WO 2021060973A1 MY 2020050090 W MY2020050090 W MY 2020050090W WO 2021060973 A1 WO2021060973 A1 WO 2021060973A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
bot
browser
time
server
Application number
PCT/MY2020/050090
Other languages
French (fr)
Inventor
Amru Yusrin Bin AMRUDDIN
Rosnin Bin MUSTAFFA
Muhammad Awis Jamaluddin Bin JOHARI
Mohd Marzuq Ikram Bin MOHD HELMI
Original Assignee
Mimos Berhad
Application filed by Mimos Berhad
Publication of WO2021060973A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services

Definitions

  • Randomly changing the Internet Protocol address of the data harvesting bot referral (402) further comprises the steps (500) of accessing a plurality of harvest bot proxies, each with its own Internet Protocol address (502), and randomly selecting any harvest bot proxy to access a targeted server (504).
  • Delaying the data harvesting bot at a random interval of time further comprises the steps (600) of first reading the cache expiration setting (602). Thereafter, it is checked whether the data request to the target server has been cached (604). If it has been cached, the difference of the cached data time, that is, the time elapsed from the cached data time up to the current system time, is compared with the cache expiration time setting to determine whether it is less than the cache expiration time setting (606).
  • The step of checking whether the data request to the target server has been cached (604) further comprises randomly delaying the request for each new data item within a range of time predetermined by the user.
  • The step of randomly changing the header of the browser further comprises the steps (700) of randomizing the rendering engine in the Browser User Agent Randomizer (702), whereby the rendering engine refers to Firefox, Chrome, Safari, Opera, or any other browser. The bot can be randomly assigned to grab data through any available browser installed in Windows. Thereafter, a randomized version of the browser is obtained from the rendering engine (704). Subsequently, the system and browser information is randomized (706); the browser’s platform is randomized (708); the details of the browser’s platform are randomized (710); and the extensions of the browser are randomized to indicate specific enhancements that are available in the browser (712).
  • The current browser user agent identity is divided into six segments: the rendering engine segment, the version segment, the system and browser information segment, the platform segment, the platform details segment and the extensions segment. During this step, the browser user agent randomizer module (110) generates a randomized string for each segment of the user agent identity.
  • The present invention provides a Human Like Browsing Behaviour Identity Engine having an IP number randomizer module, a time and data cache randomizer module, a spiral keywords and page rotation module and a browser user agent randomizer module in the system.
  • The Human Like Browsing Behaviour Identity Engine mimics human like behaviour in web browsing by randomizing the IP address used to access the targeted server through the IP number randomizer; caching content and delaying access to the targeted server through the time and data cache randomizer; rotating pages and keywords across targeted servers through the spiral keywords and pages rotation; and generating a randomized browser version header through the browser header randomizer (140).

Abstract

The present invention provides a system (100) and a method (400) to prevent bot detection by replicating human like browsing behaviour. In particular, a Human Like Browsing Behaviour Identity Engine (102) is in communication with at least one server in a computing network of the system of the present invention. The Human Like Browsing Behaviour Identity Engine (102) comprises at least one Internet Protocol Number Randomizer (104) for randomly changing data harvesting bot referral Internet Protocol; at least one Time and Data Cache Randomizer (106) for delaying data harvesting bot at a random interval of time; at least one Spiral Keywords and Pages Rotation Module (108) for changing keywords and pages in spiral rotation; and at least one Browser User Agent Randomizer (110) for randomly changing browser header.

Description

A SYSTEM AND METHOD TO PREVENT BOT DETECTION
FIELD OF INVENTION
The present invention relates to a system and method to prevent bot detection by replicating human like browsing behaviour. In particular, the present invention provides a Human Like Browsing Behaviour Identity Engine which is in communication with at least one server in a computing network of the system, so that a server firewall or captcha does not block or deny access to a third party.
BACKGROUND ART
Data harvesting, or web scraping, has always been a concern for website operators and data publishers. Data harvesting is the same process as data mining or data extraction: it is a process of scraping data or information from websites by using an automated bot. In other words, a small script, also known as a malicious bot, is used to extract large amounts of data from websites for use in other purposes. Data harvesting usually targets a database in particular: it collects information through the database and then downloads each record for a particular purpose.
In data harvesting, data is extracted from a targeted website. Such data can be any type of data that the harvester wants, such as simple text found on the page or within the page’s code, directory information from a retail site, a series of images and videos, or all of those items at once. Nevertheless, a data harvesting bot is often blocked by a server firewall or by a captcha system to ensure the security of the data. The data harvesting bot is usually detected by the server because the identity it uses remains similar from one access to the target server to the next.
United States Patent No. US 7139747 B1 (hereinafter referred to as the US 747 B1 Patent) entitled “System and method for distributed web crawling” having a filing date of 3 November 2000 (Patentee: Hewlett-Packard Development Company, L.P., Houston, TX (US)) provides for efficient downloading of data set addresses from among a plurality of host computers, using a plurality of web crawlers. Each web crawler as disclosed in the US 747 B1 Patent identifies URLs in the data sets it downloads, and identifies the host computer identifier within each such URL. The host computer identifier for each URL is mapped to the web crawler identifier of one of the web crawlers. Each web crawler sends URLs to the other web crawlers for processing, and each web crawler receives URLs from the other web crawlers for processing. Further, each web crawler filters the URLs assigned to it by comparing them against a database of URLs already known by the web crawler and removing the already known URLs. If a URL is not already known to the web crawler, the data set corresponding to the URL is scheduled for downloading.
International Patent Application Publication No. WO 2015/021912 A1 (hereinafter referred to as the WO 912 A1 Publication) entitled “Incrementally retrieving data for objects to provide a desired level of detail” having a filing date of 12 August 2014 (Patentee: International Business Machines Corporation) provides a computer-implemented method for retrieving data and metadata into an indexed repository according to data harvesting criteria. The invention disclosed in the WO 912 A1 Publication also provides a user interface which allows a user to define data harvesting criteria for adjusting a subset of data and metadata on an indexed repository. Once the data harvesting criteria are defined through the user interface, the subset of data and metadata on the indexed repository is adjusted according to those criteria.
United States Patent Application Publication No. US 2017/0098100 A1 (hereinafter referred to as the US 100 A1 Publication) entitled “Consent handling during data harvesting” having a filing date of 18 November 2015 (Patentee: SAP SE, Walldorf (DE)) describes a method that can include receiving social media data associated with a user identifier and a first country code. A stored consent configuration rule can specify whether to store the social media data anonymously or non-anonymously. The consent configuration rule of the US 100 A1 Publication can be associated with a second country code. It can further be determined whether the second country code associated with the consent configuration rule matches the first country code associated with the social media data.
With reference to the above-mentioned disclosures, there is indeed a need for a system and method that is able to prevent bot detection by replicating human like browsing behaviour.
SUMMARY OF INVENTION
The present invention relates to a system and method to prevent bot detection by replicating human like browsing behaviour. In particular, a Human Like Browsing Behaviour Identity Engine (102) is in communication with at least one server in a computing network of the system of the present invention.
One aspect of the invention provides a system (100) to prevent bot detection by replicating human like browsing behaviour comprising a Human Like Browsing Behaviour Identity Engine (102) in communication with at least one server in a computing network. The Human Like Browsing Behaviour Identity Engine (102) comprises at least one Internet Protocol Number Randomizer (104) for randomly changing data harvesting bot referral Internet Protocol; at least one Time and Data Cache Randomizer (106) for delaying data harvesting bot at a random interval of time; at least one Spiral Keywords and Pages Rotation Module (108) for changing keywords and pages in spiral rotation; and at least one Browser User Agent Randomizer (110) for randomly changing browser header.
Another aspect of the invention provides that the at least one Internet Protocol Number Randomizer (104) further comprises a plurality of registered harvest bot proxies.
A further aspect of the invention provides that each of the plurality of registered harvest bot proxies has its own Internet Protocol address.
A further aspect of the invention provides a method (400) to prevent bot detection by replicating human like browsing behaviour. The method comprises the steps of randomly changing the Internet Protocol address of the data harvesting bot referral (402); delaying the data harvesting bot at a random interval of time (404); changing keywords and pages in a spiral rotation (406); randomly changing the header of the browser (408); and allowing the data harvesting bot to access a targeted server (410).
Yet another aspect of the invention provides that the step for randomly changing the Internet Protocol address of the data harvesting bot referral (402) further comprises the steps (500) of accessing a plurality of harvest bot proxies, each with its own Internet Protocol address (502); and randomly selecting any harvest bot proxy to access a targeted server (504).
Still another aspect of the invention provides that the step for delaying the data harvesting bot at a random interval of time (404) further comprises the steps (600) of reading the cache expiration time setting (602); checking whether the data request to the target server has been cached (604); and, if it has been cached, checking the difference of the cached data time and comparing it to the cache expiration time setting to determine whether the cached data time is less than the cache expiration time setting (606). If the cached data time is less than the cache expiration time setting, the cached data is reused and new data is not requested from the targeted server (610); otherwise, new data is obtained from the targeted server and stored as new cached data (608). If the data request to the target server has not been cached, new data is likewise obtained from the targeted server and stored as new cached data (608).
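The cache check in steps 602 to 610 can be illustrated with a minimal sketch. It is not the patented implementation: the function name `get_data`, the in-memory dictionary cache, the `fetch` callback and the `now` parameter are all assumptions introduced here for illustration, and "cached data time" is interpreted as the time elapsed since the data was stored.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical in-memory cache: request key -> (time stored, data).
Cache = Dict[str, Tuple[float, str]]


def get_data(
    key: str,
    fetch: Callable[[str], str],
    cache: Cache,
    expiration_seconds: float,
    now: Optional[float] = None,
) -> str:
    """Return cached data while it is fresh, otherwise fetch and re-cache it."""
    # Step 602 (assumed form): the expiration setting arrives as a parameter.
    now = time.time() if now is None else now

    # Step 604: has this request been cached?
    entry = cache.get(key)
    if entry is not None:
        stored_at, data = entry
        # Step 606: compare the age of the cached data with the setting.
        age = now - stored_at
        if age < expiration_seconds:
            # Step 610: reuse the cached data; no new request is made.
            return data

    # Step 608: obtain new data and store it as the new cached data.
    data = fetch(key)
    cache[key] = (now, data)
    return data
```

Passing `now` explicitly makes the behaviour easy to check without waiting: a second call within the expiration window returns the stored value without invoking `fetch`, while a call at or after the expiration time fetches again.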
A further aspect of the invention provides that the step of checking whether the data request to the target server has been cached (604) further comprises randomly delaying the request for each new data item within a range of time predetermined by the user.
Yet another aspect of the invention provides that the step of changing keywords and pages in a spiral rotation (406) further comprises changing the data harvesting bot to another targeted server with a different page within a predetermined time set by the user.
Still another aspect of the invention provides that the step of randomly changing the header of the browser (408) further comprises the steps (700) of randomizing the rendering engine in the Browser User Agent Randomizer (702); obtaining a randomized version of the browser from the rendering engine (704); randomizing the system and browser information (706); randomizing the browser’s platform (708); randomizing the details of the browser’s platform (710); and randomizing the extensions of the browser to indicate specific enhancements that are available in the browser (712).
The present invention consists of features and a combination of parts hereinafter fully described and illustrated in the accompanying drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.
BRIEF DESCRIPTION OF ACCOMPANYING DRAWINGS
To further clarify various aspects of some embodiments of the present invention, a more particular description of the invention will be rendered by references to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the accompanying drawings.
FIG. 1.0 illustrates the general architecture of the system of the present invention.
FIG. 2.0 illustrates the IP number randomizer module of the present invention.
FIG. 3.0 illustrates a mechanism of spiral keywords and pages rotation.
FIG. 4.0 is a flowchart illustrating the methodology of the present invention.
FIG. 5.0 is a flowchart illustrating the steps of randomly changing Internet Protocol address of data harvesting bot referral.
FIG. 6.0 is a flowchart illustrating the steps of delaying data harvesting bot at a random interval of time.
FIG. 7.0 is a flowchart illustrating the steps of randomly changing header of browser.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention relates to a system and method to prevent bot detection by replicating human like browsing behaviour. In particular, a Human Like Browsing Behaviour Identity Engine (102) is in communication with at least one server in a computing network of the system of the present invention. Hereinafter, this specification will describe the present invention according to the preferred embodiments. It is to be understood that limiting the description to the preferred embodiments of the invention is merely to facilitate discussion of the present invention, and it is envisioned that modifications may be made without departing from the scope of the appended claims.
Reference is first made to FIG. 1.0 and FIG. 2.0 respectively. FIG. 1.0 illustrates the general system architecture of the present invention while FIG. 2.0 illustrates the IP number randomizer module of the present invention. As illustrated in FIG. 1.0, the system (100) to prevent bot detection by replicating human like browsing behaviour comprises a Human Like Browsing Behaviour Identity Engine (102) in communication with at least one server (112) in a computing network. The Human Like Browsing Behaviour Identity Engine (102) comprises at least one Internet Protocol Number Randomizer (104) for randomly changing data harvesting bot referral Internet Protocol; at least one Time and Data Cache Randomizer (106) for delaying data harvesting bot at a random interval of time; at least one Spiral Keywords and Pages Rotation Module (108) for changing keywords and pages in spiral rotation; and at least one Browser User Agent Randomizer (110) for randomly changing browser header. As illustrated in FIG. 2.0, the at least one Internet Protocol Number Randomizer (104) further comprises a plurality of registered harvest bot proxies, each of which has its own Internet Protocol address. A harvest bot proxy is chosen randomly by the data harvesting bot to access the targeted server (112), so that each data harvest presents a different IP address. The IP address used for each data harvest therefore depends on the number of proxy servers: it is randomized among the IP addresses of the proxy servers.
Reference is now made to FIG. 3.0 which illustrates a mechanism of spiral keywords and pages rotation. As illustrated in FIG. 3.0, the spiral keywords and page rotation module prevents the data harvesting bot from crawling for the searched keywords in the same targeted server for a long time, by moving it from one targeted server to another within a time frame. For example, a web bot is assigned to grab all data based on the keyword “Malaysia” in Google search, Yahoo search, Bing search, Facebook search and so on. A page here means a page of the paginated search results in a search engine. For instance, when a user searches on Google, the pagination appears at the bottom of the search results.
Table 1.0 (table image not reproduced)
Table 2.0 (table image not reproduced)
Table 1.0 illustrates the results of the currently available method of grabbing data from a targeted server. Table 1.0 shows a normal web bot method, in which the web bot grabs all data in the same category or source. As illustrated in Table 1.0, the data harvesting bot stays on the Google web server for 2 seconds to obtain data containing “Malaysia” from pages 1 and 2 before changing to another targeted server such as Blogspot or Facebook. For example, if there are a total of 100 pages, the data harvesting bot will stay on the same targeted server for 100 seconds, during which the firewall and captcha system will block the action.
Table 2.0 illustrates the results of the method of the present invention for grabbing data from targeted servers. As illustrated in Table 2.0, the data harvesting bot does not stay on the Google web server for 4 seconds to obtain data containing “Malaysia” from pages 1 and 2. Instead, the method of the present invention has the data harvesting bot jump between targeted servers such as Google, Blogspot, Facebook and Bing, spending each second on a different targeted server with a different page. There is a delay of about 3 seconds before the data harvesting bot returns to the same targeted server. With this delay, the data harvesting bot will not be blocked by the firewall or captcha system.
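The rotation order can be sketched as a simple schedule. This is a minimal illustration only: the function name `spiral_schedule`, the one-request-per-second tick and the rule of advancing the page after each full round of servers are assumptions, since the exact ordering in Table 2.0 is not reproduced in the text.

```python
def spiral_schedule(servers, pages):
    """Return a list of (second, server, page) visits.

    Assumption: one request per second, a different server on each
    consecutive second, and the page number advances only after every
    server has been visited once.
    """
    schedule = []
    tick = 0
    for page in pages:
        for server in servers:
            schedule.append((tick, server, page))
            tick += 1
    return schedule


if __name__ == "__main__":
    for second, server, page in spiral_schedule(
        ["Google", "Blogspot", "Facebook", "Bing"], [1, 2]
    ):
        print(f"t={second}s  {server}  page {page}")
```

With four servers, the three seconds spent on the other servers separate two consecutive visits to the same server, which matches the delay of about 3 seconds described above.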
Reference is now made to FIG. 4.0 which is a flowchart illustrating the methodology of the present invention. As illustrated in FIG. 4.0, the method (400) to prevent bot detection by replicating human like browsing behaviour comprises the steps of first randomly changing the Internet Protocol address of the data harvesting bot referral (402). Thereafter, the data harvesting bot is delayed at a random interval of time (404), and keywords and pages are changed in a spiral rotation (406) while the header of the browser is changed randomly (408). The delay of the data harvesting bot is set in the bot settings, based on the expiry time of the cached file. The step of changing keywords and pages in a spiral rotation (406) further comprises changing the data harvesting bot to another targeted server with a different page within a predetermined time set by the user. Subsequently, the data harvesting bot is allowed to access a targeted server (410). Public servers such as Google, Yahoo and Facebook allow the public to use their search engines subject to a condition of access, namely that bots are not allowed. A bot is usually detected when the requests come from the same IP address, which denotes the same server, with the same browser header and the same keyword or page requests.
Reference is now made to FIG. 5.0 which is a flowchart illustrating the steps of randomly changing the Internet Protocol address of the data harvesting bot referral. Randomly changing the Internet Protocol address of the data harvesting bot referral (402) further comprises steps of (500) accessing a plurality of harvest bot proxies, each proxy having its own Internet Protocol address (502), and randomly selecting any harvest bot proxy to access a targeted server (504).
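As a minimal sketch of steps (502) and (504), the random selection can be written as a uniform choice over the registered proxies. The function name `pick_proxy` and the addresses are hypothetical; the addresses come from the 192.0.2.0/24 range reserved for documentation.

```python
import random

# Hypothetical registry of harvest bot proxies, each with its own IP address.
REGISTERED_PROXIES = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]


def pick_proxy(proxies, rng=random):
    """Randomly select one registered proxy (step 504)."""
    if not proxies:
        raise ValueError("no registered harvest bot proxies")
    return rng.choice(proxies)


if __name__ == "__main__":
    print(pick_proxy(REGISTERED_PROXIES))
```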
Reference is now made to FIG. 6.0 which is a flowchart illustrating the steps of delaying the data harvesting bot at a random interval of time. As illustrated in FIG. 6.0, delaying the data harvesting bot at a random interval of time (404) further comprises steps of (600) first reading the cache expiration time setting (602). Thereafter, it is checked whether the data request to the targeted server has been cached (604). If the data request to the targeted server has been cached, the difference of cached data time, that is, the time elapsed from the cached data time to the current system time, is computed and compared with the cache expiration time setting to determine whether it is less than the cache expiration time setting (606). If the difference is less than the cache expiration time setting, the cached data is reused and new data is not requested from the targeted server; otherwise, if it is more than the cache expiration time setting, new data is obtained from the targeted server and stored as new cached data (608). If the data request to the targeted server has not been cached (604), new data is obtained from the targeted server and stored as new cached data (608). The step of checking whether the data request to the targeted server has been cached (604) further comprises delaying the request for each new data randomly, within a range of time predetermined by the user.
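The cache check of FIG. 6.0 can be sketched as follows. This is a hedged illustration, not the patented implementation: the class name `TimedCache`, its parameters and the injectable `clock` and `sleep` callables are assumptions introduced so the logic can be exercised without real waiting.

```python
import random
import time


class TimedCache:
    """Reuse cached data until it is older than the expiration setting."""

    def __init__(self, expiration_seconds, delay_range=(0.0, 0.0),
                 clock=time.time, sleep=time.sleep):
        self.expiration_seconds = expiration_seconds  # setting read in (602)
        self.delay_range = delay_range                # user-defined delay range
        self._clock = clock
        self._sleep = sleep
        self._store = {}                              # key -> (cached_at, data)

    def get(self, key, fetch):
        entry = self._store.get(key)                  # has it been cached? (604)
        if entry is not None:
            cached_at, data = entry
            age = self._clock() - cached_at           # difference of cached data time
            if age < self.expiration_seconds:         # comparison (606)
                return data                           # reuse, no new request
        # Not cached or expired: wait a random time, then request new data (608).
        self._sleep(random.uniform(*self.delay_range))
        data = fetch(key)
        self._store[key] = (self._clock(), data)
        return data
```

A caller would pass its own `fetch` function, for example `cache.get("Malaysia", fetch)`, and the function is invoked only when the cached entry is missing or expired.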
Reference is now made to FIG. 7.0 which is a flowchart illustrating the steps of randomly changing the header of the browser. As illustrated in FIG. 7.0, the step of randomly changing the header of the browser (408) further comprises steps of (700) randomizing the rendering engine in the Browser User Agent Randomizer (702), whereby the rendering engine refers to Firefox, Chrome, Safari, Opera, or any other browser. The bot can be randomly assigned to grab data through any available browser installed in Windows. Thereafter, a randomized version of the browser is obtained from the rendering engine (704). Subsequently, system and browser information is randomized (706); the browser’s platform is randomized (708); details of the browser’s platform are randomized (710); and extensions of the browser are randomized to indicate specific enhancements that are available in the browser (712). The current browser user agent identity is divided into six segments, which are the rendering engine segment, version segment, system and browser information segment, platform segment, platform details segment and extensions segment. During this step, the browser user agent randomizer module (110) generates a randomized string for each segment of the user agent identity.
Examples of the output of the browser user agent randomizer (110) from the same data harvesting bot are shown below:
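The six-segment composition can be sketched as one random pick per segment followed by string assembly. The pool contents, the names `SEGMENT_POOLS` and `random_user_agent`, and the exact spacing of the output are assumptions made for illustration; the specification does not state how candidate values are stored.

```python
import random

# Hypothetical candidate values for each of the six segments.
SEGMENT_POOLS = {
    "rendering_engine": ["Mozilla", "Mosaic"],
    "version": ["5.0", "1.0"],
    "system_info": [
        "iPad; U; CPU OS 3_2_1 like Mac OS X; en-us",
        "compatible; MSIE 9.0; Windows NT 6.0",
    ],
    "platform": ["AppleWebKit/531.21.10", "WebKit/100"],
    "platform_details": ["KHTML, like Gecko", "Gecko"],
    "extensions": ["Mobile/7B405", "XBOX"],
}


def random_user_agent(pools=SEGMENT_POOLS, rng=random):
    """Pick one value per segment and join them into a user agent string."""
    pick = {name: rng.choice(values) for name, values in pools.items()}
    return (
        "{rendering_engine}/{version} ({system_info})"
        "{platform} ({platform_details}) {extensions}"
    ).format(**pick)


if __name__ == "__main__":
    print(random_user_agent())
```

Because each segment is drawn independently, the sketch can produce combinations that no real browser sends.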
[Rendering Engine]/[version]([system and browser information])[platform]([platform details]) [extensions]
Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us)AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405
Mosaic/1.0 (compatible; MSIE 9.0; Windows NT 6.0; SV1;.NET CLR 1.0.3705;.NET CLR 1.1.4322) WebKit/100 (Gecko) XBOX
Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us)AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405
Mosaic/1.0 (compatible; MSIE 9.0; Windows NT 6.0; SV1;.NET CLR 1.0.3705;.NET CLR 1.1.4322) WebKit/100 (Gecko) XBOX
The present invention provides a Human Like Browsing Behaviour Identity Engine having an IP number randomizer module, a time and data cache randomizer module, a spiral keywords and pages rotation module and a browser user agent randomizer module in the system. The Human Like Browsing Behaviour Identity Engine mimics human like behaviour in web browsing by randomizing the IP address used to access the targeted server through the IP number randomizer; performing content caching and time delay on access to the targeted server through the time and data cache randomizer; rotating pages and keywords in a spiral across targeted servers through the spiral keywords and pages rotation module; and generating a randomized browser version header through the browser header randomizer (140).
Unless the context requires otherwise or specifically stated to the contrary, integers, steps or elements of the invention recited herein as singular integers, steps or elements clearly encompass both singular and plural forms of the recited integers, steps or elements. Throughout this specification, unless the context requires otherwise, the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated step or element or integer or group of steps or elements or integers, but not the exclusion of any other step or element or integer or group of steps, elements or integers. Thus, in the context of this specification, the term “comprising” is used in an inclusive sense and thus should be understood as meaning “including principally, but not necessarily solely”.

Claims

1. A system (100) to prevent bot detection by replicating human like browsing behaviour, comprising: at least one server in a computing network (112); and a Human Like Browsing Behaviour Identity Engine (102) in communication with the at least one server in the computing network, characterized in that, the Human Like Browsing Behaviour Identity Engine (102) comprises: at least one Internet Protocol Number Randomizer (104) for randomly changing data harvesting bot referral Internet Protocol; at least one Time and Data Cache Randomizer (106) for delaying data harvesting bot at a random interval of time; at least one Spiral Keywords and Pages Rotation Module (108) for changing keywords and pages in spiral rotation; and at least one Browser User Agent Randomizer (110) for randomly changing browser header.
2. The system (100) according to Claim 1, wherein the at least one Internet Protocol Number Randomizer (104) further comprises a plurality of registered harvest bot proxies.
3. The system (100) according to Claim 2, wherein the plurality of registered harvest bot proxies further having its own Internet Protocol address for each proxy.
4. A method (400) to prevent bot detection by replicating human like browsing behaviour, the method is characterized by the steps of: randomly changing Internet Protocol address of data harvesting bot referral (402); delaying data harvesting bot at a random interval of time (404); changing keywords and pages in a spiral rotation (406); randomly changing header of browser (408); and allowing data harvesting bot to access a targeted server (410).
5. The method (400) according to Claim 4, wherein randomly changing Internet Protocol address of data harvesting bot referral (402) further comprises steps of (500): accessing a plurality of harvest bot proxies with its own Internet Protocol address for each proxy (502); and randomly selecting any harvest bot proxy to access the targeted server (504).
6. The method (400) according to Claim 4, wherein delaying data harvesting bot at a random interval of time (404) further comprises steps of (600): reading cache expiration time setting (602); checking if data request to target server has been cached (604); if data request to target server has been cached, check difference of cached data time and compare difference of cached data time to cache expiration time setting to determine if cache data time is less than cache expiration time setting (606); if cached data time is less than cache expiration time setting, reusing cached data time and new data will not be requested from targeted server (610); else obtaining new data from targeted server and storing new data obtained as new cached data (608); else obtaining new data from targeted server and storing new data obtained as new cached data (608).
7. The method (400) according to Claim 6, wherein checking if data request to target server have been cached (604) further comprises delaying time of request of each new data randomly at a range of time predetermined by user.
8. The method (400) according to Claim 4, wherein changing keywords and pages in a spiral rotation (406) further comprises changing data harvesting bot to another targeted server with a different page within a predetermined time set by user.
9. The method (400) according to Claim 4, wherein randomly changing header of browser (408) further comprises steps of (700): randomizing rendering engine in Browser User Agent Randomizer (702); obtaining randomize version of browser from rendering engine (704); randomizing system and browser information (706); randomizing browser’s platform (708); randomizing platform details of browser’s platform (710); and randomizing extensions of browser to indicate specific enhancements that are available in browser (712).

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2019005731 2019-09-27

Publications (1)

Publication Number Publication Date
WO2021060973A1 (en) 2021-04-01

Family

ID=75165303

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2020/050090 WO2021060973A1 (en) 2019-09-27 2020-09-22 A system and method to prevent bot detection

Country Status (1)

Country Link
WO (1) WO2021060973A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110131652A1 (en) * 2009-05-29 2011-06-02 Autotrader.Com, Inc. Trained predictive services to interdict undesired website accesses
US10182046B1 (en) * 2015-06-23 2019-01-15 Amazon Technologies, Inc. Detecting a network crawler
US20180276304A1 (en) * 2015-09-21 2018-09-27 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd Advanced computer implementation for crawling and/or detecting related electronically catalogued data using improved metadata processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIAHAO WU, ‘Anti-Scraping Techniques You May Encounter’, 09 August 2019, [retrieved on 2020.12.15]. Retrieved from <URL: https://www.octoparse.com/ blog/5-anti-scraping-techniques-you-may-encounter> pages 1-4 *
YANG Y.; VLAJIC N.; NGUYEN U. T.: "Web bots that mimic human browsing behavior on previously unvisited web-sites: Feasibility study and security implications", 2015 IEEE CONFERENCE ON COMMUNICATIONS AND NETWORK SECURITY (CNS), IEEE, 28 September 2015 (2015-09-28), pages 757 - 758, XP032825490, DOI: 10.1109/CNS.2015.7346921 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20866986

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20866986

Country of ref document: EP

Kind code of ref document: A1