WO2021060973A1

WO2021060973A1 - A system and method to prevent bot detection

Info

Publication number: WO2021060973A1
Application number: PCT/MY2020/050090
Authority: WO
Inventors: Amru Yusrin Bin AMRUDDIN; Rosnin Bin MUSTAFFA; Muhammad Awis Jamaluddin Bin JOHARI; Mohd Marzuq Ikram Bin MOHD HELMI
Original assignee: Mimos Berhad
Priority date: 2019-09-27
Filing date: 2020-09-22
Publication date: 2021-04-01

Abstract

The present invention provides a system (100) and a method (400) to prevent bot detection by replicating human like browsing behaviour. In particular, a Human Like Browsing Behaviour Identity Engine (102) is in communication with at least one server in a computing network of the system of the present invention. The Human Like Browsing Behaviour Identity Engine (102) comprises at least one Internet Protocol Number Randomizer (104) for randomly changing data harvesting bot referral Internet Protocol; at least one Time and Data Cache Randomizer (106) for delaying data harvesting bot at a random interval of time; at least one Spiral Keywords and Pages Rotation Module (108) for changing keywords and pages in spiral rotation; and at least one Browser User Agent Randomizer (110) for randomly changing browser header.

Description

A SYSTEM AND METHOD TO PREVENT BOT DETECTION

FIELD OF INVENTION

The present invention relates to a system and method to prevent bot detection by replicating human like browsing behaviour. In particular, the present invention provides a Human Like Browsing Behaviour Identity Engine which is in communication with at least one server in a computing network of the system, so that a server firewall or captcha does not block or deny access to a third party.

BACKGROUND ART

Data harvesting, or web scraping, has always been a concern for website operators and data publishers. Data harvesting is the same process as data mining or data extraction: data or information is scraped from websites by an automated bot. In other words, a small script, also known as a malicious bot, is used to extract large amounts of data from websites for use for other purposes. Data harvesting usually targets a database in particular; it collects information through the database and then downloads each record in the database for a particular purpose.

In data harvesting, data is extracted from a targeted website. Such data can be any type of data that the harvester wants, such as simple text found on the page or within the page’s code, directory information from a retail site, a series of images and videos, or all of those items at once. Nevertheless, a data harvesting bot is often blocked by a server firewall or by a captcha system to ensure the security of the data. The data harvesting bot is usually detected by the server because the bot presents a similar identity on every access to the target server.

United States Patent No. US 7139747 B1 (hereinafter referred to as US 747 B1 Patent) entitled “System and method for distributed web crawling” having a filing date of 3 November 2000 (Patentee: Hewlett-Packard Development Company, L.P., Houston, TX (US)) provides for efficient downloading of data set addresses from among a plurality of host computers, using a plurality of web crawlers. Each web crawler as disclosed in US 747 B1 Patent identifies URLs in data sets downloaded by web crawler, and identifies the host computer identifier within each of such URL. The host computer identifier for each URL is mapped to the web crawler identifier of one of the web crawlers. Each web crawler sends URLs to the other web crawlers for processing, and each web crawler receives URL's from the other web crawlers for processing. Further, each web crawler filters the URLs assigned to it by comparing them against a database of URLs already known by the web crawler and removing the already known URLs. If a URL is not already known to the web crawler, the data set corresponding to the URL is scheduled for downloading.

International Patent Application Publication No. WO 2015/021912 A1 (hereinafter referred to as the “WO 912 A1 Publication”) entitled “Incrementally retrieving data for objects to provide a desired level of detail” having a filing date of 12 August 2014 (Patentee: International Business Machines Corporation) provides a computer-implemented method for retrieving data and metadata according to data harvesting criteria into an indexed repository. The invention disclosed in the WO 912 A1 Publication also provides a user interface which allows a user to define data harvesting criteria for adjusting a subset of data and metadata on an indexed repository. Further, when the user defines the data harvesting criteria through the user interface, the subset of data and metadata on the indexed repository is adjusted according to those criteria.

United States Patent Application Publication No. US 2017/0098100 A1 (hereinafter referred to as the US 100 A1 Publication) entitled “Consent handling during data harvesting” having a filing date of 18 November 2015 (Patentee: SAP SE, Walldorf (DE)) describes a method that can include receiving social media data associated with a user identifier and a first country code. A stored consent configuration rule can specify whether to store the social media data anonymously or non-anonymously. The consent configuration rule of the US 100 A1 Publication can be associated with a second country code. It can further be determined whether the second country code associated with the consent configuration rule matches the first country code associated with the social media data.

With reference to the above-mentioned disclosures, there is indeed a need for a system and method that is able to prevent bot detection by replicating human like browsing behaviour.

SUMMARY OF INVENTION

The present invention relates to a system and method to prevent bot detection by replicating human like browsing behaviour. In particular, a Human Like Browsing Behaviour Identity Engine (102) is in communication with at least one server in a computing network of the system of the present invention.

One aspect of the invention provides a system (100) to prevent bot detection by replicating human like browsing behaviour comprising a Human Like Browsing Behaviour Identity Engine (102) in communication with at least one server in a computing network. The Human Like Browsing Behaviour Identity Engine (102) comprises at least one Internet Protocol Number Randomizer (104) for randomly changing data harvesting bot referral Internet Protocol; at least one Time and Data Cache Randomizer (106) for delaying data harvesting bot at a random interval of time; at least one Spiral Keywords and Pages Rotation Module (108) for changing keywords and pages in spiral rotation; and at least one Browser User Agent Randomizer (110) for randomly changing browser header.

Another aspect of the invention provides that the at least one Internet Protocol Number Randomizer (104) further comprises a plurality of registered harvest bot proxies.

A further aspect of the invention provides that each of the plurality of registered harvest bot proxies has its own Internet Protocol address.

A further aspect of the invention provides a method (400) to prevent bot detection by replicating human like browsing behaviour. The method comprises the steps of randomly changing Internet Protocol address of data harvesting bot referral (402); delaying data harvesting bot at a random interval of time (404); changing keywords and pages in a spiral rotation (406); randomly changing header of browser (408); and allowing data harvesting bot to access a targeted server (410).

Yet another aspect of the invention provides that the step for randomly changing Internet Protocol address of data harvesting bot referral (402) further comprises steps of (500) accessing a plurality of harvest bot proxies with its own Internet Protocol address for each proxy (502); and randomly selecting any harvest bot proxy to access a targeted server (504).

Still another aspect of the invention provides that the step for delaying data harvesting bot at a random interval of time (404) further comprises steps of (600) reading cache expiration time setting (602); checking if data request to target server has been cached (604); and, if data request to target server has been cached, checking the difference of cached data time and comparing that difference to the cache expiration time setting to determine if the cached data time is less than the cache expiration time setting (606). If the cached data time is less than the cache expiration time setting, the cached data is reused and new data is not requested from the targeted server (610); otherwise new data is obtained from the targeted server and stored as new cached data (608). If data request to target server has not been cached, new data is obtained from the targeted server and stored as new cached data (608).

A further aspect of the invention provides that the step of checking if data request to target server has been cached (604) further comprises delaying time of request of each new data randomly at a range of time predetermined by user.

Yet another aspect of the invention provides that the step of changing keywords and pages in a spiral rotation (406) further comprises changing data harvesting bot to another targeted server with a different page within a predetermined time set by user.

Still another aspect of the invention provides that the step of randomly changing header of browser (408) further comprises steps of (700) randomizing rendering engine in Browser User Agent Randomizer (702); obtaining randomized version of browser from rendering engine (704); randomizing system and browser information (706); randomizing browser’s platform (708); randomizing details of browser’s platform (710); and randomizing extensions of browser to indicate specific enhancements that are available in browser (712).

The present invention consists of features and a combination of parts hereinafter fully described and illustrated in the accompanying drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.

BRIEF DESCRIPTION OF ACCOMPANYING DRAWINGS

To further clarify various aspects of some embodiments of the present invention, a more particular description of the invention will be rendered by references to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the accompanying drawings.

FIG. 1.0 illustrates the general architecture of the system of the present invention.

FIG. 2.0 illustrates the IP number randomizer module of the present invention.

FIG. 3.0 illustrates a mechanism of spiral keywords and pages rotation.

FIG. 4.0 is a flowchart illustrating the methodology of the present invention.

FIG. 5.0 is a flowchart illustrating the steps of randomly changing Internet Protocol address of data harvesting bot referral.

FIG. 6.0 is a flowchart illustrating the steps of delaying data harvesting bot at a random interval of time.

FIG. 7.0 is a flowchart illustrating the steps of randomly changing header of browser.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention relates to a system and method to prevent bot detection by replicating human like browsing behaviour. In particular, a Human Like Browsing Behaviour Identity Engine (102) is in communication with at least one server in a computing network of the system of the present invention. Hereinafter, this specification will describe the present invention according to the preferred embodiments. It is to be understood that limiting the description to the preferred embodiments of the invention is merely to facilitate discussion of the present invention, and it is envisioned that modifications may be made without departing from the scope of the appended claims.

Reference is first made to FIG. 1.0 and FIG. 2.0 respectively. FIG. 1.0 illustrates the general system architecture of the present invention while FIG. 2.0 illustrates the IP number randomizer module of the present invention. As illustrated in FIG. 1.0, the system (100) to prevent bot detection by replicating human like browsing behaviour comprises a Human Like Browsing Behaviour Identity Engine (102) in communication with at least one server (112) in a computing network. The Human Like Browsing Behaviour Identity Engine (102) comprises at least one Internet Protocol Number Randomizer (104) for randomly changing data harvesting bot referral Internet Protocol; at least one Time and Data Cache Randomizer (106) for delaying data harvesting bot at a random interval of time; at least one Spiral Keywords and Pages Rotation Module (108) for changing keywords and pages in spiral rotation; and at least one Browser User Agent Randomizer (110) for randomly changing browser header. As illustrated in FIG. 2.0, the at least one Internet Protocol Number Randomizer (104) further comprises a plurality of registered harvest bot proxies, each of which has its own Internet Protocol address. A harvest bot proxy is chosen randomly by the data harvesting bot to access the targeted server (112), so that each data harvest presents a different IP address. The IP address for each data harvest is determined automatically from the available proxy servers: it is randomized among the IP addresses of the proxy servers.

Reference is now made to FIG. 3.0 which illustrates a mechanism of spiral keywords and pages rotation. As illustrated in FIG. 3.0, the spiral keywords and pages rotation module prevents the data harvesting bot from crawling the same targeted server for the searched keywords for a long time, by changing it from one targeted server to another targeted server within a time frame. For example, a web bot is assigned to grab all data based on the keyword “Malaysia” in Google search, Yahoo search, Bing search, Facebook search and so on. A page here means a page of the paginated search results in a search engine. For example, if a user searches on Google, the pagination appears at the bottom of the search results.

Table 1.0

Table 2.0

Table 1.0 illustrates the results of the currently available method to grab data on the targeted server. This is the normal web bot method, where the web bot grabs all data in the same category or source before moving on. As illustrated in Table 1.0, the data harvesting bot stays in the Web Google server for 2 seconds to obtain data containing “Malaysia” in pages 1 and 2 before changing to another targeted server such as Blogspot or Facebook. For example, if there are a total of 100 pages, the data harvesting bot will stay in the same target server for 100 seconds, during which the firewall and captcha system will block the action.

Table 2.0 illustrates the results of the method of the present invention to grab data on the targeted server. As illustrated in Table 2.0, over 4 seconds the data harvesting bot does not stay in the Web Google server to obtain data containing “Malaysia” in pages 1 and 2. Instead, the method of the present invention has the data harvesting bot jump between targeted servers such as Google, Blogspot, Facebook and Bing, spending each second at a different targeted server with a different page. There is a delay of about 3 seconds before the data harvesting bot returns to the same targeted server. Because of this delay, the data harvesting bot is not blocked by the firewall or captcha system.
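The rotation described for Table 2.0 can be sketched as a simple interleaved schedule. This is a minimal illustration only, not the specification's implementation: the function name `spiral_schedule`, the choice to advance the page once per full cycle of servers, and the one-request-per-slot model are all assumptions made here for illustration.

```python
from itertools import product
from typing import Iterator, List, Tuple


def spiral_schedule(servers: List[str], pages: int) -> Iterator[Tuple[str, int]]:
    """Yield (server, page) pairs, visiting every server once per page.

    Hypothetical helper: the page number is the outer loop, so two
    consecutive slots never target the same server (given 2+ servers).
    """
    for page, server in product(range(1, pages + 1), servers):
        yield server, page


# Server names taken from the example in the description.
schedule = list(spiral_schedule(["Google", "Blogspot", "Facebook", "Bing"], pages=2))
for slot, (server, page) in enumerate(schedule, start=1):
    print(f"slot {slot}: {server} page {page}")
```

With four servers and one slot per second, three other servers are visited between two visits to the same server, which corresponds to the delay of about 3 seconds mentioned above.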

Reference is now made to FIG. 4.0 which is a flowchart illustrating the methodology of the present invention. As illustrated in FIG. 4.0, the method (400) to prevent bot detection by replicating human like browsing behaviour comprises the steps of first randomly changing Internet Protocol address of data harvesting bot referral (402). Thereafter, the data harvesting bot is delayed at a random interval of time (404) and keywords and pages are changed in a spiral rotation (406) while the header of the browser is changed randomly (408). The delay of the data harvesting bot is configured in the bot settings through the expiry time of the cached file. The step of changing keywords and pages in a spiral rotation (406) further comprises changing data harvesting bot to another targeted server with a different page within a predetermined time set by user. Subsequently, the data harvesting bot is allowed to access a targeted server (410). Public servers such as Google, Yahoo and Facebook allow the public to use their search engines subject to a condition of access, namely that bots are not allowed. A bot is usually detected when its requests come from the same IP address (which denotes the same server), carry the same browser header, or repeat the same keywords or page requests.

Reference is now made to FIG. 5.0 which is a flowchart illustrating the steps of randomly changing Internet Protocol address of data harvesting bot referral. Randomly changing Internet Protocol address of data harvesting bot referral (402) further comprises steps of (500) accessing a plurality of harvest bot proxies with its own Internet Protocol address for each proxy (502) and randomly selecting any harvest bot proxy to access a targeted server (504).
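Steps (502) and (504) amount to a uniform random choice over a registered list. The following is a minimal sketch under stated assumptions: the name `select_proxy`, the `REGISTERED_PROXIES` tuple and its addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical placeholders, and no network access is performed.

```python
import random
from typing import Optional, Sequence

# Hypothetical registered harvest bot proxies, each with its own IP address.
# The addresses are placeholders from the documentation-only range 192.0.2.0/24.
REGISTERED_PROXIES = ("192.0.2.10:8080", "192.0.2.11:8080", "192.0.2.12:8080")


def select_proxy(proxies: Sequence[str], rng: Optional[random.Random] = None) -> str:
    """Pick one registered proxy uniformly at random (step 504)."""
    if not proxies:
        raise ValueError("no registered harvest bot proxies")
    chooser = rng if rng is not None else random.Random()
    return chooser.choice(proxies)


print(select_proxy(REGISTERED_PROXIES))
```

Passing an explicit `random.Random` instance is a design choice made here so that the selection can be seeded and tested; the specification does not prescribe a particular random source.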

Reference is now made to FIG. 6.0 which is a flowchart illustrating the steps of delaying data harvesting bot at a random interval of time. As illustrated in FIG. 6.0, delaying data harvesting bot at a random interval of time (404) further comprises steps of (600) first reading the cache expiration time setting (602). Thereafter, it is checked whether the data request to the target server has been cached (604). If the data request has been cached, the difference of cached data time, being the time elapsed from the cached data time to the current system time, is compared with the cache expiration time setting to determine if it is less than the cache expiration time setting (606). If the cached data time is less than the cache expiration time setting, the cached data is reused and new data is not requested from the targeted server (610); if it is more than the cache expiration time setting, new data is obtained from the targeted server and stored as new cached data (608). If the data request to the target server has not been cached (604), new data is obtained from the targeted server and stored as new cached data (608). The step of checking if data request to target server has been cached (604) further comprises delaying time of request of each new data randomly at a range of time predetermined by user.
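The decision flow of FIG. 6.0 can be sketched as follows. This is an illustrative sketch only: the function `get_data`, the in-memory `Cache` dictionary, the injectable `now`, `sleep` and `rng` parameters, and the placement of the random delay immediately before a new fetch are assumptions made here, not details given in the specification.

```python
import random
import time
from typing import Callable, Dict, Optional, Tuple

# Request key -> (time the data was cached, cached data). Hypothetical structure.
Cache = Dict[str, Tuple[float, str]]


def get_data(
    key: str,
    fetch: Callable[[str], str],
    cache: Cache,
    expiry_seconds: float,                      # cache expiration time setting (602)
    delay_range: Tuple[float, float] = (0.0, 0.0),
    now: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> str:
    entry = cache.get(key)                      # has the request been cached? (604)
    if entry is not None:
        cached_at, data = entry
        age = now() - cached_at                 # difference of cached data time (606)
        if age < expiry_seconds:
            return data                         # reuse cached data, no new request (610)
    # Not cached, or cache expired: wait a random time within the
    # user-defined range, then obtain and store new data (608).
    chooser = rng if rng is not None else random.Random()
    sleep(chooser.uniform(*delay_range))
    data = fetch(key)
    cache[key] = (now(), data)
    return data
```

Here `fetch` stands in for whatever actually retrieves data from the targeted server; it is supplied by the caller so that the sketch stays self-contained.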

Reference is now made to FIG. 7.0 which is a flowchart illustrating the steps of randomly changing header of browser. As illustrated in FIG. 7.0, the step of randomly changing header of browser (408) further comprises steps of (700) randomizing rendering engine in Browser User Agent Randomizer (702), whereby the rendering engine refers to Firefox, Chrome, Safari, Opera, or any other browser. The bot can be randomly assigned to grab data through any available browser installed in Windows. Thereafter, a randomized version of the browser is obtained from the rendering engine (704). Subsequently, system and browser information is randomized (706); browser’s platform is randomized (708); details of browser’s platform are randomized (710); and extensions of browser are randomized to indicate specific enhancements that are available in browser (712). The current browser user agent identity is divided into six segments, which are the rendering engine segment, version segment, system and browser information segment, platform segment, platform details segment and extensions segment. During this step, the browser user agent randomizer module (110) generates a randomized string for each segment of the user agent identity.

Examples of browser user agent randomizer (110) output from the same data harvesting bot are shown below:
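The six-segment assembly described above can be sketched as one random pick per segment followed by string formatting. This is a minimal sketch, not the specification's implementation: the `SEGMENTS` pools are hypothetical and are populated only with values that appear in the example strings of this specification, and the function name `random_user_agent` is an assumption.

```python
import random
from typing import Dict, List, Optional

# Hypothetical per-segment pools; every value is copied from the
# example user agent strings given in this specification.
SEGMENTS: Dict[str, List[str]] = {
    "engine": ["Mozilla", "Mosaic"],                          # step 702
    "version": ["5.0", "1.0"],                                # step 704
    "system_info": [                                          # step 706
        "iPad; U; CPU OS 3_2_1 like Mac OS X; en-us",
        "compatible; MSIE 9.0; Windows NT 6.0; SV1",
    ],
    "platform": ["AppleWebKit/531.21.10", "WebKit/100"],      # step 708
    "platform_details": ["KHTML, like Gecko", "Gecko"],       # step 710
    "extensions": ["Mobile/7B405", "XBOX"],                   # step 712
}

TEMPLATE = "{engine}/{version} ({system_info}) {platform} ({platform_details}) {extensions}"


def random_user_agent(rng: Optional[random.Random] = None) -> str:
    """Build a user agent string by picking each segment independently."""
    chooser = rng if rng is not None else random.Random()
    picked = {name: chooser.choice(pool) for name, pool in SEGMENTS.items()}
    return TEMPLATE.format(**picked)


print(random_user_agent())
```

Because each segment is drawn independently, the result can combine values that no real browser would send together; whether the randomizer should constrain combinations is not stated in the specification.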

[Rendering Engine]/[version]([system and browser information])[platform]([platform details]) [extensions]

Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us)AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405

Mosaic/1.0 (compatible; MSIE 9.0; Windows NT 6.0; SV1;.NET CLR 1.0.3705;.NET CLR 1.1.4322) WebKit/100 (Gecko) XBOX

Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us)AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405

Mosaic/1.0 (compatible; MSIE 9.0; Windows NT 6.0; SV1;.NET CLR 1.0.3705;.NET CLR 1.1.4322) WebKit/100 (Gecko) XBOX

The present invention provides a Human Like Browsing Behaviour Identity Engine having an IP number randomizer module, a time and data cache randomizer module, a spiral keywords and pages rotation module and a browser user agent randomizer module in the system. The Human Like Browsing Behaviour Identity Engine mimics human like behaviour in web browsing by randomizing the IP address used to access the targeted server through the IP number randomizer; caching content and delaying access to the targeted server through the time and data cache randomizer; rotating keywords and pages across targeted servers through the spiral keywords and pages rotation module; and generating a randomized browser version header through the browser user agent randomizer (110).

Unless the context requires otherwise or specifically stated to the contrary, integers, steps or elements of the invention recited herein as singular integers, steps or elements clearly encompass both singular and plural forms of the recited integers, steps or elements. Throughout this specification, unless the context requires otherwise, the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated step or element or integer or group of steps or elements or integers, but not the exclusion of any other step or element or integer or group of steps, elements or integers. Thus, in the context of this specification, the term “comprising” is used in an inclusive sense and thus should be understood as meaning “including principally, but not necessarily solely”.

Claims

1. A system (100) to prevent bot detection by replicating human like browsing behaviour, comprising: at least one server in a computing network (112); and a Human Like Browsing Behaviour Identity Engine (102) in communication with the at least one server in the computing network, characterized in that, the Human Like Browsing Behaviour Identity Engine (102) comprises: at least one Internet Protocol Number Randomizer (104) for randomly changing data harvesting bot referral Internet Protocol; at least one Time and Data Cache Randomizer (106) for delaying data harvesting bot at a random interval of time; at least one Spiral Keywords and Pages Rotation Module (108) for changing keywords and pages in spiral rotation; and at least one Browser User Agent Randomizer (110) for randomly changing browser header.

2. The system (100) according to Claim 1, wherein the at least one Internet Protocol Number Randomizer (104) further comprises a plurality of registered harvest bot proxies.

3. The system (100) according to Claim 2, wherein the plurality of registered harvest bot proxies further having its own Internet Protocol address for each proxy.

4. A method (400) to prevent bot detection by replicating human like browsing behaviour, the method is characterized by the steps of: randomly changing Internet Protocol address of data harvesting bot referral (402); delaying data harvesting bot at a random interval of time (404); changing keywords and pages in a spiral rotation (406); randomly changing header of browser (408); and allowing data harvesting bot to access a targeted server (410).

5. The method (400) according to Claim 4, wherein randomly changing Internet Protocol address of data harvesting bot referral (402) further comprises steps of (500): accessing a plurality of harvest bot proxies with its own Internet Protocol address for each proxy (502); and randomly selecting any harvest bot proxy to access the targeted server (504).

6. The method (400) according to Claim 4, wherein delaying data harvesting bot at a random interval of time (404) further comprises steps of (600): reading cache expiration time setting (602); checking if data request to target server has been cached (604); if data request to target server has been cached, checking difference of cached data time and comparing difference of cached data time to cache expiration time setting to determine if cached data time is less than cache expiration time setting (606); if cached data time is less than cache expiration time setting, reusing cached data such that new data will not be requested from targeted server (610); else obtaining new data from targeted server and storing new data obtained as new cached data (608); else, if data request to target server has not been cached, obtaining new data from targeted server and storing new data obtained as new cached data (608).

7. The method (400) according to Claim 6, wherein checking if data request to target server has been cached (604) further comprises delaying time of request of each new data randomly at a range of time predetermined by user.

8. The method (400) according to Claim 4, wherein changing keywords and pages in a spiral rotation (406) further comprises changing data harvesting bot to another targeted server with a different page within a predetermined time set by user.

9. The method (400) according to Claim 4, wherein randomly changing header of browser (408) further comprises steps of (700): randomizing rendering engine in Browser User Agent Randomizer (702); obtaining randomized version of browser from rendering engine (704); randomizing system and browser information (706); randomizing browser’s platform (708); randomizing platform details of browser’s platform (710); and randomizing extensions of browser to indicate specific enhancements that are available in browser (712).