WO2023025552A1 - Adaptive data collection optimization - Google Patents
- Publication number
- WO2023025552A1 (PCT/EP2022/071835)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- scraping
- parameters
- scraping request
- extractor
- request parameters
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Definitions
- The disclosure belongs to the area of web scraping and data collection technologies. The methods and systems detailed herein aim to optimize web scraping processes, with the optimization achieved by employing machine learning algorithms.
- Web scraping (also known as screen scraping, data mining, or web harvesting) in its most general sense is the automated gathering of data from the internet. More technically, it is the practice of gathering data from the internet through any means other than a human using a web browser or a program interacting with an application programming interface (API). Web scraping is usually accomplished by executing a program that queries a web server and requests data automatically, then parses the data to extract the requested information.
- Web scrapers are programs written for web scraping that can have a significant advantage over other means of accessing information, like web browsers. The latter are designed to present the information in a readable way for humans, whereas web scrapers are excellent at collecting and processing large amounts of data quickly. Rather than opening one page at a time through a monitor (as web browsers do), web scrapers are able to collect, process, aggregate and present large databases consisting of thousands or even millions of pages at once.
- A website may offer another automated way to transfer its structured data from one program to another: an API.
- A program makes a request to an API via Hypertext Transfer Protocol (HTTP) for some type of data, and the API returns this data from the website in a structured form. The API serves as a medium to transfer the data.
- Using APIs is not considered web scraping, since the API is offered by the website (or a third party) and removes the need for web scrapers.
- An API can transfer well-formatted data from one program to another, and using it is easier than building a web scraper to get the same data.
- APIs are not always available for the needed data.
- APIs often impose volume and rate restrictions and limit the types and the format of the data. Thus, a user would use web scraping for data for which an API does not exist or which is restricted in any way by the API.
- Web scraping consists of the following steps: retrieving Hypertext Markup Language (HTML) data from a website; parsing the data for target information/data; saving the target information/data; and repeating the process, if needed, on another page.
- A program that is designed to do all of these steps is called a web scraper.
- A related program, a web crawler (also known as a web spider), is a program or an automated script that performs the first task, i.e. it navigates the web in an automated manner to retrieve the raw HTML data of the accessed web sites (this process is also known as indexing).
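The four steps above can be sketched in a few lines. This is a minimal illustration rather than the disclosed implementation: the `PAGES` dictionary is a hypothetical stand-in for HTTP retrieval, and the class and function names are assumptions made for the example.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collects the page title (the 'target data') and all link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(pages, start):
    """Walk `pages` (url -> HTML, a stand-in for real HTTP retrieval)."""
    saved, queue, seen = {}, [start], set()
    while queue:
        url = queue.pop(0)
        if url in seen or url not in pages:
            continue
        seen.add(url)
        html = pages[url]            # step 1: retrieve the HTML
        parser = TitleAndLinkParser()
        parser.feed(html)            # step 2: parse for target data
        saved[url] = parser.title    # step 3: save the target data
        queue.extend(parser.links)   # step 4: repeat on linked pages
    return saved


# Hypothetical in-memory "website" used instead of network access.
PAGES = {
    "/a": "<html><head><title>Page A</title></head><body><a href='/b'>b</a></body></html>",
    "/b": "<html><head><title>Page B</title></head><body><a href='/a'>a</a></body></html>",
}
result = scrape(PAGES, "/a")
```

A crawler, in the sense used above, would correspond to the retrieval and link-following parts only, without the parse-and-save steps.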
- Scraping activity may be performed/executed by multiple types of scraping applications, generally categorized as follows:
- Browser: an application executed within a computing device, usually in the context of an end-user session, with functionality sufficient to accept the user’s request, pass it to the target Web server, process the response from the Web server and present the result to the user.
- A browser is considered a user-side scripting enabled tool, e.g. capable of executing and interpreting JavaScript code.
- Headless browser: a web browser without a graphical user interface (GUI).
- Headless browsers provide automated control of a web page in an environment similar to popular web browsers but are executed via a command-line interface or using network communication. They are particularly useful for testing web pages as they are able to render and understand HTML the same way a browser would, including styling elements such as page layout, color, font selection and execution of JavaScript and AJAX which are usually not available when using other testing methods.
- Two major use cases can be identified: a) scripted web page tests, with the purpose of identifying bugs, where a close resemblance to user activity is necessary; b) web scraping, where resemblance to user activity is mandatory to avoid blocking, i.e. the request should possess all the attributes of an organic web browsing request.
- A headless browser is considered a user-side scripting enabled tool, e.g. capable of executing and interpreting JavaScript code.
- Command line tools: GUI-less applications that allow a user to generate and submit a Web request through a command line terminal, e.g. cURL. Some tools in this category may have a GUI wrapped on top, but the graphical elements would not cover displaying the result of the HTTP request. Command line tools are limited in their functionality in that they are not capable of executing and interpreting JavaScript code.
- Programming language library: a collection of implementations of behavior, written in terms of a language, that has a well-defined interface by which the behavior is invoked. For instance, when particular HTTP methods are to be invoked for executing scraping requests, the scraping application can use a library containing said methods to make system calls instead of implementing those system calls over and over again within the program code.
- The behavior is provided for reuse by multiple independent programs, where the program invokes the library-provided behavior via a mechanism of the language. Therefore, the value of a library lies in the reuse of the behavior.
- When a program invokes a library, it gains the behavior implemented inside that library without having to implement that behavior itself. Libraries encourage the sharing of code in a modular fashion and ease the distribution of the code.
- Programming language libraries are limited in their functionality in that they are not capable of executing and interpreting JavaScript code, unless there is another tool capable of user-side scripting, for which the library is a wrapper.
- The scraping application types listed above vary in the technical capabilities they possess, often due to the very purpose the application has been developed for. When sending the initial request to the target Web server, all of the listed types of scraping applications pass the parameters mandatory for submitting and processing a web request, e.g. HTTP parameters (headers and cookies), and declare the version of the HTTP protocol they support and intend to communicate in. Transmission Control Protocol (TCP) parameters (e.g. TCP window size and others) are disclosed while initiating the TCP session underlying the HTTP request.
- Browsers and headless browsers can process the JavaScript files obtained within the web server’s response, e.g. submit configuration settings through JavaScript when requested, while command line utilities and programming language libraries are incapable of doing that.
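As a rough illustration of the HTTP-level parameters mentioned above (headers and cookies), the sketch below uses Python's standard `urllib.request` to assemble a request object without sending it. The URL, user-agent string, cookie values and the `build_request` helper are hypothetical. Lower-level details such as the HTTP version or TCP window size are negotiated by the networking stack and are not set here.

```python
from urllib.request import Request


def build_request(url, user_agent, cookies):
    """Assemble (but do not send) a GET request carrying headers and cookies."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {"User-Agent": user_agent, "Accept-Language": "en-US,en;q=0.9"}
    if cookie_header:
        headers["Cookie"] = cookie_header
    return Request(url, headers=headers, method="GET")


# Hypothetical values for illustration only.
request = build_request(
    "https://example.com/", "ExampleScraper/0.1", {"session": "abc123"}
)
```

A library-based scraper like this one only sends what it is told to send; it does not execute JavaScript returned by the server, which is the limitation described above.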
- Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.
- The goal of machine learning technology is to optimize the performance of a system when handling new instances of data through user-defined logic for a given environment. To achieve this goal effectively, machine learning depends heavily on statistics and computer science.
- Statistical methods provide machine learning algorithms ways to infer conclusions from data.
- Computer science methods give machine learning algorithms the computing power to solve problems, including useful large-scale computational architectures and algorithms for capturing, manipulating, indexing, combining and performing the predictions with data.
- Machine learning technologies are mainly applied to analysis, prediction, permission control, and personalization.
- For example, machine learning technologies are used to predict the privacy preferences of mobile users when using smart applications.
- Machine learning has become an important component of the growing field of computer science. Through the use of statistical methods, machine learning algorithms are trained to make classifications or predictions, finding key insights within data sets. These insights subsequently drive decision making within applications and businesses, ideally improving the development metrics.
- An ‘algorithm’ in machine learning is a procedure that is run on data to create a machine learning ‘model’.
- Machine learning algorithms can learn and recognize patterns present within data sets. Linear Regression, Logistic Regression, Decision Tree and Artificial Neural Network are a few examples of machine learning algorithms. A few exemplary features of machine learning algorithms are: a) machine learning algorithms can be described using math and pseudocode; b) the efficiency of machine learning algorithms can be analyzed and described; c) machine learning algorithms can be implemented with any one of a range of modern programming languages.
- A ‘model’ in machine learning is the output of a machine learning algorithm.
- A model represents what was learnt by a machine learning algorithm.
- A model is the result constructed after running a machine learning algorithm on training data sets and represents the rules, numbers, and any other algorithm-specific data structures required to make predictions.
- Supervised learning is the type of machine learning where a problem is defined and the system is provided with multiple examples of how the problems may be solved through curated and validated examples.
- Unsupervised learning does not work on improving itself based on “experience” to solve clearly-defined problems. Instead, this form of machine learning is designed to seek out and identify patterns within large sets of incongruous data.
- Unsupervised learning attempts to group (cluster) the data based on various attributes that are recognized from processing. This, in turn, sets the stage for humans to analyze the processed data, recognize non-obvious correlations between elements, and establish relationships between vast amounts of data (wherever applicable).
- The third type of machine learning, reinforcement learning, is about allowing computer systems to experiment with all possible means and methods for executing a task, scoring those different iterations based on clearly-defined performance criteria, and then choosing the method with the best score for deployment.
- The computer system is rewarded with points for meeting success criteria and penalized for failing some or all of them in each reinforcement iteration.
- A website can recognize the bot-like behaviour of web scrapers in several ways.
- One such way is to monitor the number and durations of requests, i.e., the rate of action (or actions over time). This is because humans typically perform fewer actions/requests than a bot or a computer application. Therefore, by monitoring the rate of actions, websites can detect and block any bot-like behaviour originating from an IP address.
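The rate-of-action check described above can be sketched as a sliding-window counter per IP address. This is an illustrative assumption about how a website might implement it, not a description of any particular site; the `RateMonitor` class, its thresholds and the example IP address are hypothetical.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags an IP address whose request rate exceeds a threshold.

    Timestamps are expected in non-decreasing order (seconds).
    """

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)

    def is_bot_like(self, ip, timestamp):
        """Record one request and report whether the IP is over the limit."""
        history = self._history[ip]
        history.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while history and timestamp - history[0] >= self.window_seconds:
            history.popleft()
        return len(history) > self.max_requests


# Hypothetical limit: at most 3 requests per 10 seconds.
monitor = RateMonitor(max_requests=3, window_seconds=10.0)
flags = [monitor.is_bot_like("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0)]
```

In this sketch the fourth request inside the window is the first one flagged, and the same address is no longer flagged once its earlier requests age out of the window.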
- Web scrapers often face financial losses when several scraping attempts fail or are blocked. Therefore, to circumvent such instances, web scrapers need to intelligently choose multiple parameters or strategies to execute each scraping request successfully.
- Scraping parameters or strategies are crucial for the successful execution of scraping requests. Choosing scraping parameters or strategies at random cannot ensure the success of every scraping request. Furthermore, in order to successfully execute the scraping requests, web scrapers must adapt and identify parameters or strategies depending on the nature of the requests, targets, proxies, etc. For instance, a particular combination of scraping parameters or strategies may not always be successful on a particular target website.
- Web scrapers are in need of methods and systems to intelligently identify and select the most effective parameters or strategies to execute their scraping requests individually. Additionally, web scrapers need the capability to analyze the clients’ requests to decide upon the optimal combination of parameters or strategies to execute the individual scraping requests. Nevertheless, implementing such methods can be resource-intensive and time-consuming for web scraping service providers.
- The present embodiments disclosed herein provide at least the following solutions: a) to intelligently choose and adapt the right combination of scraping parameters or strategies according to the nature of the individual scraping requests and their respective targets; b) to implement several machine learning algorithms to aid the process of choosing the right combination of scraping parameters; c) to evaluate and score the parameters based on their effectiveness in executing the scraping requests; d) to choose the right combination of parameters based on their cost-effectiveness.
- The embodiments presented herewith provide a system and method for choosing the right combination of data collection parameters for each data collection request originating from a user.
- The right combination of data collection parameters is achieved by implementing a machine learning algorithm.
- The chosen data collection parameters are cost-effective in executing the data collection requests.
- The present embodiments provide a system and method to generate feedback data according to the effectiveness of the data collection parameters. Additionally, the present embodiments score the data collection parameters according to the feedback data and the overall cost; the scores are then stored in an internal database.
- Figure 1 shows a block diagram of an exemplary architectural depiction of elements.
- Figure 2A is an exemplary sequence diagram showing the effective execution of a data collection (scraping) request.
- Figure 2B is the continuation of an exemplary sequence diagram showing the effective execution of a data collection (scraping) request.
- Figure 2C is the continuation of an exemplary sequence diagram showing the effective execution of a data collection (scraping) request.
- Figure 2D is the continuation of an exemplary sequence diagram showing the effective execution of a data collection (scraping) request.
- Figure 3A is an exemplary sequence diagram showing the effective execution of data collection requests by accepting one or more data collection (scraping) parameters from user device 102.
- Figure 3B is the continuation of an exemplary sequence diagram showing the effective execution of a data collection (scraping) request by accepting one or more data collection parameters from user device 102.
- Figure 3C is the continuation of an exemplary sequence diagram showing the effective execution of a data collection (scraping) request by accepting one or more data collection parameters from user device 102.
- Figure 3D is the continuation of an exemplary sequence diagram showing the effective execution of a data collection (scraping) request by accepting one or more data collection (scraping) parameters from user device 102.
- Figure 4 is an exemplary sequence diagram showing the flow of feedback data.
- Figure 5 shows a block diagram of an exemplary computing system.
- User device 102 can be any suitable user computing device including, but not limited to, a smartphone, a tablet computing device, a personal computing device, a laptop computing device, a gaming device, a vehicle infotainment device, an intelligent appliance (e.g., smart refrigerator or smart television), a cloud server, a mainframe, a notebook, a desktop, a workstation, a mobile device, or any other electronic device used for making a scraping request.
- The term “user” is used in the interest of brevity and may refer to any of a variety of entities that may be associated with a subscriber account such as, for example, a person, an organization, an organizational role within an organization and/or a group within an organization.
- User device 102 can send requests to collect data from target website(s) (represented by target 124).
- User device 102 sends the data collection requests to extractor 106, present in the service provider infrastructure 104.
- Data collection requests sent by user device 102 can be synchronous or asynchronous and may be sent in different formats.
- Service provider infrastructure (SPI) 104 is the combination of the elements comprising the platform that provides the service of collecting data from target website(s) efficiently by executing the data collection requests sent by user device 102.
- SPI 104 comprises extractor 106, extraction optimizer 110, block detector 120 and proxy rotator 108.
- Extractor 106 is an element of the service provider infrastructure 104 that, among other things, is responsible for receiving and executing the data collection requests sent by user device 102.
- One role of extractor 106 is to request a set of suitable parameters from extraction optimizer 110 for executing the data collection requests. Extractor 106 executes the data collection requests by adhering to a set of suitable parameters through an appropriate proxy server (represented by proxy 122). Upon receiving the response data from the target website(s), extractor 106 returns the data to the user device 102 or executes additional data collection activities upon identifying a discrepancy in the response data.
- Another important role of the extractor 106 is to send feedback data to extraction optimizer 110 after completing the execution of each data collection request. The feedback data contains information regarding the effectiveness of a set of suitable parameters received from extraction optimizer 110.
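The two roles of extractor 106 described above (obtaining parameters, executing the request, and reporting feedback) can be sketched as a small loop. This is a simplified reading under stated assumptions: the collaborating elements are passed in as plain callables, the retry on a ‘block’ result stands in for the "additional data collection activities", and every name, including the feedback record layout, is hypothetical.

```python
def execute_request(url, get_parameters, fetch, classify, send_feedback, max_attempts=3):
    """Run one data collection request, reporting feedback after each attempt.

    get_parameters: stand-in for asking the extraction optimizer for parameters
    fetch:          stand-in for executing the request through a proxy
    classify:       stand-in for the block detector ('block' / 'non-block')
    send_feedback:  stand-in for returning feedback data to the optimizer
    """
    for _ in range(max_attempts):
        parameters = get_parameters(url)
        response = fetch(url, parameters)
        blocked = classify(response) == "block"
        send_feedback({"url": url, "parameters": parameters, "success": not blocked})
        if not blocked:
            return response
    return None  # every attempt was classified as a block


# Hypothetical stubs: the first parameter set is blocked, the second succeeds.
feedback_log = []
parameter_sets = iter([{"proxy_type": "datacenter"}, {"proxy_type": "residential"}])
response = execute_request(
    "https://example.com/",
    get_parameters=lambda url: next(parameter_sets),
    fetch=lambda url, p: "blocked" if p["proxy_type"] == "datacenter" else "content",
    classify=lambda r: "block" if r == "blocked" else "non-block",
    send_feedback=feedback_log.append,
)
```

Passing the collaborators in as callables keeps the sketch independent of how the optimizer, proxies and block detector are actually deployed.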
- Extractor 106 may be a third-party element not present within the service provider infrastructure 104 but communicably connected to extraction optimizer 110, proxy rotator 108 and block detector 120. However, such an arrangement will not alter the functionality of extractor 106 in any way.
- Extraction optimizer 110 is an element of service provider infrastructure 104.
- Extraction optimizer 110 comprises elements that, among other things, are responsible for identifying and selecting the suitable set of parameters for each data collection request executed by extractor 106.
- Extraction optimizer 110 comprises gateway 112, optimizer 114, database 116, and valuation unit 118.
- Block detector 120 is an element of service provider infrastructure 104 and is responsible for evaluating and classifying the response data as either a ‘block’ or a ‘non-block’.
- Block detector 120 receives the response data from extractor 106 to evaluate and classify the response data, and after the classification process, block detector 120 returns the classification results to extractor 106.
- ‘Non-block’ response data contains the actual content requested by the user device 102.
- Block detector 120 may comprise multiple components (not shown) that provide the functionalities described above.
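The classification step can be illustrated with a deliberately simple, rule-based stand-in. The disclosure does not specify how block detector 120 decides, so the status codes, marker phrases, `Response` type and `classify` function below are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed signals of a blocked response; real detectors may use other criteria.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_MARKERS = ("captcha", "access denied")


@dataclass
class Response:
    status_code: int
    body: str


def classify(response):
    """Return 'block' or 'non-block' for a piece of response data."""
    if response.status_code in BLOCK_STATUS_CODES:
        return "block"
    lowered = response.body.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return "block"
    return "non-block"
```

Note that a response with status 200 can still be a block (for example, a CAPTCHA page), which is why the body is inspected as well as the status code.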
- Proxy rotator 108 is an element of service provider infrastructure 104 and is responsible for proxy control, rotation, maintenance, and collecting statistical data. Proxy rotator 108 receives requests from extractor 106 for information regarding specific proxy servers. In response to such requests, proxy rotator 108 provides appropriate proxy server information such as, for example, IP addresses to extractor 106.
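A minimal sketch of rotation with basic usage statistics is shown below. It assumes round-robin rotation within each proxy type, which the text does not mandate; the `ProxyRotator` class, the proxy type names and the IP addresses (taken from documentation-reserved ranges) are hypothetical.

```python
from itertools import cycle


class ProxyRotator:
    """Hands out proxy addresses round-robin and counts how often each is used."""

    def __init__(self, proxies):
        # proxies maps a proxy type to a non-empty list of IP addresses.
        self._cycles = {ptype: cycle(addresses) for ptype, addresses in proxies.items()}
        self.stats = {}

    def next_proxy(self, proxy_type):
        address = next(self._cycles[proxy_type])
        self.stats[address] = self.stats.get(address, 0) + 1
        return address


# Hypothetical proxy pool for illustration.
rotator = ProxyRotator({"datacenter": ["192.0.2.10", "192.0.2.11"]})
picked = [rotator.next_proxy("datacenter") for _ in range(3)]
```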
- Proxy 122 represents an exemplary multitude of proxy servers (computer systems or applications) that act as intermediary elements between extractor 106 and target 124 in executing the data collection requests. Proxy 122 receives the data collection requests from extractor 106 and forwards them to target website(s) (represented by target 124). Further, proxy 122 receives the response data sent by target 124 and forwards the response data to extractor 106.
- Target 124 represents an exemplary multitude of web servers serving content accessible through several internet protocols. Target 124 is presented here as an exemplary representation that there can be more than one target, but it should not be understood in any way as limiting the scope of the disclosure.
- Network 126 is a digital telecommunication network that allows several elements of the current embodiments to share and access resources. Examples of a network include: local-area networks (LANs), wide-area networks (WANs), campus-area networks (CANs), metropolitan-area networks (MANs), home-area networks (HANs), Intranet, Extranet, Internetwork, Internet.
- Gateway 112 is an element of the extraction optimizer 110 and is responsible for providing interoperability between the elements of extraction optimizer 110 and the elements of SPI 104.
- Interoperability denotes the continuous ability to send and receive data among the interconnected elements in a system.
- Gateway 112 receives requests for the suitable set of parameters from extractor 106 and forwards them to optimizer 114. Further, gateway 112 receives the suitable set of parameters from optimizer 114 and forwards it to extractor 106.
- Optimizer 114 is an element of extraction optimizer 110 that, among other things, is responsible for identifying and selecting the suitable parameters for executing each data collection request. Optimizer 114 obtains the information necessary to identify and select a suitable set of parameters for executing a data collection request from database 116. Moreover, optimizer 114 carries out the identification and selection of suitable sets of parameters by employing machine learning algorithms. Additionally, optimizer 114 receives the feedback data sent by extractor 106 via gateway 112 and forwards the feedback data to valuation unit 118.
- Database 116 is an element of extraction optimizer 110 and is a storage unit that stores multiple sets of parameters coupled with their respective scores received from optimizer 114.
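The text does not name the machine learning algorithm used by optimizer 114. As one possible illustration, the sketch below applies an epsilon-greedy rule (a simple reinforcement-learning-style strategy swapped in for this example) to the scored parameter sets that database 116 is described as storing. The `select_parameters` function, the tuple encoding of parameter sets and the scores are hypothetical.

```python
import random


def select_parameters(scored_sets, epsilon=0.1, rng=random):
    """Pick a parameter set from a mapping of parameter-set -> score.

    With probability `epsilon` a random set is explored; otherwise the
    highest-scoring set is exploited.
    """
    candidates = list(scored_sets)
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=scored_sets.get)


# Hypothetical database contents: (proxy type, proxy location) -> score.
scores = {
    ("datacenter", "US"): 8.5,
    ("residential", "US"): 5.0,
    ("datacenter", "DE"): 7.0,
}
best = select_parameters(scores, epsilon=0.0)
```

Occasional exploration lets lower-scored sets be retried, so their stored scores can be refreshed as feedback arrives.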
- Valuation unit 118 is an element of extraction optimizer 110 and is responsible for scoring each set of suitable parameters sent by optimizer 114.
- Valuation unit 118 may comprise configuration files. These configuration files may contain cost values for each data collection parameter.
- Valuation unit 118 may comprise computing elements capable of calculating the costs for a given set of data collection parameters.
- Valuation unit 118 is responsible for scoring the set of suitable parameters based on the feedback data and the overall cost to implement the set of suitable parameters while executing the particular data collection request.
- Valuation unit 118 may assign the scores based on a specific machine learning algorithm.
- Configuration files can be stored externally to valuation unit 118.
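One way to picture the scoring described above is a reward for success minus the summed cost of the parameters used. The formula, the cost table (standing in for the configuration files) and the function names are assumptions for illustration; the disclosure does not give a concrete scoring rule.

```python
# Hypothetical per-parameter cost values, as a configuration file might hold.
PARAMETER_COSTS = {
    ("proxy_type", "datacenter"): 1.0,
    ("proxy_type", "residential"): 5.0,
    ("proxy_location", "US"): 0.5,
}


def overall_cost(parameters):
    """Sum the configured cost of every parameter in the set (unknown ones cost 0)."""
    return sum(PARAMETER_COSTS.get((name, value), 0.0) for name, value in parameters.items())


def score(parameters, succeeded, success_reward=10.0):
    """Score a parameter set from feedback (success or failure) and overall cost."""
    reward = success_reward if succeeded else 0.0
    return reward - overall_cost(parameters)
```

Under these assumed values, a successful request with a cheap parameter set scores higher than a successful request with an expensive one, and any failed request scores below zero once its cost is subtracted.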
- The embodiments disclosed herein provide a plurality of systems and methods to optimize data collection requests by intelligently identifying suitable parameters and executing such requests efficiently. Further, the embodiments disclosed herein utilize machine learning algorithms to intelligently identify specific, cost-effective and suitable parameters to execute each data collection request originating from user device 102. These suitable parameters allow selection of a scraping strategy, which can be selected, for example, to save financial costs and, in at least one instance, to allow a cheaper strategy to be selected and implemented in lieu of another strategy implemented at a higher cost. In at least this example, the cheaper strategy provides the same or similar efficiency as the more expensive strategy.
- The cheaper strategy can include using cheaper exit nodes (e.g., data center exit nodes) to implement a user request.
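The idea of preferring a cheaper strategy when it is about as effective can be sketched as follows. The tolerance threshold, the strategy records and the `pick_strategy` function are hypothetical; the sketch only illustrates the trade-off, not the disclosed selection logic.

```python
def pick_strategy(strategies, tolerance=0.05):
    """Choose the cheapest strategy whose success rate is within
    `tolerance` of the best observed success rate."""
    best_rate = max(s["success_rate"] for s in strategies)
    eligible = [s for s in strategies if best_rate - s["success_rate"] <= tolerance]
    return min(eligible, key=lambda s: s["cost"])


# Hypothetical observed statistics for two exit-node types.
STRATEGIES = [
    {"name": "residential", "success_rate": 0.95, "cost": 5.0},
    {"name": "datacenter", "success_rate": 0.93, "cost": 1.0},
]
chosen = pick_strategy(STRATEGIES)
```

With a tolerance of zero the same function falls back to the most effective strategy regardless of cost.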
- The term ‘parameter’, as described herein, may refer to a wide range of specifications that are necessary to execute data collection requests successfully and efficiently. The terms ‘parameter(s)’ and ‘data collection parameter(s)’ may be used interchangeably.
- Parameters may include specifications about the required type of proxy, the required location of the proxy server, and the required type of operating system. To expand further, a typical list of parameters that are essential for successful data collection requests includes, but is not limited to: proxy type; proxy location; proxy ASN (Autonomous System Number);
- The embodiments disclosed herein provide a solution to identify the right combination/set of suitable and cost-effective parameters for executing each data collection request originating from user device 102.
- The embodiments demonstrated in Figure 1 show user device 102 communicating with service provider infrastructure 104 via network 126 to acquire data from target 124.
- The service provider infrastructure 104 comprises extractor 106, extraction optimizer 110, block detector 120 and proxy rotator 108.
- The extraction optimizer 110 further comprises gateway 112, optimizer 114, database 116, and valuation unit 118.
- Network 126 can be any of local-area networks (LANs), wide-area networks (WANs), campus-area networks (CANs), metropolitan-area networks (MANs), home-area networks (HANs), Intranet, Extranet, Internetwork, Internet.
- connection to network 126 may require that the user device 102, service provider infrastructure 104, proxy 122, and target 124 execute software routines that enable, for example, the seven layers of the OSI model of the telecommunication network or an equivalent in a wireless telecommunication network.
- the elements shown in Figure 1 can have alternative names or titles. Moreover, the elements shown in Figure 1 can be combined into a single element instead of two discrete elements (for example, gateway 112 and optimizer 114 can be co-located as a single element). However, the functionality and the communication flow between elements are not altered by such consolidations. Therefore, Figure 1, as shown, should be interpreted as exemplary only and not restrictive or exclusionary of other features, including features discussed in other areas of this disclosure.
- extractor 106 can communicate with outside elements such as user device 102 and proxy 122 via network 126.
- all communication between the elements occurs through standard network communication protocols such as, but not limited to, TCP/IP and UDP.
- user device 102 initially establishes a network communication channel with extractor 106 via network 126 as per standard network communication protocols, e.g., HTTP(S).
- a network communication protocol provides a system of rules that enables two or more entities in a network to exchange information.
- the protocols define rules, syntaxes, semantics and possible error recovery methods.
- Upon establishing the network communication channel with extractor 106, user device 102 sends a data collection request to collect and/or gather data from target 124.
- the data collection request is sent to extractor 106 by user device 102 via network 126.
- the data collection request may comprise a URL of the target (in this case, the URL of target 124) coupled with one or more parameters for executing the particular data collection request.
- User device 102 may include one or more parameters such as, for example, proxy location and proxy type depending upon the resources available to user device 102 and on the configuration of the extractor 106.
- Extractor 106 receives the data collection request and, in turn, requests optimizer 114 via gateway 112 for a set of suitable parameters to efficiently execute the data collection request.
- the service provider infrastructure 104 may configure extractor 106 to disregard every parameter sent by user device 102.
- extractor 106 requests optimizer 114 via gateway 112 for a complete set of suitable parameters to execute the data collection request on target 124.
- a ‘set of suitable parameters’ or a ‘complete set of suitable parameters’ or ‘complete set’ may refer to a list of specific parameters suitable to effectively execute the particular data collection request.
- the number of parameters present in a ‘complete set of suitable parameters’ depends on the policy configuration of extractor 106 and service provider infrastructure 104.
- the service provider infrastructure 104 may configure extractor 106 to accept one or more parameters sent by user device 102 and disregard the rest of the parameters.
- extractor 106 communicates the accepted parameter(s) to optimizer 114 and requests a partial set of suitable parameters.
- the partial set of suitable parameters will comprise the accepted parameter(s) coupled with several other suitable parameters necessary to effectively execute the particular data collection request.
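- the relationship between accepted parameters and a partial set can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names `filter_accepted` and `build_partial_set`, the policy represented as a set of allowed keys, and the example values are all hypothetical.

```python
def filter_accepted(user_params: dict, allowed_keys: set) -> dict:
    """Keep only the user-supplied parameters that the extractor's policy accepts.

    An empty allowed_keys models the configuration in which every
    user-supplied parameter is disregarded.
    """
    return {k: v for k, v in user_params.items() if k in allowed_keys}


def build_partial_set(accepted: dict, suggested: dict) -> dict:
    """Combine accepted user parameters with optimizer-chosen values.

    Accepted values take precedence over the optimizer's suggestions.
    """
    merged = dict(suggested)
    merged.update(accepted)
    return merged


# Hypothetical policy: only the proxy location supplied by the user is honoured.
accepted = filter_accepted(
    {"proxy_location": "US", "proxy_type": "residential"},
    allowed_keys={"proxy_location"},
)
partial = build_partial_set(accepted, {"proxy_type": "datacenter", "proxy_location": "DE"})
```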
- Optimizer 114 receives the request for a set of parameters (either for a complete set or a partial set) from extractor 106 via gateway 112.
- Optimizer 114 responds to the request received from extractor 106 by initiating the process of identifying and selecting a set of suitable parameters.
- optimizer 114 identifies and selects a set of suitable parameters by accessing database 116 and by implementing a machine learning algorithm.
- the type of machine learning algorithm used by optimizer 114 is determined by service provider infrastructure 104.
- Optimizer 114 identifies, selects and sends a set of suitable parameters to extractor 106 via gateway 112. Upon receiving the set of suitable parameters, extractor 106 requests from proxy rotator 108 the information regarding a specific proxy server contained in the particular set of suitable parameters. Extractor 106 may then amend the original data collection request according to the particular set of suitable parameters. Subsequently, extractor 106 executes the request by adhering to the set of suitable parameters received from optimizer 114.
- Proxy 122 receives the data collection request from extractor 106 and forwards the request to target 124. Consecutively, target 124 responds to the data collection request by providing appropriate response data. Specifically, proxy 122 receives the response data from target 124 and forwards the response data to extractor 106.
- Upon receiving the response data, extractor 106 sends the response data to block detector 120, which evaluates and classifies the response data as either ‘block’ or ‘non-block’. Extractor 106, among other reasons, sends the response data to block detector 120 to ascertain the effectiveness of the particular set of parameters.
- a set of suitable parameters (complete or partial), identified and selected by optimizer 114 to be effective on a target, might not be successful while executing data collection requests in every instance.
- Web targets (such as Target 124) can respond differently (i.e., can block or decline service) each time for the same set of suitable parameters.
- a ‘block’ classification implies that the response data contain a block or no valid data (or any other technical measure intended to restrict access to data or resources).
- a ‘non-block’ classification implies that the response data contain valid data and can be returned to user device 102.
- Block detector 120 sends the classification decision to extractor 106 coupled with the probability percentile for the classification decision.
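- this disclosure does not specify the algorithms used by block detector 120. Purely as an illustration of the interface described above (a label together with a probability), the following Python sketch uses simple hand-written rules. The function name `classify_response`, the status codes, the marker strings, and every probability value are placeholder assumptions, not the disclosed technique.

```python
def classify_response(status_code: int, body: str):
    """Return a (label, probability) pair for a response.

    Rule-based stand-in for a real detector; all thresholds and
    probabilities below are arbitrary placeholders.
    """
    markers = ("captcha", "access denied", "unusual traffic")  # assumed block markers
    text = body.lower()
    if status_code in (403, 429):          # commonly used to refuse or throttle clients
        return "block", 0.95
    if any(marker in text for marker in markers):
        return "block", 0.80
    if not text.strip():                   # no valid data returned
        return "block", 0.70
    return "non-block", 0.90
```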
- Extractor 106 receives and analyzes the classification decision from block detector 120. Consequently, extractor 106 prepares and sends feedback data to optimizer 114 via gateway 112. Feedback data sent by extractor 106 is intended to communicate the effectiveness of the particular set of suitable parameters in executing the data collection request on target 124.
- the effectiveness of the suitable set of parameters is said to be insufficient when the classification decision received from block detector 120 is a ‘block’.
- the effectiveness of the suitable set of parameters is said to be optimal when the classification decision received from the block detector 120 is a ‘non-block’. Therefore, according to the classification decision, extractor 106 sends the feedback data to optimizer 114 via gateway 112. Feedback data, among other things, may comprise the classification decision, URL of target 124, the particular set of suitable parameters.
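- the feedback data described above can be represented as a simple record. The Python sketch below is illustrative only; the class name `Feedback` and its field names are assumptions, while the three fields correspond to the classification decision, the URL of the target, and the set of suitable parameters.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """Feedback sent by the extractor about one executed request (illustrative)."""
    classification: str   # "block" or "non-block"
    target_url: str
    parameters: dict

    def __post_init__(self):
        # Reject labels other than the two classifications named in the text.
        if self.classification not in ("block", "non-block"):
            raise ValueError(f"unknown classification: {self.classification!r}")

    @property
    def effective(self) -> bool:
        """True when the parameter set yielded valid data."""
        return self.classification == "non-block"
```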
- extractor 106 may either: a. forward the response data to user device 102 if the classification decision is a ‘non-block’; or b. request a different set of suitable parameters from optimizer 114 via gateway 112 to effectively execute the data collection request if the classification decision is a ‘block’.
- the process of requesting a suitable set of parameters and executing the data collection request repeats until the classification decision received from block detector 120 is a ‘non-block’.
- valuation unit 118 scores the particular set of suitable parameters (either the complete set or partial set) according to the received feedback data. Specifically, valuation unit 118 scores based on the feedback data and the overall cost to implement the set of suitable parameters while executing the particular data collection request. For instance, valuation unit 118 may assign the highest score to the set of parameters that has a ‘non-block’ classification decision and lowest implementation cost. Similarly, valuation unit 118 may assign the lowest score to the set of parameters that has a ‘block’ classification decision and highest implementation cost. Valuation unit 118 may score a particular set of suitable parameters based on a specific machine learning algorithm.
- After scoring the particular set of suitable parameters, valuation unit 118 sends the score of the particular set of suitable parameters to optimizer 114.
- the score of the particular set of suitable parameters is sent to database 116 by optimizer 114, where the score is stored coupled with the particular set of suitable parameters in database 116.
- FIG. 2A is an exemplary sequence diagram showing the effective execution of a data collection request.
- user device 102 sends a data collection request intending to acquire data from target 124 to extractor 106.
- extractor 106 is present within service provider infrastructure 104.
- the data collection request sent by user device 102 may comprise multiple items of information, including but not limited to a URL of the target (in this case, the URL of target 124) and one or more data collection parameter(s) that must be adhered to while executing the particular request.
- the term ‘URL’ (Uniform Resource Locator) is a reference to a web resource that specifies the location of the web resource on a computer network and a mechanism for retrieving data from the particular web resource. Therefore, the URL of target 124 provides the address/location of target 124 on network 126 and the mechanism for accessing and retrieving data from target 124.
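- as an illustration of how the address of a target could be derived from its URL, the following Python sketch uses the standard library `urllib.parse.urlsplit`. The helper name `target_domain` and the restriction to HTTP(S) schemes are assumptions made for this example.

```python
from urllib.parse import urlsplit


def target_domain(url: str) -> str:
    """Extract the host name of a target from its URL.

    Only http and https URLs are accepted in this sketch.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a usable target URL: {url!r}")
    return parts.hostname  # urlsplit already lower-cases the host name
```

A host name obtained this way could serve as the key under which scored parameter sets are grouped per target.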
- parameters may refer to a wide range of specifications that are necessary to execute data collection requests successfully and efficiently.
- parameters may include specifications about the required type of proxy, the required location of the proxy server, and the required type of operating system.
- the data collection request originating from user device 102 may include one or more of the above-mentioned parameters such as, for example, proxy location and proxy type depending upon the resources available to user device 102 and on the configuration of the extractor 106.
- extractor 106 may be configured by service provider infrastructure 104 to disregard every parameter accompanying the URL of target 124.
- extractor 106 submits a request to gateway 112 for a complete set of suitable parameters to execute the particular data collection request on target 124 effectively.
- the request submitted by extractor 106 may comprise the URL of target 124.
- gateway 112 accepts the request from extractor 106 and forwards the request to optimizer 114.
- ‘set of suitable parameters’ or ‘complete set of suitable parameters’ or ‘complete set’, as described herein, may refer to a list of specific parameters identified to be suitable to execute a particular data collection request effectively. Moreover, the number of parameters present in a ‘set’ depends on the policy configuration of extractor 106 and service provider infrastructure 104.
- in step 207, optimizer 114, after receiving the request from gateway 112, initiates the process to identify and select the complete set of suitable parameters.
- the steps carried out by optimizer 114 to identify and select the set of suitable parameters include a) accessing and retrieving multiple sets of parameters coupled with their respective scores from database 116; b) implementing any one of the machine learning algorithms such as, for example, the Epsilon Greedy Arm algorithm to process the multiple sets of parameters and ultimately to identify and select the set of suitable parameters.
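- the two steps above can be sketched with a standard epsilon-greedy selection, shown here in Python. This is a generic illustration of the epsilon-greedy technique and not necessarily the implementation used by optimizer 114; the function name `epsilon_greedy`, the `epsilon` default, and the representation of candidates as (parameter set, score) pairs are assumptions.

```python
import random


def epsilon_greedy(scored_sets, epsilon=0.1, rng=random):
    """Pick a parameter set from a list of (parameter_set, score) pairs.

    With probability epsilon a random candidate is explored; otherwise
    the candidate with the highest stored score is exploited.
    """
    if not scored_sets:
        raise ValueError("no candidate parameter sets available")
    if rng.random() < epsilon:
        return rng.choice(scored_sets)[0]                  # explore
    return max(scored_sets, key=lambda item: item[1])[0]   # exploit


# Hypothetical candidates as they might be retrieved together with their scores.
candidates = [
    ({"proxy_type": "datacenter"}, 1.4),
    ({"proxy_type": "residential"}, 1.1),
]
best = epsilon_greedy(candidates, epsilon=0.0)
```

Occasional exploration gives low-scored or rarely tried sets a chance to be re-evaluated, which matters because a target can respond differently to the same parameters over time.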
- in step 209, optimizer 114 sends the set of suitable parameters to gateway 112. Subsequently, in step 211, gateway 112 forwards the set of suitable parameters to extractor 106.
- FIG. 2B is the continuation of an exemplary sequence diagram showing the effective execution of a data collection request.
- extractor 106 proceeds to amend the original data collection request according to the complete set of suitable parameters.
- extractor 106 may request proxy rotator 108 (not shown here) to obtain the information (such as for example the IP address) of a specific proxy server (represented here by proxy 122) contained in the complete set of suitable parameters.
- extractor 106 executes the data collection request through proxy 122. Specifically, extractor 106 sends the amended data collection request to proxy 122. Consequently, in step 217, proxy 122 receives and forwards the data collection request to target 124.
- target 124 responds to the data collection request by providing the relevant response data.
- Target 124 sends the response data to proxy 122.
- proxy 122 receives and forwards the response data to extractor 106.
- FIG. 2C is the continuation of an exemplary sequence diagram showing the effective execution of a data collection request.
- extractor 106 sends the response data to block detector 120.
- Block detector 120 receives and evaluates the response data in step 225.
- Block detector 120 may employ several advanced algorithms to evaluate the response data.
- Block detector 120 evaluates the response data by employing multiple advanced algorithms to classify the response data as either ‘block’ or ‘non-block’.
- a ‘block’ classification implies that the response data contain discrepancies or no valid data.
- a ‘non-block’ classification implies that the response data contain valid data and can be returned to user device 102.
- block detector 120 is depicted classifying the response data as ‘non-block’. Accordingly, in step 227, block detector 120 classifies the response data and in step 229 sends the classification decision to extractor 106. In step 231, after receiving the classification decision, extractor 106 prepares and sends feedback data to gateway 112. Feedback data sent by extractor 106 is intended to communicate the effectiveness of the particular set of suitable parameters in executing the data collection request on target 124.
- the effectiveness of the suitable set of parameters is said to be insufficient when the classification decision received from block detector 120 is a ‘block’.
- feedback data may comprise the classification decision received from block detector 120, URL of target 124 and the particular set of suitable parameters.
- from step 231, the process flow can occur in two concurrent directions:
- a. Gateway 112 receives the feedback data from extractor 106 and forwards the feedback data to optimizer 114 (shown in Figure 4). Figure 4 shows the corresponding steps that are performed in relation to the feedback data. Further in Figure 4, one could observe that optimizer 114 receives and forwards the feedback data to valuation unit 118, where the feedback data is scored and later stored in the database 116. The descriptions of the corresponding steps are detailed in the later parts of this disclosure.
- b. Extractor 106 forwards the response data to user device 102, shown in step 233 of Figure 2C.
- block detector 120 classifies the response data and in step 229-B sends the classification decision to extractor 106.
- block detector 120 classifies the response data as ‘block’.
- After receiving the classification decision, extractor 106 prepares and sends feedback data to gateway 112. Feedback data sent by extractor 106 is intended to communicate the effectiveness of the particular set of suitable parameters in executing the data collection request on target 124.
- the effectiveness of the suitable set of parameters is said to be insufficient when the classification decision received from block detector 120 is a ‘block’.
- the effectiveness of the suitable set of parameters is said to be optimal when the classification decision received from the block detector 120 is a ‘non-block’. Therefore, according to the classification decision, extractor 106 sends the feedback data to optimizer 114 via gateway 112.
- feedback data may comprise the classification decision received from block detector 120, URL of target 124 and the particular set of suitable parameters.
- Gateway 112 receives the feedback data from extractor 106 and forwards the feedback data to optimizer 114 (shown in Figure 4).
- Figure 4 shows the corresponding steps that are performed in relation to the feedback data. Further in Figure 4, one could observe that optimizer 114 receives and forwards the feedback data to valuation unit 118, where the feedback data is scored and later stored in the database 116. The descriptions of the corresponding steps are detailed in the later parts of this disclosure.
- in step 233-B, extractor 106 submits a new request to gateway 112 for another complete set of suitable parameters to execute the particular data collection request on target 124 effectively.
- the request submitted by extractor 106 may comprise the URL of target 124. Subsequently, steps 205 - 223 are repeated till the response data is classified as ‘non-block’ by block detector 120. After which, extractor 106 forwards the response data to user device 102.
- extractor 106 may be configured by service provider infrastructure 104 to accept one or more specific parameters accompanying the URL of target 124 and disregard other parameters.
- Figure 3A is an exemplary sequence diagram showing the effective execution of data collection requests by accepting one or more data collection parameters from user device 102.
- in step 301, user device 102 sends a data collection request intending to acquire data from target 124 to extractor 106.
- extractor 106 is present within service provider infrastructure 104.
- the data collection request sent by user device 102 may comprise multiple items of information, including but not limited to a URL of the target (in this case, the URL of target 124) and one or more data collection parameter(s) that must be adhered to while executing the particular request.
- After receiving the data collection request from user device 102, in step 303, extractor 106 submits a request to gateway 112 for a partial set of suitable parameters to execute the particular data collection request on target 124 effectively.
- the partial set of suitable parameters will comprise the accepted parameter(s) coupled with several other suitable parameters necessary to effectively execute the particular data collection request.
- the request submitted by extractor 106 may comprise the URL for target 124 and the accepted parameter(s) from user device 102.
- gateway 112 accepts the request from extractor 106 and forwards the request to optimizer 114.
- optimizer 114 initiates the process to identify and select the partial set of suitable parameters.
- the steps carried out by optimizer 114 to identify and select the partial set of suitable parameters include a) accessing and retrieving multiple sets of parameters coupled with their respective scores from database 116; b) implementing any one of the machine learning algorithms such as, for example, the Epsilon Greedy Arm algorithm to process the multiple sets of parameters and ultimately to identify and select the set of suitable parameters.
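- when a partial set is requested, the stored candidates can first be narrowed to those that agree with the accepted parameter(s). The Python sketch below illustrates that narrowing followed by a purely greedy choice; the function names `compatible` and `select_partial_set`, and the dictionary representation of parameter sets, are assumptions for illustration.

```python
def compatible(candidate: dict, accepted: dict) -> bool:
    """True when the candidate agrees with every accepted user parameter."""
    return all(candidate.get(key) == value for key, value in accepted.items())


def select_partial_set(scored_sets, accepted: dict):
    """Return the best-scoring candidate that honours the accepted parameters.

    scored_sets is a list of (parameter_set, score) pairs; None is
    returned when no stored candidate is compatible.
    """
    pool = [(params, score) for params, score in scored_sets
            if compatible(params, accepted)]
    if not pool:
        return None
    return max(pool, key=lambda item: item[1])[0]
```

In practice the greedy `max` could be replaced by whichever selection algorithm the service provider has configured, applied to the narrowed pool.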
- optimizer 114 sends the partial set of suitable parameters to gateway 112. Subsequently, in step 311, gateway 112 forwards the partial set of suitable parameters to extractor 106.
- FIG. 3B is the continuation of an exemplary sequence diagram showing the effective execution of a data collection request by accepting one or more data collection parameters from user device 102.
- extractor 106 proceeds to amend the original data collection request according to the partial set of suitable parameters.
- extractor 106 may request proxy rotator 108 (not shown here) to obtain the information (such as for example the IP address) of a specific proxy server (represented here by proxy 122) contained in the complete set of suitable parameters.
- extractor 106 sends the amended data collection request to proxy 122. Consequently, in step 317, proxy 122 receives and forwards the data collection request to target 124.
- target 124 responds to the data collection request by providing the relevant response data.
- Target 124 sends the response data to proxy 122.
- proxy 122 receives and forwards the response data to extractor 106.
- FIG. 3C is the continuation of an exemplary sequence diagram showing the effective execution of a data collection request by accepting one or more data collection parameters from user device 102.
- extractor 106 sends the response data to block detector 120.
- Block detector 120 receives and evaluates the response data in step 325.
- Block detector 120 may employ several advanced algorithms to evaluate the response data.
- Block detector 120 evaluates the response data by employing multiple advanced algorithms to classify the response data as either ‘block’ or ‘non-block’.
- a ‘block’ classification implies that the response data contain discrepancies or no valid data.
- a ‘non-block’ classification implies that the response data contain valid data and can be returned to user device 102.
- block detector 120 is depicted classifying the response data as ‘non-block’. Accordingly, in step 327, block detector 120 classifies the response data and in step 329 sends the classification decision to extractor 106. In step 331, after receiving the classification decision, extractor 106 prepares and sends feedback data to gateway 112. Feedback data sent by extractor 106 is intended to communicate the effectiveness of the particular set of suitable parameters in executing the data collection request on target 124.
- the effectiveness of the suitable set of parameters is said to be insufficient when the classification decision received from block detector 120 is a ‘block’.
- the effectiveness of the suitable set of parameters is said to be optimal when the classification decision received from the block detector 120 is a ‘non-block’. Therefore, according to the classification decision, extractor 106 sends the feedback data to optimizer 114 via gateway 112.
- feedback data may comprise the classification decision received from block detector 120, URL of target 124 and the particular set of suitable parameters.
- from step 331, the process flow can occur in two concurrent directions:
- a. Gateway 112 receives the feedback data from extractor 106 and forwards the feedback data to optimizer 114 (shown in Figure 4). Figure 4 shows the corresponding steps that are performed in relation to the feedback data. Further in Figure 4, one could observe that optimizer 114 receives and forwards the feedback data to valuation unit 118, where the feedback data is scored and later stored in the database 116. The descriptions of the corresponding steps are detailed in the later parts of this disclosure.
- b. Extractor 106 forwards the response data to user device 102, shown in step 333 of Figure 3C.
- Extractor 106 sends the response data to block detector 120 in order to ascertain the effectiveness of the partial set of suitable parameters.
- FIG. 3D shows an alternative flow to 3C, i.e., when block detector 120 classifies the response data as ‘block’.
- extractor 106 sends the response data to block detector 120.
- Block detector 120 receives and evaluates the response data in step 325-B.
- in step 327-B, block detector 120 classifies the response data and in step 329-B sends the classification decision to extractor 106.
- in Figure 3D, block detector 120 classifies the response data as ‘block’.
- in step 331-B, after receiving the classification decision, extractor 106 prepares and sends feedback data to gateway 112. Feedback data sent by extractor 106 is intended to communicate the effectiveness of the particular set of suitable parameters in executing the data collection request on target 124.
- the effectiveness of the suitable set of parameters is said to be insufficient when the classification decision received from block detector 120 is a ‘block’.
- the effectiveness of the suitable set of parameters is said to be optimal when the classification decision received from the block detector 120 is a ‘non-block’. Therefore, according to the classification decision, extractor 106 sends the feedback data to optimizer 114 via gateway 112.
- feedback data may comprise the classification decision received from block detector 120, URL of target 124 and the particular set of suitable parameters.
- Gateway 112 receives the feedback data from extractor 106 and forwards the feedback data to optimizer 114 (shown in Figure 4).
- Figure 4 shows the corresponding steps that are performed in relation to the feedback data. Further in Figure 4, one could observe that optimizer 114 receives and forwards the feedback data to valuation unit 118, where the feedback data is scored and later stored in the database 116. The descriptions of the corresponding steps are detailed in the later parts of this disclosure.
- in step 333-B, extractor 106 submits a new request to gateway 112 for another complete set of suitable parameters to execute the particular data collection request on target 124 effectively.
- the request submitted by extractor 106 may comprise the URL of target 124. Subsequently, steps 305 - 323 are repeated till the response data is classified as ‘non-block’ by block detector 120. After which, extractor 106 forwards the response data to user device 102.
- FIG. 4 is an exemplary sequence diagram showing the flow of feedback data.
- extractor 106 prepares and sends the feedback data to gateway 112 after receiving the classification decision from block detector 120.
- gateway 112 forwards the feedback data to optimizer 114.
- optimizer 114 receives and forwards the feedback data to valuation unit 118.
- valuation unit 118 receives the feedback data and begins the process of calculating the overall cost for the particular set of suitable parameters (the set can either be partial or complete).
- the configuration files present within the valuation unit 118 provide the cost information for each parameter present in the particular set of suitable parameters.
- valuation unit 118 calculates the overall cost for the particular set of suitable parameters.
- Feedback data may comprise the classification decision received from block detector 120, URL of target 124 and the particular set of suitable parameters.
- the valuation unit 118 can access an external element to receive the cost information for each parameter present in the particular set of suitable parameters.
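- the overall cost calculation can be illustrated as a lookup and sum over per-parameter costs. In the Python sketch below, the table `COSTS`, its cost figures, and the function name `overall_cost` are hypothetical; an actual deployment would take the cost information from its own configuration files or external element.

```python
# Hypothetical per-parameter cost table (arbitrary units, invented for illustration).
COSTS = {
    "proxy_type": {"datacenter": 1.0, "residential": 5.0},
    "proxy_location": {"DE": 0.5, "US": 1.0},
}


def overall_cost(parameters: dict, cost_table: dict, default: float = 0.0) -> float:
    """Sum the configured cost of every parameter value in the set.

    Parameters or values missing from the table contribute `default`.
    """
    total = 0.0
    for name, value in parameters.items():
        total += cost_table.get(name, {}).get(value, default)
    return total
```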
- valuation unit 118 scores the particular set of suitable parameters. Specifically, valuation unit 118 scores the set of suitable parameters based on the feedback data and the overall cost to implement the set of suitable parameters while executing the particular data collection request. Certain parameters (such as, for example, using a particular type of proxy server or using a proxy server from a certain geo-location) can be expensive; therefore, optimizer 114 must be able to identify and select the set of parameters that is both suitable and cost-effective, i.e., economical to implement.
- valuation unit 118 may assign the highest score to the set of parameters that has a ‘non-block’ classification decision and lowest implementation cost. Similarly, valuation unit 118 may assign the lowest score to the set of parameters that has a ‘block’ classification decision and highest implementation cost. Moreover, valuation unit 118 may assign the scores based on a specific machine learning algorithm.
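- one possible scoring rule consistent with the ordering described above is sketched below in Python. The function name `score_parameter_set`, the normalisation by `max_cost`, and the 0.5 weight on cost are assumptions chosen so that any ‘non-block’ result outranks any ‘block’ result; the disclosure leaves the actual scoring algorithm open.

```python
def score_parameter_set(classification: str, cost: float, max_cost: float) -> float:
    """Score a parameter set from its classification decision and overall cost.

    A 'non-block' contributes 1.0; cheapness contributes up to 0.5, so the
    cost term can only reorder sets that share the same classification.
    """
    if max_cost <= 0:
        raise ValueError("max_cost must be positive")
    success = 1.0 if classification == "non-block" else 0.0
    clamped = min(max(cost, 0.0), max_cost)        # keep cost within [0, max_cost]
    cheapness = 1.0 - clamped / max_cost           # 1.0 = cheapest, 0.0 = most expensive
    return success + 0.5 * cheapness
```

With this rule, a ‘non-block’ set at the lowest cost receives the highest score and a ‘block’ set at the highest cost receives the lowest.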
- valuation unit 118 sends the assigned score to optimizer 114.
- optimizer 114 receives and forwards the score to database 116.
- the assigned score is stored coupled with the particular set of suitable parameters in database 116.
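- storing a score together with its set of parameters can be sketched with Python's built-in `sqlite3` module. The table layout, the function names, and the use of the target's domain as part of the key are assumptions for illustration; database 116 may be any suitable store.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a score store with one row per (domain, parameter set)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        " domain TEXT NOT NULL,"
        " params TEXT NOT NULL,"
        " score REAL NOT NULL,"
        " PRIMARY KEY (domain, params))"
    )
    return conn


def save_score(conn, domain: str, params: dict, score: float) -> None:
    """Store the score coupled with its parameter set, replacing any older score."""
    key = json.dumps(params, sort_keys=True)   # canonical text form of the set
    conn.execute(
        "INSERT OR REPLACE INTO scores (domain, params, score) VALUES (?, ?, ?)",
        (domain, key, score),
    )
    conn.commit()


def load_scores(conn, domain: str):
    """Return (parameter_set, score) pairs for a domain, best score first."""
    rows = conn.execute(
        "SELECT params, score FROM scores WHERE domain = ? ORDER BY score DESC",
        (domain,),
    )
    return [(json.loads(params), score) for params, score in rows]
```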
- the embodiments herein may be combined or collocated in a variety of alternative ways due to design choice. Accordingly, the features and aspects herein are not in any way intended to be limited to any particular embodiment. Furthermore, the embodiments can take the form of hardware, firmware, software, and/or combinations thereof. In one embodiment, such software includes but is not limited to firmware, resident software, microcode, etc.
- Figure 5 illustrates a computing system 500 in which a computer readable medium 506 may provide instructions for performing any methods and processes disclosed herein.
- aspects of the embodiments herein can take the form of a computer program product accessible from the computer readable medium 506 to provide program code for use by or in connection with a computer or any instruction execution system.
- the computer readable medium 506 can be any apparatus that can tangibly store the program code for use by or in connection with the instruction execution system, apparatus, or device, including the computing system 500.
- the computer readable medium 506 can be any tangible electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device). Some examples of a computer readable medium 506 include solid state memories, magnetic tapes, removable computer diskettes, random access memories (RAM), read-only memories (ROM), magnetic disks, and optical disks. Some examples of optical disks include read only compact disks (CD-ROM), read/write compact disks (CD-R/W), and digital versatile disks (DVD).
- the computing system 500 can include one or more processors 502 coupled directly or indirectly to memory 508 through a system bus 510.
- the memory 508 can include local memory employed during actual execution of the program code, bulk storage, and/or cache memories, which provide temporary storage of at least some of the program code in order to reduce the number of times the code is retrieved from bulk storage during execution.
- I/O devices 504 can be coupled to the computing system 500 either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the computing system 500 to enable the computing system 500 to couple to other data processing systems, such as through host systems interfaces 512, printers, and/or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just examples of network adapter types.
- this disclosure presents a method to optimize a scraping request by identifying suitable parameters while executing the scraping request, the method comprising: a) executing a scraping request; b) receiving a result of the scraping request, wherein the result comprises at least: a classification decision, a target domain, and a set of scraping request parameters, wherein the set of scraping request parameters comprises either a full set, which includes an entirety of the scraping request parameters, or a partial set, which includes less than the entirety of the scraping request parameters; c) scoring the set of scraping request parameters to form a scored set of scraping request parameters; d) storing the scored set of scraping request parameters, related to the target domain, with respective scoring results in a database; e) selecting from the database a subsequent scored set of scraping request parameters comprising either the full set or a subsequent scored partial set, which includes less than the entirety of the scraping request parameters and which is not identical to the scored set of scraping request parameters, for a subsequent scraping
- the classification decision in at least one exemplary disclosed method can be a ‘block’ response or a ‘non-block’ response.
- the set of scraping request parameters receiving the ‘non-block’ response of the classification decision receives a higher score than the set of scraping request parameters receiving the ‘block’ response.
- the set of scraping request parameters receiving the ‘block’ response of the classification decision receives a lower score than the set of scraping request parameters receiving the ‘non-block’ response. If the classification decision receives the ‘block’ response for the subsequent scraping request, the method to optimize a scraping request by identifying suitable parameters while executing the scraping request is repeated with a new set of scored scraping request parameters.
- the method to optimize a scraping request by identifying suitable parameters while executing the scraping request is repeated for the subsequent scraping request until the classification decision is the ‘non-block’ response or a maximum threshold of attempts is reached.
- the data from the ‘non-block’ response is used for future scraping actions.
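The repeat-until-success behaviour described above can be sketched as a bounded retry loop. This is only an illustration: the function name `scrape_with_retries`, the `execute` callback, and the default of three attempts are assumptions and do not come from the disclosure, which only states that the method repeats until a 'non-block' response is received or a maximum threshold of attempts is reached.

```python
from itertools import islice
from typing import Callable, Iterable, Optional


def scrape_with_retries(
    candidates: Iterable[dict],
    execute: Callable[[dict], str],
    max_attempts: int = 3,  # hypothetical threshold; the source gives no value
) -> Optional[dict]:
    """Try candidate parameter sets in order until one is not blocked.

    `execute` is a hypothetical callback that runs the scraping request
    with the given parameters and returns 'block' or 'non-block'.
    """
    for params in islice(candidates, max_attempts):
        if execute(params) == "non-block":
            # This set worked, so it can be reused for future requests.
            return params
    # Every attempted set was blocked, or there were no candidates.
    return None
```

A caller would pass candidate sets in the order the selection step ranks them, so the loop stops at the first set that gets through.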
- the exemplary method further discloses that the scoring of the set of scraping request parameters is affected by an amount of overall scraping request cost calculated for the used parameters.
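One way to combine the classification decision with the request cost is sketched below. The formula is an assumption made for illustration: the disclosure says only that a 'non-block' set scores higher than a 'block' set and that cost affects the score. Here the cost is assumed to be normalized to the range 0 to 1, and `cost_weight` is a hypothetical tuning knob kept below 1 so that a 'non-block' result always outranks a 'block' result.

```python
def score_parameter_set(decision: str, cost: float, cost_weight: float = 0.5) -> float:
    """Score a parameter set from its classification decision and cost.

    Assumptions (not from the source): cost is normalized to [0, 1] and
    the penalty is linear in cost.
    """
    if decision not in ("block", "non-block"):
        raise ValueError(f"unknown classification decision: {decision!r}")
    if not 0.0 <= cost_weight < 1.0:
        raise ValueError("cost_weight must be in [0, 1)")
    clamped_cost = min(max(cost, 0.0), 1.0)
    base = 1.0 if decision == "non-block" else 0.0
    # Cheaper sets score higher within the same classification decision.
    return base - cost_weight * clamped_cost
```

With these assumptions the most expensive 'non-block' set scores at least 0.5, while the cheapest 'block' set scores at most 0.0, so the ordering required by the text is preserved.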
- the partial set of the scraping request parameters comprises a single scraping request parameter or comprises a combination of any of the following parameters: proxy type; proxy location; proxy ASN (Autonomous System Number); operating system preference; browser preference; conditions for headers; Hypertext Transfer Protocol (HTTP) protocol type and version.
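The distinction between a full set and a partial set of the listed parameters can be modelled with a small record type. This is a minimal sketch: the class name `ScrapingParams`, the field names, and the field types are assumptions chosen to mirror the listed parameters, not identifiers from the disclosure.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ScrapingParams:
    """Hypothetical container; None means the parameter is not specified."""

    proxy_type: Optional[str] = None
    proxy_location: Optional[str] = None
    proxy_asn: Optional[int] = None
    os_preference: Optional[str] = None
    browser_preference: Optional[str] = None
    header_conditions: Optional[tuple] = None
    http_version: Optional[str] = None

    def specified(self) -> dict:
        """Return only the parameters that carry a value."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    def is_full(self) -> bool:
        """A full set specifies every parameter; anything less is partial."""
        return len(self.specified()) == len(fields(self))
```

A partial set may hold a single parameter, for example `ScrapingParams(proxy_location="DE")`, or any combination of them.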
- the exemplary method discloses how the machine learning algorithm is modified as new scores for the scraping request parameters are recorded within different sets of parameters. Also, the method teaches that the subsequent scored set of the scraping request parameters can be identical to the set of the scraping request parameters.
- the disclosure also presents an exemplary method to increase a quality of data scraping from the internet comprising: a) receiving, by an extractor, a scraping request to a target domain from a user device via a network; b) requesting, by the extractor, from an optimizer via a gateway, a set of scraping request parameters to execute the scraping request; c) receiving, by the optimizer, the scraping request, from the extractor via the gateway; d) responding, by the optimizer, to the scraping request by initiating a process of identifying and selecting a set of the scraping request parameters, with the set comprising either a full set, which includes an entirety of the scraping request parameters, or a partial set, which includes less than the entirety of the scraping request parameters, wherein the optimizer identifies and selects the set of the scraping request parameters by accessing a database and by applying a machine learning algorithm; e) sending, by the optimizer, the set of the scraping request parameters to the extractor via
- the feedback data is intended to communicate the effectiveness of the set of the scraping request parameters in executing the scraping request to the target domain.
- the feedback data contains at least one of the following: classification decision, target domain, or the set.
- the scraping request comprises at least a URL of the target domain or one or more of the scraping request parameters.
- the extractor, upon receiving a subsequent scraping request, disregards every scraping parameter indicated in the subsequent scraping request and requests a complete set of scraping parameters from an optimizer to execute the scraping request on the target domain.
- the extractor uses one or more of the scraping request parameters from the subsequent scraping request and requests a partial selection of the scraping request parameters from the optimizer to execute the scraping request on the target domain.
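The two extractor behaviours above (ignore all user-supplied parameters, or keep some and ask the optimizer for the rest) can be sketched as one merge step. The names `ALL_PARAMETERS`, `build_parameter_set`, and `accepted_names`, and the shape of the `optimizer` callback, are assumptions made for illustration; the disclosure does not define a programming interface.

```python
from typing import Callable, Iterable

# Hypothetical names for the parameters listed in the text.
ALL_PARAMETERS = (
    "proxy_type",
    "proxy_location",
    "proxy_asn",
    "os_preference",
    "browser_preference",
    "header_conditions",
    "http_version",
)


def build_parameter_set(
    user_params: dict,
    accepted_names: Iterable[str],
    optimizer: Callable[[list, dict], dict],
) -> dict:
    """Merge accepted user parameters with optimizer suggestions.

    An empty `accepted_names` means every user parameter is disregarded
    and a complete set is requested from the optimizer.
    """
    allowed = set(accepted_names)
    accepted = {
        name: value
        for name, value in user_params.items()
        if name in allowed and name in ALL_PARAMETERS
    }
    missing = [name for name in ALL_PARAMETERS if name not in accepted]
    # The optimizer sees the accepted values so it can choose compatible ones.
    suggested = optimizer(missing, accepted)
    merged = {name: suggested[name] for name in missing}
    merged.update(accepted)
    return merged
```

Which user parameters are honoured is left to the caller here, on the assumption that this is a policy decision of the extractor.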
- the disclosure presents an exemplary method to optimize a scraping request by identifying suitable parameters while executing the scraping request, the method comprising: a) executing a scraping request; b) receiving a result of the scraping request, wherein the result comprises at least: a classification decision, a target domain, and a set of scraping request parameters, wherein the set of scraping request parameters comprises either a full set, which includes an entirety of the scraping request parameters, or a partial set, which includes less than the entirety of the scraping request parameters; c) scoring the set of scraping request parameters to form a scored set of scraping request parameters; d) storing the scored set of scraping request parameters, related to the target domain, with respective scoring results in a database; e) selecting by the machine learning algorithm from the database a subsequent scored set of scraping request parameters comprising either the full set or a subsequent scored partial set, which includes less than the entirety of the scraping request parameters and which includes at least one scraping request parameter distinct from the scored
- an element preceded by “includes ... a” or “contains ... a” does not, without additional constraints, preclude the existence of additional identical elements in the process, method, article, and/or apparatus that comprises, has, includes, and/or contains the element.
- the terms “a” and “an” are defined as one or more unless explicitly stated otherwise herein.
- the terms “approximately”, “about”, or any other version thereof are defined as being close to, as understood by one of ordinary skill in the art.
- a device or structure that is “configured” in a certain way is configured in at least that way but may also be configured in ways that are not listed.
- a singular or plural form can be used, but it does not limit the scope of the disclosure and the same teaching can apply to multiple objects, even if in the current application an object is referred to in its singular form.
- a method to optimize data collection requests by intelligently identifying suitable parameters while executing the particular data collection request, comprising: generating feedback data according to a result of a scraping request, wherein the feedback data comprises at least the classification decision received, a URL of a target, and a particular set of suitable parameters; setting an overall cost to implement the set of suitable parameters; scoring the particular set of suitable parameters according to the received feedback data and the overall cost, wherein either the complete set or a partial set of parameters can be scored and wherein the highest score is assigned to the set of parameters that has a ‘non-block’ classification decision and the lowest implementation cost; storing the scoring results in a database; identifying the complete set of suitable parameters for the following scraping request by: accessing and retrieving multiple sets of parameters coupled with their respective scores from the database; and implementing any one of the Machine Learning algorithms to process the multiple sets of parameters; and selecting the suitable parameters to implement the following scraping request.
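The store-then-select steps of this method can be sketched with an in-memory SQLite table and a simple selection rule. The text leaves the choice of Machine Learning algorithm open, so the epsilon-greedy rule below is a stand-in chosen for brevity, not the algorithm of the disclosure. The table layout and the function names `open_store`, `record_score`, and `select_params` are likewise assumptions.

```python
import json
import random
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a score store; the schema is a hypothetical minimal layout."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores (domain TEXT, params TEXT, score REAL)"
    )
    return conn


def record_score(conn: sqlite3.Connection, domain: str, params: dict, score: float) -> None:
    """Store one scored parameter set for a target domain."""
    # sort_keys gives identical sets an identical text key.
    key = json.dumps(params, sort_keys=True)
    conn.execute("INSERT INTO scores VALUES (?, ?, ?)", (domain, key, score))
    conn.commit()


def select_params(conn: sqlite3.Connection, domain: str, epsilon: float = 0.1, rng=random):
    """Pick a parameter set for the next request to `domain`.

    Epsilon-greedy stand-in: usually return the set with the best average
    score, but with probability `epsilon` return a random known set.
    """
    rows = conn.execute(
        "SELECT params, AVG(score) FROM scores WHERE domain = ? "
        "GROUP BY params ORDER BY AVG(score) DESC, params",
        (domain,),
    ).fetchall()
    if not rows:
        return None  # no history for this domain yet
    chosen = rng.choice(rows) if rng.random() < epsilon else rows[0]
    return json.loads(chosen[0])
```

Setting `epsilon=0.0` makes the choice purely greedy, which is convenient for testing; a non-zero value lets lower-scored sets be retried occasionally so their scores can be updated.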
- Aspect 2 The method of aspect 1, wherein the effectiveness of the suitable set of parameters is insufficient when the classification decision received from the block detector is a ‘block’.
- Aspect 3 The method of aspect 1, wherein the effectiveness of the suitable set of parameters is optimal when the classification decision received from the block detector is a ‘non-block’.
- Aspect 4 The method of aspect 1, wherein receiving a ‘block’ response for the following scraping request initiates a repetition of the selection of a new set of suitable parameters and the implementation of the following scraping request is performed with the new set of suitable parameters.
- Aspect 5 The method of aspect 4, wherein the selection of the suitable parameters repeats until the response data is classified as ‘non-block’.
- Aspect 6 The method of aspect 5, wherein the ‘non-block’ response data is forwarded to the user device.
- Aspect 7 The method of aspect 1, wherein the list of parameters that are essential for successful data collection requests are, but are not limited to: proxy type; proxy location; proxy ASN (Autonomous System Number); operating system preference; browser preference; conditions for headers; protocol type and version.
- a method to increase the quality of data scraping from the Internet comprising: a. receiving by extractor the data collection request from the user device via network; b. requesting by extractor from optimizer via gateway for a set of suitable parameters to efficiently execute the data collection request; c. receiving by optimizer the request for a set of parameters, that can be either for a complete set or a partial set, from extractor via gateway; d. responding by optimizer to the request by initiating the process of identifying and selecting a set of suitable parameters, specifically, optimizer identifies and selects a set of suitable parameters by accessing database and by implementing a Machine Learning algorithm; e.
- extractor does one of the following: forwards the response data to user device if the classification decision is a ‘non-block’, or requests a different set of suitable parameters from optimizer via gateway and repeats the data extraction process.
- Aspect 9 The method of aspect 8, wherein feedback data sent by extractor is intended to communicate the effectiveness of the particular set of suitable parameters in executing the data collection request on target.
- Aspect 10 The method of aspect 9, wherein the collection request, among other things, may comprise a URL of the target coupled with one or more parameters for executing the particular data collection request.
- Aspect 11 The method of aspect 8, wherein the extractor disregards every parameter sent by user device and requests from the optimizer a complete set of suitable parameters to execute the data collection request on target.
- Aspect 12 The method of aspect 11, wherein a complete set of suitable parameters refers to a list of specific parameters suitable to effectively execute the particular data collection request and wherein the number of parameters present in a complete set of suitable parameters depends on the policy configuration of extractor and service provider infrastructure.
- Aspect 13 The method of aspect 8, wherein the extractor accepts one or more parameters sent by user device, disregards the rest of the parameters, and communicates the accepted parameter(s) to optimizer while requesting a partial set of suitable parameters.
- Aspect 14 The method of aspect 13, wherein the partial set of suitable parameters comprises the accepted parameter(s) coupled with several other suitable parameters necessary to effectively execute the particular data collection request.
- Aspect 15 Methods of optimizing data collection as shown and/or described herein.
- Aspect 16 Methods of increasing quality of data scraping as shown and/or described herein.
- Aspect 17 Systems, apparatus, devices, or combinations for optimizing data collection as shown and/or described herein.
- Aspect 18 Systems, apparatus, devices, or combinations for increasing quality of data scraping as shown and/or described herein.
- Aspect 20 Methods, systems, apparatus, devices, or combinations to generate feedback data based upon the effectiveness of the data collection parameters as shown and/or described herein.
- Aspect 21 Methods, systems, apparatus, devices, or combinations to score a set of suitable parameters as shown and/or described herein.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IL308560A IL308560A (en) | 2021-08-24 | 2022-08-03 | Adaptive data collection optimization |
CN202280035855.0A CN117321588A (en) | 2021-08-24 | 2022-08-03 | Adaptive data collection optimization |
EP22761475.7A EP4315109A1 (en) | 2021-08-24 | 2022-08-03 | Adaptive data collection optimization |
CA3214792A CA3214792A1 (en) | 2021-08-24 | 2022-08-03 | Adaptive data collection optimization |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163260530P | 2021-08-24 | 2021-08-24 | |
US63/260,530 | 2021-08-24 | ||
US17/454,074 | 2021-11-09 | ||
US17/454,074 US11314833B1 (en) | 2021-08-24 | 2021-11-09 | Adaptive data collection optimization |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023025552A1 (en) | 2023-03-02 |
Family
ID=83149566
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2022/071835 WO2023025552A1 (en) | 2021-08-24 | 2022-08-03 | Adaptive data collection optimization |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023025552A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10965770B1 (en) * | 2020-09-11 | 2021-03-30 | Metacluster It, Uab | Dynamic optimization of request parameters for proxy server |
- 2022-08-03 WO PCT/EP2022/071835 patent/WO2023025552A1/en active Application Filing
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22761475; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 3214792; Country of ref document: CA |
| | WWE | Wipo information: entry into national phase | Ref document number: 2022761475; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2022761475; Country of ref document: EP; Effective date: 20231026 |
| | WWE | Wipo information: entry into national phase | Ref document number: 308560; Country of ref document: IL |
| | WWE | Wipo information: entry into national phase | Ref document number: 202280035855.0; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: MX/A/2023/014134; Country of ref document: MX |
| | NENP | Non-entry into the national phase | Ref country code: DE |