CN112256944A

CN112256944A - Automatic website data crawling method based on JMeter

Info

Publication number: CN112256944A
Application number: CN202011156240.2A
Authority: CN
Inventors: 杨雪梅; 唐军; 刘楚雄
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2020-10-26
Filing date: 2020-10-26
Publication date: 2021-01-22

Abstract

The invention relates to the field of web technology, and in particular to a JMeter-based method for automatically crawling website data. It avoids the large number of complex JS operations involved in crawling data from a front-end interface, and also avoids the lengthy crawling process or outright crawling failure caused by some websites limiting the number of HTTP requests and the access frequency within a given period. The technical scheme is as follows: a target website from which data is to be crawled is determined; the target website is analyzed to obtain a data interface and the attribute information corresponding to that interface; the data interface is executed in JMeter, and the request parameters and corresponding response results are checked against the expected settings; if they conform, the parameters of the data interface, the field parameters extracted from its responses, and the output target file are all dynamically configured; once this dynamic configuration is in place, an anti-crawling mechanism is set and data crawling begins. The method is applicable to the automatic crawling of website data.

Description

Automatic website data crawling method based on JMeter

Technical Field

The invention relates to the field of web technology, and in particular to a JMeter-based method for automatically crawling website data.

Background

A website data crawler is a program that automatically extracts data from website pages. It can capture specific data displayed on a page and store it in a local file or a database for use by other projects or for developing particular functions, for example collecting video resources from movie websites, commodity names and prices from shopping websites, or article titles and contents from novel websites. Crawlers are widely used in practical projects and play an irreplaceable role in many web development projects and in data support.

Existing website page data crawling is based on extracting page elements from the front-end interface. Its advantage is that the data is visible and intuitive, so the data to be crawled can be identified more clearly; its drawbacks, however, are equally obvious.

In the existing website data crawling process, as shown in Fig. 1, a browser that simulates user requests is set up locally, and an HTTP request is sent through the browser to obtain the HTML page required by the service. After the browser has loaded the HTML page, it sends further HTTP requests to load the JS files embedded in the page and renders the page. Once the JS files have loaded, code can be written to simulate the mouse operations of a real user, and the corresponding information is obtained by completing the relevant simulated operations.

However, this scheme has the following obvious problems. First, many websites limit the number of HTTP requests and the access frequency within a given period, so data crawling fails easily and the whole process takes longer. Second, crawling data that requires JS operations on page elements triggers multiple HTTP requests to load the JS files embedded in the page; in particular, when fetching the content of nested pages whose front-end page elements are irregular, the page data is loaded repeatedly, which consumes a large amount of network resources and makes the crawling more complicated.

Disclosure of Invention

The invention aims to provide a JMeter-based method for automatically crawling website data that avoids the large number of complex JS operations involved in crawling data from a front-end interface, and also avoids the lengthy crawling process or outright crawling failure caused by some websites limiting the number of HTTP requests and the access frequency within a given period.

To achieve this aim, the invention adopts the following technical scheme. The JMeter-based method for automatically crawling website data comprises the following steps:

step (1), determining a target website needing data crawling;

step (2), carrying out data analysis on the target website to obtain a data interface and attribute information corresponding to the data interface;

step (3), executing a data interface at the JMeter end, checking whether the request parameters and the corresponding response results in the data interface conform to expected settings, if so, entering step (4), otherwise, debugging the data interface at the JMeter end;

step (4), dynamically configuring parameters of the data interface, dynamically configuring response extraction field parameters of the data interface, and dynamically configuring an output target file;

step (5), after setting the corresponding dynamic configuration, setting an anti-crawling mechanism;

step (6), crawling data in batches, and outputting and saving the data to the target file.

Further, in step (2), the attribute information corresponding to the data interface includes: request address, request parameters, request type, request header, and request body.

Further, in step (4), dynamically configuring the data interface includes dynamically configuring parameters in the form of variables.

Further, in step (4), the specific method for extracting the field parameters includes: adding a post-processor after the data interface, and selecting a JSON extractor and/or a regular expression extractor and/or an XPath extractor to extract the parameters.

Further, in step (4), the specific method for dynamically configuring the output target file includes: adding user parameters or custom variables before the request is executed, and configuring the file path and file name accordingly.

Further, in step (5), the specific method for setting the anti-crawling mechanism includes: adding a fixed timer under the request execution directory, the delay of the fixed timer being random and variable but always between 100 ms and 1 s, so that each interface request waits for a random period before running; by giving the requests different intervals, the irregular requests made by a user at different times are simulated, which prevents the requests from being blocked by the system.

Further, in step (6), the process of crawling data in batches further includes preventing the same request from being executed repeatedly, and the specific method includes: finding the target data and the target page count to be crawled by analyzing the interface response data; placing a loop controller at the page-number request level and setting its loop count according to the target page count; adding a counter under the loop controller with its increment set to 1, so that the counter increases by one each time the request is executed; and ending execution when the output value of the counter equals the target page count.

Further, in the step (6), the process of crawling data in batch further includes crawling of nested web page data, where the nested page includes a first-level page and a second-level page, and the specific method for crawling of the nested web page data includes:

step 601, executing the first-level page interface, and obtaining all commodity identifiers in the current page list through a JSON extractor to obtain a commodity identifier array;

step 602, adding a ForEach logic controller under the directory level of the loop controller and the page request, wherein the input of the logic controller is the commodity identifier array and its output is the specific identifier of each commodity;

step 603, iterating over the specific identifier of each commodity and the corresponding commodity detail interface request through the ForEach logic controller, and completing the saving of the nested page's target data through JSON extraction and file output in the post-processor.

Further, in step (6), the specific method for crawling data in batches and outputting and saving the data includes: adding a post-processor, the BeanShell PostProcessor, under the request data level, obtaining the parameters through the vars.get method, expanding the commodity identifier array in the BeanShell PostProcessor to obtain the target data, and saving the target data to the target file in sequence.

Because the invention extracts data from the responses of the back-end interface, it avoids the large number of complex JS operations involved in crawling data from the front-end interface. Because an anti-crawling mechanism is set and the interval between crawl requests is user-defined, it avoids the lengthy crawling process or outright crawling failure caused by some websites limiting the number of HTTP requests and the access frequency within a given period. The JMeter-based data interface configuration prevents the same request from being executed repeatedly and supports the capture of nested web page data, which greatly improves crawling efficiency.

Drawings

Fig. 1 is a schematic flow chart of a method for crawling data through a front-end interface in the prior art.

Fig. 2 is a flow chart of the JMeter-based method for automatically crawling website data according to the invention.

Detailed Description

The invention discloses a JMeter-based website data automatic crawling method, which comprises the following steps:

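step (1), determining a target website needing data crawling;

step (2), carrying out data analysis on the target website to obtain a data interface and attribute information corresponding to the data interface;

step (3), executing the data interface at the JMeter end, and checking whether the request parameters and the corresponding response results in the data interface conform to the expected settings; if so, entering step (4), otherwise debugging the data interface at the JMeter end;

step (4), dynamically configuring the parameters of the data interface, dynamically configuring the field parameters extracted from the response of the data interface, and dynamically configuring the output target file;

step (5), after setting the corresponding dynamic configuration, setting an anti-crawling mechanism;

step (6), crawling data in batches, and outputting and saving the data to the target file.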

In step (2), the attribute information corresponding to the data interface includes: request address, request parameters, request type, request header, and request body.

In the step (4), the dynamic configuration of the data interface includes dynamic configuration of parameters in a variable form.

In step (4), the specific method for extracting the field parameters includes: adding a post-processor after the data interface, and selecting a JSON extractor and/or a regular expression extractor and/or an XPath extractor to extract the parameters.

In step (4), the specific method for dynamically configuring the output target file includes: adding user parameters or custom variables before the request is executed, and configuring the file path and file name accordingly.

In step (5), the specific method for setting the anti-crawling mechanism includes: adding a fixed timer under the request execution directory, the delay of the fixed timer being random and variable but always between 100 ms and 1 s, so that each interface request waits for a random period before running; by giving the requests different intervals, the irregular requests made by a user at different times are simulated, which prevents the requests from being blocked by the system.

In step (6), the process of crawling data in batches further includes preventing the same request from being executed repeatedly, and the specific method includes: finding the target data and the target page count to be crawled by analyzing the interface response data; placing a loop controller at the page-number request level and setting its loop count according to the target page count; adding a counter under the loop controller with its increment set to 1, so that the counter increases by one each time the request is executed; and ending execution when the output value of the counter equals the target page count.

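In step (6), the process of crawling data in batches further includes crawling of nested web page data, the nested page includes a first-level page and a second-level page, and the specific method for crawling the nested web page data includes:

step 601, executing the first-level page interface, and obtaining all commodity identifiers in the current page list through a JSON extractor to obtain a commodity identifier array;

step 602, adding a ForEach logic controller under the directory level of the loop controller and the page request, wherein the input of the logic controller is the commodity identifier array and its output is the specific identifier of each commodity;

step 603, iterating over the specific identifier of each commodity and the corresponding commodity detail interface request through the ForEach logic controller, and completing the saving of the nested page's target data through JSON extraction and file output in the post-processor.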

In step (6), the specific method for crawling data in batches and outputting and saving the data includes: adding a post-processor, the BeanShell PostProcessor, under the request data level, obtaining the parameters through the vars.get method, expanding the commodity identifier array in the BeanShell PostProcessor to obtain the target data, and saving the target data to the target file in sequence.

The scheme is described in further detail below with reference to Fig. 2 and a specific embodiment. The specific workflow of the JMeter-based method for automatically crawling website data according to the invention is as follows:

Target interface acquisition and debugging. To collect data from a specific website, the first step is to obtain the interface that produces the data. Take a shopping website as an example: suppose the user needs to crawl five parameters (commodity title, commodity introduction, commodity address, commodity price, and current time) together with the commodity review content under each commodity detail page, for later data comparison or project data support. The source interface of the data is found through page data analysis and the browser's F12 developer tools, and the relevant attributes of the interface are obtained: request address, request parameters, request type, request header, and request body. The obtained interface is then executed in JMeter, and the request parameters and response results are checked against expectations; if they do not meet expectations, the data interface is debugged in JMeter. Here the commodity list page is the first-level page and the commodity detail page is the second-level page, and together the two form a nested page.

Interface configuration. Once the interface has been obtained successfully, the next consideration is how to acquire and save data in batches so as to achieve automatic crawling. The preliminary operations must be completed first: dynamic configuration of the interface parameters, dynamic configuration of the field parameters extracted from the interface response, and dynamic configuration of the output file.

The purpose of parameterizing the request is to make execution of the interface more flexible. Crawling website data may involve multiple categories, multiple states and multiple page numbers, so the interface should not be hard-coded; configuring the parameters as variables makes the crawling script more flexible. For example, if the categories women's wear, children's wear and jackets are selected on the current shopping website and the first 10 pages of data sorted by sales volume are wanted, then values such as the category (women's wear, children's wear, jackets), the sort order (sales volume) and the page number can be configured as variables in JMeter, which makes later changes to the category and similar settings easier.
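The idea of keeping the request free of hard-coded values can be illustrated outside JMeter with a minimal Java sketch. It assumes a request template containing ${name} placeholders, which is how JMeter references variables; the class name CrawlRequestTemplate, the endpoint and the parameter names are hypothetical and are not taken from the patent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CrawlRequestTemplate {

    // Replaces every ${name} placeholder in the template with the
    // value of the variable of the same name.
    static String resolve(String template, Map<String, String> variables) {
        String resolved = template;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            resolved = resolved.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Hypothetical variables: category, sort order and page number.
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("category", "jackets");
        variables.put("sort", "sales");
        variables.put("page", "1");

        // Hypothetical data interface; a real site's interface will differ.
        String template =
                "https://shop.example.com/api/goods?category=${category}&sort=${sort}&page=${page}";
        System.out.println(resolve(template, variables));
    }
}
```

Changing the category or the page range then means editing only the variable values, not the request itself.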

The configuration of the interface response is variable and must be set according to the actual scenario. For example, suppose the target data of the current page level includes the commodity title, introduction, address, price and time. A post-processor is added after the interface, and a JSON extractor, regular expression extractor or XPath extractor is selected to extract the parameters. Here the JSON extractor is used to extract the target parameters: JSON extractors are added according to the number of parameters to be extracted, and the title, introduction, address, price and time of all commodities are extracted in array format.

Take the parameter title as an example. The directory structure of the data is:

data -> [ {goodsId1, title1, content1, site1, price1, time1}, {goodsId2, title2, content2, site2, price2, time2}, ... ], and the format of the extractor is: $.data.

After extraction in this way, title_1 is the first title, title_2 is the second title, and so on until all commodity titles have been traversed; the commodity introduction, address, price and time are extracted in the same way. The XXX_1 form is JMeter's fixed format for extracted array data and needs no further elaboration.
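The naming scheme can be made concrete with a short Java sketch. It mimics the convention JMeter's JSON Extractor follows when it is asked for all matches (match number -1): each match is stored as name_1, name_2, and so on, and the number of matches is stored as name_matchNr. The class ExtractorVariables and the sample titles are hypothetical, and the sketch starts from an already extracted list rather than evaluating a JSON path, since the extractor expression is truncated in the source (an expression such as $.data[*].title would be one plausible form, but that is an assumption).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExtractorVariables {

    // Stores the matches as name_1 ... name_N and the count as name_matchNr,
    // following the naming convention described in the text.
    static Map<String, String> store(String name, List<String> matches) {
        Map<String, String> vars = new LinkedHashMap<>();
        for (int i = 0; i < matches.size(); i++) {
            vars.put(name + "_" + (i + 1), matches.get(i));
        }
        vars.put(name + "_matchNr", String.valueOf(matches.size()));
        return vars;
    }

    public static void main(String[] args) {
        // Hypothetical titles standing in for one page of the response.
        List<String> titles = Arrays.asList("red jacket", "blue jacket", "green jacket");
        store("title", titles).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```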

The output file and location for the crawled data should be defined before the save operation is performed, so that the directory and file name can be configured more flexibly. Specifically, user parameters or custom variables are added before the request is executed, and the file path and file name are configured there, for example the address variable path: E:\\crawlingtest.

The invention provides a simple anti-crawling mechanism: by giving the requests different intervals, it simulates the irregular requests made by a user at different times and prevents the requests from being blocked by the system. Specifically, a fixed timer is added under the request execution directory. The delay of the fixed timer is random and variable but always between 100 ms and 1 s, so each interface request waits for a random period before running. On the one hand, this prevents a flood of requests from reaching the server and putting it under pressure during batch crawling; on the other hand, it can bypass the time-based access limits set by simple websites, thereby achieving the anti-crawling effect.
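The timing behavior described above can be sketched in plain Java as a random pause of 100 ms to 1 s before each request. The class RandomDelay and its method names are hypothetical. Inside JMeter, one way to obtain a comparable effect would be to put the built-in function ${__Random(100,1000)} in the timer's delay field, although the patent does not state how its timer is configured, so that is an assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

final class RandomDelay {

    static final long MIN_DELAY_MS = 100;   // lower bound given in the text
    static final long MAX_DELAY_MS = 1000;  // upper bound given in the text (1 s)

    // Returns a delay in milliseconds, uniformly chosen from [100, 1000].
    static long nextDelayMillis() {
        // The upper bound of nextLong is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(MIN_DELAY_MS, MAX_DELAY_MS + 1);
    }

    // Blocks the calling thread for one random delay; call before each request.
    static void pauseBeforeRequest() throws InterruptedException {
        Thread.sleep(nextDelayMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        for (int request = 1; request <= 3; request++) {
            long delay = nextDelayMillis();
            Thread.sleep(delay);
            System.out.println("request " + request + " sent after " + delay + " ms");
        }
    }
}
```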

During data crawling, repeated execution of the same request must be prevented. Crawling usually involves large volumes of data spread over many pages, and when crawling content that depends on more complex page parameters through the front end, the corresponding JS components must be loaded, so the same request is easily executed repeatedly where complex front-end pages are involved. The invention therefore captures the target data from the back-end interface response, which avoids the repeated requests caused by the front end loading JS files again and again. Specifically, the target data and the target page count to be crawled are found by analyzing the interface response data. For example, if the first 10 pages of a shopping website sorted by sales are to be crawled, then totalpage is 10; alternatively, the total number of pages can be obtained directly from the interface so that all pages are crawled. Once totalpage is determined, the loop count can be set directly in a loop controller: for 10 pages of data, 10 requests are executed, and each request obtains all the target data of the current page. To prevent a request from being executed repeatedly, a counter with a maximum value is added under the loop controller and its increment is set to 1, so that the counter increases by one after each request. The counter value is output as num and, through interface parameter passing, num is passed to the page-number variable in the request. In this way the data of each page is requested exactly once until the whole loop ends.
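The interplay of loop controller and counter amounts to a bounded loop in which a counter supplies the page number. The following Java sketch is one reading of that logic: totalPage and num follow the names in the text, while the class PageLoop, the starting value of 1 and the URL prefix are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class PageLoop {

    // Builds one request per page. The loop runs totalPage times and the
    // counter num, incremented by 1 after each request, is used as the page number.
    static List<String> pageRequests(String urlPrefix, int totalPage) {
        List<String> requests = new ArrayList<>();
        int num = 1; // assumed starting value of the counter
        for (int iteration = 0; iteration < totalPage; iteration++) {
            requests.add(urlPrefix + num);
            num += 1; // increment of 1, as configured on the counter
        }
        return requests;
    }

    public static void main(String[] args) {
        // Hypothetical list interface of a shopping site.
        for (String request : pageRequests("https://shop.example.com/api/goods?page=", 10)) {
            System.out.println(request);
        }
    }
}
```

Because the page number comes only from the counter, no page is requested twice within one run.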

During data crawling, nested web pages also need to be configured so that their data can be crawled. A simple nested page may include a first-level page and a second-level page. For example, when searching for jackets on a shopping website, the search result list is the first-level page, and the detail page entered by clicking any result in the list is the second-level page.

The request address of the commodity detail page has the format: protocol://server address/path/goodsId.

If the commodity reviews under the commodity details need to be captured, the task is specifically one of crawling a nested web page. The first-level page interface is executed and all commodity identifiers (goodsId) in the current list are obtained through a JSON extractor; the result is an array. A ForEach logic controller is added under the directory level of the loop controller and the specific page-number request, with the goodsId array as its input and id as its output, so that id1 is the identifier of the first commodity and id2 is the identifier of the second commodity. The ForEach logic controller iterates over the identifier of each commodity and issues the corresponding commodity detail interface request, and the target data of the nested page is saved through JSON extraction and file output in the post-processor.

With the loop controller executing the requests automatically, the batch run continues until the loop ends, which avoids recording and saving data by hand and improves crawling efficiency. Result storage is implemented in Java code, which can be embedded directly in JMeter, so the implementation is simple. The target data to be crawled is extracted through the request loop control and counting described above together with the JSON extractor post-processor added to the interface; the crawled data is then processed and saved. Specifically, a post-processor, the BeanShell PostProcessor, is added under the request, and the parameter variables are obtained through the vars.get method. Because the data extractor stores an array, the array must be expanded in the BeanShell PostProcessor to obtain the target data, which is then saved in sequence to the file path configured above.
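The nested crawl is in effect two loops: an outer loop over list pages and an inner loop over the goodsId values found on each page. The Java sketch below shows only that control flow. The class NestedCrawl is hypothetical, the first-level interface is replaced by a caller-supplied function, and the detail address is formed by appending goodsId to a prefix, in line with the address format given earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntFunction;

final class NestedCrawl {

    // Outer loop: first-level pages 1..totalPage (the loop controller).
    // Inner loop: every goodsId on the page (the ForEach controller).
    // Returns the detail-page addresses in the order they would be requested.
    static List<String> crawl(int totalPage,
                              IntFunction<List<String>> firstLevelPage,
                              String detailPrefix) {
        List<String> detailRequests = new ArrayList<>();
        for (int page = 1; page <= totalPage; page++) {
            List<String> goodsIds = firstLevelPage.apply(page);
            for (String id : goodsIds) {
                detailRequests.add(detailPrefix + id);
            }
        }
        return detailRequests;
    }

    public static void main(String[] args) {
        // Stand-in for the first-level interface: two made-up ids per page.
        IntFunction<List<String>> fakeListPage =
                page -> Arrays.asList("g" + page + "01", "g" + page + "02");
        // Hypothetical detail address prefix.
        for (String url : crawl(2, fakeListPage, "https://shop.example.com/goods/")) {
            System.out.println(url);
        }
    }
}
```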

Output file configuration: File file = new File(vars.get("path"));

Determining the length of the crawled data array: vars.get("name_matchNr");

Looping output to the target file: for (int i = 1; i <= Integer.
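Since the loop statement above is cut off in the source, the following is only a sketch of how the three fragments might fit together, not the patent's actual script. It is written as standalone Java: a Map<String, String> stands in for JMeter's vars object (which offers a get method keyed by variable name), and the class ResultWriter, the method appendMatches and the one-value-per-line output format are assumptions.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Map;

final class ResultWriter {

    // Appends name_1 ... name_N to the file named by the "path" variable,
    // one value per line, where N is read from name_matchNr.
    // Returns the number of values written.
    static int appendMatches(Map<String, String> vars, String name) throws IOException {
        Path target = Paths.get(vars.get("path"));                  // output file configuration
        int count = Integer.parseInt(vars.get(name + "_matchNr")); // array length
        try (BufferedWriter writer = Files.newBufferedWriter(
                target, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (int i = 1; i <= count; i++) {                      // loop output
                writer.write(vars.get(name + "_" + i));
                writer.newLine();
            }
        }
        return count;
    }
}
```

Opening the file in append mode means successive pages add to the same file instead of overwriting it, which suits a loop that runs once per page.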

The method is particularly suitable for crawling data from a single page or a nested page of a website, and its core idea also applies to crawling any data content that is returned through interface responses, such as mobile apps and WeChat mini programs.

In conclusion, the invention avoids the large number of complex JS operations involved in crawling data from a front-end interface, avoids the lengthy crawling process or outright crawling failure caused by some websites limiting the number of HTTP requests and the access frequency within a given period, and improves the efficiency of data crawling.

Claims

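1. A JMeter-based method for automatically crawling website data, characterized by comprising the following steps:

step (1), determining a target website needing data crawling;

step (2), carrying out data analysis on the target website to obtain a data interface and attribute information corresponding to the data interface;

step (3), executing the data interface at the JMeter end, and checking whether the request parameters and the corresponding response results in the data interface conform to the expected settings; if so, entering step (4), otherwise debugging the data interface at the JMeter end;

step (4), dynamically configuring the parameters of the data interface, dynamically configuring the field parameters extracted from the response of the data interface, and dynamically configuring the output target file;

step (5), after setting the corresponding dynamic configuration, setting an anti-crawling mechanism;

step (6), crawling data in batches, and outputting and saving the data to the target file.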

2. The JMeter-based website data automatic crawling method as claimed in claim 1, wherein in step (2), the attribute information corresponding to the data interface comprises: request address, request parameters, request type, request header, and request body.

3. The JMeter-based website data auto-crawling method as claimed in claim 1, wherein in step (4), the dynamic configuration of the data interface comprises dynamic configuration of parameters in the form of variables.

4. The JMeter-based website data automatic crawling method as claimed in claim 1, wherein in step (4), the specific method for extracting the field parameters comprises: adding a post-processor after the data interface, and selecting a JSON extractor and/or a regular expression extractor and/or an XPath extractor to extract the parameters.

5. The JMeter-based website data automatic crawling method as claimed in claim 1, wherein in step (4), the specific method for dynamically configuring the output target file comprises: adding user parameters or custom variables before the request is executed, and configuring the file path and file name accordingly.

6. The JMeter-based website data automatic crawling method as claimed in claim 1, wherein in step (5), the specific method for setting the anti-crawling mechanism comprises: adding a fixed timer under the request execution directory, the delay of the fixed timer being random and variable but always between 100 ms and 1 s, so that each interface request waits for a random period before running; by giving the requests different intervals, the irregular requests made by a user at different times are simulated, which prevents the requests from being blocked by the system.

7. The JMeter-based website data automatic crawling method as claimed in claim 4, wherein in step (6), the process of crawling data in batches further comprises preventing the same request from being executed repeatedly, and the specific method comprises: finding the target data and the target page count to be crawled by analyzing the interface response data; placing a loop controller at the page-number request level and setting its loop count according to the target page count; adding a counter under the loop controller with its increment set to 1, so that the counter increases by one each time the request is executed; and ending execution when the output value of the counter equals the target page count.

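8. The JMeter-based website data automatic crawling method as claimed in claim 7, wherein in step (6), the process of crawling data in batches further comprises crawling of nested web page data, the nested page comprises a first-level page and a second-level page, and the specific method for crawling the nested web page data comprises:

step 601, executing the first-level page interface, and obtaining all commodity identifiers in the current page list through a JSON extractor to obtain a commodity identifier array;

step 602, adding a ForEach logic controller under the directory level of the loop controller and the page request, wherein the input of the logic controller is the commodity identifier array and its output is the specific identifier of each commodity;

step 603, iterating over the specific identifier of each commodity and the corresponding commodity detail interface request through the ForEach logic controller, and completing the saving of the nested page's target data through JSON extraction and file output in the post-processor.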

9. The JMeter-based website data automatic crawling method as claimed in claim 8, wherein in step (6), the specific method for crawling data in batches and outputting and saving the data comprises: adding a post-processor, the BeanShell PostProcessor, under the request data level, obtaining the parameters through the vars.get method, expanding the commodity identifier array in the BeanShell PostProcessor to obtain the target data, and saving the target data to the target file in sequence.