Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a system and a method for automatically collecting webpage data.
In order to achieve this purpose, the invention adopts the following technical scheme:
A system for automatically collecting webpage data comprises an embedded browser, an API interface, a script engine module and a process control module, wherein the API interface, the script engine module and the process control module are each embedded in the embedded browser. The embedded browser may adopt an IE kernel, a Chrome kernel or another browser kernel.
Preferably, the script engine module is used for loading a JS script; the JS script contains custom JS functions for operating the webpage. After the webpage data are loaded into computer memory, the JS script is loaded into the script engine module and is used to execute the custom JS functions at the memory address of the current page, thereby supporting the webpage data collection process.
Preferably, the process control module is configured to hold and execute batch processing commands and thereby execute a preconfigured data collection flow.
Preferably, each batch processing command is a click of a query button, a page jump, or a collection of webpage data.
Preferably, the script engine module and the process control module are further used in combination to simulate a user entering a user name and a password on a webpage with login restrictions, simulate the user's click behavior, and pass login verification.
According to another aspect of the present invention, there is also provided a method for automatically collecting webpage data, comprising the following steps:
Step S10: a platform database issues a specified data collection request;
Step S20: accessing the website to be collected: the embedded browser receives the specified data collection request, accesses the specified website to be collected, receives a page loading event after successful access, and acquires the memory address of the page once loading is complete;
Step S30: loading a JS script: the script engine module loads a JS script for the current page and executes custom JS functions at the memory address of the current page;
Step S40: executing a preconfigured data collection flow: the process control module executes the batch processing commands step by step according to the preconfigured flow and collects the specified data from the preconfigured pages;
Step S50: uploading the collection result: the collected specified data are uploaded to the platform database through a network.
Preferably, in step S20, when the specified website to be collected has login restrictions, the script engine module and the process control module simulate a user entering a user name and a password, simulate the user's click behavior, and pass login verification.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention adds a script engine module and a process control module to an embedded browser, and the two modules in combination achieve automatic access to, and collection from, specified webpages. Both the collection flow and the collected content on a specific webpage can be customized through the process control module, so the invention is suited to precise or special processing of the data of specific webpages, and in particular can accurately collect data from tax websites;
(2) For webpages with login restrictions, the script engine module and the process control module can be used to simulate a user entering a user name and a password and to simulate the user's click behavior, so that automatic data collection is performed after login verification is passed.
Detailed Description
To provide a further understanding of the objects, structures, features, and functions of the present invention, embodiments are described in detail below.
Example 1: Referring to fig. 1, which is a structural diagram of a system for automatically collecting webpage data according to embodiment 1 of the present invention, the system includes an embedded browser 1, an API interface 2, a script engine module 3 and a process control module 4, the API interface 2, the script engine module 3 and the process control module 4 each being embedded in the embedded browser 1. The system combines the script engine module 3 and the process control module 4 to jointly achieve access to specified webpages and collection of specified data.
Preferably, the script engine module 3 is used for loading a JS script. The JS script comprises custom JS functions for operating the webpage, and actions performed on the webpage are carried out by interpreting and executing the JS script. After the webpage data are loaded into computer memory, the JS script is loaded into the script engine module 3 and is used to execute the custom JS functions at the memory address of the current page, thereby supporting the webpage data collection process. The script engine module 3 thus gives the system the capability of executing custom JS functions at the memory address of the current page: after the webpage has loaded, the script engine module 3 obtains the memory address of the current page and uses the JS script to simulate the user's various click operations and to read the content of DOM elements (i.e., the objects and elements on the webpage).
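By way of non-limiting illustration, the role of the script engine module 3 may be sketched as follows. The sketch assumes that the embedded browser exposes some script-execution entry point, represented here by a hypothetical callback evaluate_js; the function names clickElement and readText, the class ScriptEngine and the helper build_call are likewise illustrative names not taken from the invention. Only standard DOM calls (document.querySelector, click, textContent) appear in the JS text.

```python
import json

# Hypothetical library of custom JS functions that would be injected into the
# loaded page. The function names and bodies are illustrative assumptions.
CUSTOM_JS_LIBRARY = """
function clickElement(selector) {
  var el = document.querySelector(selector);
  if (el) { el.click(); }
  return el !== null;
}
function readText(selector) {
  var el = document.querySelector(selector);
  return el ? el.textContent.trim() : null;
}
"""


def build_call(function_name, *args):
    """Return a JS expression calling a custom function with JSON-encoded arguments."""
    encoded = ", ".join(json.dumps(arg) for arg in args)
    return f"{function_name}({encoded})"


class ScriptEngine:
    """Loads the JS library once per page, then evaluates calls via a host callback."""

    def __init__(self, evaluate_js):
        # evaluate_js stands in for the embedded browser's script-execution API
        # (an assumption; the real entry point depends on the browser kernel).
        self._evaluate_js = evaluate_js
        self._loaded = False

    def on_page_loaded(self):
        # Called when the page loading event is received.
        self._evaluate_js(CUSTOM_JS_LIBRARY)
        self._loaded = True

    def call(self, function_name, *args):
        if not self._loaded:
            raise RuntimeError("JS script has not been loaded for the current page")
        return self._evaluate_js(build_call(function_name, *args))
```

In this sketch the arguments are encoded with json.dumps so that selectors containing quotation marks are passed to the page as valid JS string literals rather than being concatenated by hand.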
Preferably, the process control module 4 is configured to hold and execute batch processing commands and thereby execute a preconfigured data collection flow; each command may be a click of a query button, a page jump, or a collection of webpage data. A traditional automatic collection system only collects page data in batches according to a fixed collection algorithm and cannot apply different special processing to different pages. The process control module 4, by contrast, supports custom control of the flow and arbitrary customization of the collected content, which gives greater flexibility and is particularly advantageous for accurately collecting tax website data.
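By way of non-limiting illustration, a preconfigured flow of batch processing commands may be sketched as follows. The command layout (an action, a target and an optional result field), the class names Command and ProcessControl, and the three handler callbacks are assumptions made for the sketch; in a real system the handlers would delegate to the script engine module 3 and the embedded browser 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Command:
    """One batch processing command (field layout is an illustrative assumption)."""
    action: str      # "click", "jump", or "collect"
    target: str      # element selector for click/collect, URL for jump
    field: str = ""  # key under which a collected value is stored


class ProcessControl:
    """Executes a preconfigured list of commands one step at a time."""

    def __init__(self,
                 click: Callable[[str], None],
                 jump: Callable[[str], None],
                 collect: Callable[[str], str]):
        self._handlers = {"click": click, "jump": jump, "collect": collect}

    def run(self, commands: List[Command]) -> Dict[str, str]:
        results: Dict[str, str] = {}
        for step, command in enumerate(commands, start=1):
            handler = self._handlers.get(command.action)
            if handler is None:
                raise ValueError(f"step {step}: unknown action {command.action!r}")
            value = handler(command.target)
            if command.action == "collect":
                # Only "collect" commands contribute to the collection result.
                results[command.field] = value
        return results
```

Because the flow is plain data, a different page can be handled by supplying a different command list without changing the executor, which is the flexibility the process control module 4 is described as providing.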
A traditional automatic collection system cannot collect data from webpages with login restrictions, which is a significant limitation. The script engine module 3 and the process control module 4 are therefore also used in combination to simulate a user entering a user name and a password on a webpage with login restrictions, simulate the user's click behavior, and pass login verification.
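By way of non-limiting illustration, the simulated login may be sketched as follows. The page interface (set_value, click, exists), the default selectors and the in-memory FakeLoginPage are all hypothetical and serve only to make the sketch runnable; the invention does not specify how success of the login verification is detected, so the check for a logout element is an assumption.

```python
class FakeLoginPage:
    """In-memory stand-in for a login page, used only to exercise the sketch."""

    def __init__(self, valid_user, valid_password):
        self._valid = (valid_user, valid_password)
        self._fields = {}
        self._logged_in = False

    def set_value(self, selector, value):
        # Simulates the user typing into an input element.
        self._fields[selector] = value

    def click(self, selector):
        # Simulates the user's click; only the login button has an effect here.
        if selector == "#login":
            entered = (self._fields.get("#username"), self._fields.get("#password"))
            self._logged_in = entered == self._valid

    def exists(self, selector):
        # The logout element is present only after a successful login.
        return selector == "#logout" and self._logged_in


def simulate_login(page, username, password, *,
                   user_selector="#username",
                   password_selector="#password",
                   submit_selector="#login",
                   success_selector="#logout"):
    """Enter the credentials, click the login button, and report whether login passed."""
    page.set_value(user_selector, username)
    page.set_value(password_selector, password)
    page.click(submit_selector)
    return page.exists(success_selector)
```

For example, simulate_login(FakeLoginPage("alice", "s3cret"), "alice", "s3cret") reports a passed login, while a wrong password does not.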
Example 2: According to another aspect of the present invention, a method for automatically collecting webpage data is further provided. Referring to fig. 2, which is a flowchart of the method for automatically collecting webpage data according to embodiment 2 of the present invention, the method includes the following steps:
Step S10: a platform database issues a specified data collection request;
Step S20: accessing the website to be collected: the embedded browser 1 receives the specified data collection request, accesses the specified website to be collected, receives a page loading event after successful access, and acquires the memory address of the page once loading is complete;
Step S30: loading the JS script: the script engine module 3 loads a JS script for the current page and executes custom JS functions at the memory address of the current page;
Step S40: executing a preconfigured data collection flow: the process control module 4 executes the batch processing commands step by step according to the preconfigured flow and collects the specified data from the preconfigured pages;
Step S50: uploading the collection result: the collected specified data are uploaded to the platform database through a network.
Preferably, in step S20, when the specified website to be collected has login restrictions, the script engine module 3 and the process control module 4 simulate a user entering a user name and a password, simulate the user's click behavior, and pass login verification.
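By way of non-limiting illustration, the order of steps S10 to S50 may be sketched as follows. The names AcquisitionRequest, InMemoryPage and run_acquisition, the request layout, and the open_page, upload and login callbacks are assumptions introduced for the sketch; a real system would back them with the embedded browser 1, the platform database and the network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class AcquisitionRequest:
    """S10: what the platform database asks for (layout is an assumption)."""
    url: str
    selectors: Dict[str, str]  # result key -> selector of the element to read


class InMemoryPage:
    """Stand-in for a loaded page; a real system would wrap the embedded browser."""

    def __init__(self, elements: Dict[str, str]):
        self._elements = elements
        self.script_loaded = False

    def load_script(self) -> None:
        self.script_loaded = True

    def collect(self, selector: str) -> str:
        if not self.script_loaded:
            raise RuntimeError("S30 must run before S40")
        return self._elements[selector]


def run_acquisition(request: AcquisitionRequest,
                    open_page: Callable[[str], InMemoryPage],
                    upload: Callable[[Dict[str, str]], None],
                    login: Optional[Callable[[InMemoryPage], None]] = None
                    ) -> Dict[str, str]:
    page = open_page(request.url)   # S20: access the site and wait for the load event
    if login is not None:
        login(page)                 # S20 (optional): pass login verification
    page.load_script()              # S30: load the JS script for the current page
    data = {key: page.collect(selector)          # S40: collect the specified data
            for key, selector in request.selectors.items()}
    upload(data)                    # S50: upload the result to the platform database
    return data
```

The sketch keeps each step behind a callback so that the ordering of S20 to S50 is fixed in one place while the browser, login handling and upload transport remain replaceable.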
Example 3: The system and method for automatically collecting webpage data of the invention have a wide range of application scenarios. For example, they can be applied to collecting webpage data from a tax website in order to provide intelligent finance and tax services to a client: the tax bureau website is logged into using account information provided by the client, and the client's basic information and financial information on the tax website are collected. The collected data provide data support for the intelligent finance and tax services and for value-added services offered to the client, such as automatic tax filing and risk assessment.
Taking the collection of data from a tax website as an example, the workflow of the application program is described below.
The first step: the embedded browser accesses the tax website, receives a page loading event after successful access, and acquires the memory address of the page once loading is complete.
The second step: the JS script is loaded for the current page through the script engine. The script engine provides the ability to execute custom JS functions at the memory address of the current page.
The third step: the process control module executes the batch processing commands step by step according to the preconfigured flow and collects element data on the preconfigured (specified) pages, thereby realizing a user-defined flow.
The fourth step: the collected specified data are uploaded through a network to the company's platform database.
Wherein:
The script engine: a program module that loads the JS script; actions performed on the webpage are carried out by interpreting and executing the JS script. The JS script contains various custom JS functions for operating the webpage. The JS script file is stored on the hard disk; after the webpage is loaded into memory, the JS script file is loaded into the script engine module, where its custom JS functions are executed to support the collection flow.
The process control module: mainly used to hold and execute batch processing commands, each of which may be a click of a query button, a page jump, or a collection of data on the page.
The system for automatically collecting webpage data adds the script engine module 3 and the process control module 4 to the embedded browser 1, and the two modules in combination achieve automatic access to, and collection from, specified webpages. Both the collection flow and the collected content on a specific webpage can be customized through the process control module 4, so the system is suited to precise or special processing of the data of specific webpages, and in particular can accurately collect data from tax websites. For webpages with login restrictions, the invention can use the script engine module 3 and the process control module 4 to simulate a user entering a user name and a password and to simulate the user's click behavior, so that automatic data collection is performed after login verification is passed.
The present invention has been described with reference to the above embodiments, which are only examples of implementing the invention. It should be noted that the disclosed embodiments do not limit the scope of the invention. Rather, modifications and variations made without departing from the spirit and scope of the invention fall within its scope of protection.