CN110489698B - System and method for automatically collecting webpage data - Google Patents


Info

Publication number
CN110489698B
Authority
CN
China
Prior art keywords
webpage
data
script
control module
script engine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910757991.0A
Other languages
Chinese (zh)
Other versions
CN110489698A (en)
Inventor
李沁
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloud Accounting Room Network Technology Co ltd
Original Assignee
Cloud Accounting Room Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloud Accounting Room Network Technology Co ltd
Priority to CN201910757991.0A
Publication of CN110489698A
Application granted
Publication of CN110489698B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a system and a method for automatically collecting webpage data. The system comprises an embedded browser, an API (application program interface), a script engine module and a process control module; the script engine module and the process control module work together to access a specified webpage and collect specified data. The script engine module gives the system the ability to execute custom JS functions in the memory address of the current page: the memory address of the current page is obtained after the webpage is loaded, and the JS script is used to simulate the user's various click operations. The process control module allows the content collected from a specific page to be customized. The system is therefore suited to accurately processing the data of a specific webpage or handling a specific webpage specially, and in particular can accurately collect data from a tax website. Both the collection flow and the collected content can be customized.

Description

System and method for automatically collecting webpage data
Technical Field
The invention relates to the technical field of website data acquisition, in particular to a system and a method for automatically acquiring webpage data.
Background
At present, webpage data on the internet is mainly captured by a scheduling program (crawler) that downloads webpages, stores them in a database, and then collects, summarizes and classifies the stored information according to a specific computation strategy, which is either depth-first or breadth-first. Hundreds of spider crawlers use this approach, which can automatically acquire webpage data in large batches. However, because a crawler's crawling policy is generic, it cannot accurately process the data of a specific webpage or handle a specific webpage specially; in particular, it cannot accurately collect data from a tax website.
Disclosure of Invention
The invention aims to solve the defects in the prior art and provides a system and a method for automatically collecting webpage data.
In order to realize the purpose, the invention adopts the following technical scheme:
A system for automatically collecting webpage data comprises an embedded browser, an API interface, a script engine module and a process control module, wherein the API interface, the script engine module and the process control module are each embedded in the embedded browser. The embedded browser uses an IE kernel, a Chrome kernel, or another browser kernel.
Preferably, the script engine module is used for loading the JS script; the JS script contains a custom JS function for operating the webpage, and after webpage data are loaded into a computer memory, the JS script is loaded into the script engine module and used for executing the custom JS function in a memory address of the current page so as to support a webpage data acquisition process.
Preferably, the process control module is configured to carry and execute batch commands, thereby executing a preconfigured data collection flow.
Preferably, each batch command is a click on a query button, a page jump, or a collection of webpage data.
Preferably, the script engine module and the process control module are further used in combination to simulate a user entering a username and password on a login-restricted webpage, simulate the user's click behavior, and pass login verification.
According to another aspect of the present invention, there is also provided a method for automatically collecting web page data, including the following steps:
Step S10: a platform database issues a request to collect specified data;
Step S20: logging in to the website to be collected: the embedded browser receives the request, accesses the specified website, receives a page-load event once access succeeds, and obtains the memory address of the page after loading completes;
Step S30: loading the JS script: the script engine module loads a JS script for the current page and executes custom JS functions in the memory address of the current page;
Step S40: executing the preconfigured data collection flow: the process control module executes the batch commands step by step according to the preconfigured flow and collects the specified data from the preconfigured pages;
Step S50: uploading the collection result: the collected data is uploaded to the platform database over the network.
Preferably, in step S20, when the specified website requires login, the script engine module and the process control module simulate a user entering a username and password, simulate the user's click behavior, and pass login verification.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention adds a script engine module and a process control module to an embedded browser. Together the two modules automate access to, and collection from, a specified webpage, and the process control module allows the content collected from a specific page to be customized. The system is therefore suited to accurately processing the data of a specific webpage or handling a specific webpage specially, and in particular can accurately collect data from a tax website. Both the collection flow and the collected content can be customized.
(2) For webpages that require login, the script engine module and the process control module can simulate a user entering a username and password and simulate the user's click behavior, so that automatic data collection proceeds once login verification is passed.
Drawings
Fig. 1 is a structural diagram of a system for automatically collecting web page data according to embodiment 1 of the present invention;
Fig. 2 is a flowchart of a method for automatically collecting web page data according to embodiment 1 of the present invention.
Reference numerals: 1, embedded browser; 2, API interface; 3, script engine module; 4, process control module.
Detailed Description
In order to further understand the objects, structures, features, and functions of the present invention, the following embodiments are described in detail.
Example 1: referring to Fig. 1, which is a structural diagram of the system for automatically collecting web page data according to embodiment 1 of the present invention, the system comprises an embedded browser 1, an API interface 2, a script engine module 3 and a process control module 4; the API interface 2, the script engine module 3 and the process control module 4 are each embedded in the embedded browser 1. The system combines the script engine module 3 and the process control module 4 to access the specified webpage and collect the specified data.
Preferably, the script engine module 3 is used to load the JS script. The JS script contains custom JS functions for operating on the webpage, and every action performed on the webpage is carried out by interpreting and executing the JS script. After the webpage data has been loaded into the computer's memory, the JS script is loaded into the script engine module 3 so that the custom JS functions can be executed in the memory address of the current page, supporting the webpage data collection process. The script engine module 3 thus gives the system the ability to execute custom JS functions in the memory address of the current page: it obtains the memory address of the current page after the webpage has loaded, uses the JS script to simulate the user's various click operations, and reads the content of DOM elements (i.e., the objects and elements on the webpage).
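The role of the script engine module can be illustrated with a minimal sketch. It is not the patented implementation: Python callables stand in for the custom JS functions, and the `Page` and `ScriptEngine` classes, the function names `click` and `readText`, and the example URL are all hypothetical. The point it shows is only the ordering: the engine is attached to a page after loading, and registered functions then run against that page.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Page:
    """Stand-in for a loaded page: element id -> text content (hypothetical model)."""
    url: str
    elements: Dict[str, str] = field(default_factory=dict)
    clicks: List[str] = field(default_factory=list)


class ScriptEngine:
    """Holds custom functions and runs them against the currently loaded page."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[..., object]] = {}
        self._page: Optional[Page] = None

    def attach(self, page: Page) -> None:
        # Called once the page has finished loading.
        self._page = page

    def load(self, name: str, fn: Callable[..., object]) -> None:
        # Register one custom function under a name.
        self._functions[name] = fn

    def call(self, name: str, *args: object) -> object:
        if self._page is None:
            raise RuntimeError("no page has been loaded yet")
        return self._functions[name](self._page, *args)


def click(page: Page, element_id: str) -> None:
    # Simulated click: only recorded, no real event is dispatched.
    page.clicks.append(element_id)


def read_text(page: Page, element_id: str) -> str:
    # Read the content of one element of the page.
    return page.elements[element_id]


page = Page("https://example.invalid/query", {"btn-query": "Query", "result": "42.00"})
engine = ScriptEngine()
engine.attach(page)
engine.load("click", click)
engine.load("readText", read_text)
engine.call("click", "btn-query")
value = engine.call("readText", "result")
```

In a real embedded browser the registered functions would be JS source evaluated in the page's own context; the dictionary lookup here only mimics that dispatch.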
Preferably, the process control module 4 is configured to carry and execute batch commands, thereby executing a preconfigured data collection flow; each command may be a click on a query button, a page jump, or a collection of webpage data. A traditional automatic collection system only collects page data in batches according to a fixed collection algorithm and cannot apply different special handling to different pages. The process control module 4, by contrast, supports custom control of the flow and arbitrary customization of the collected content, which makes it more flexible and gives it a particular advantage in accurately collecting tax website data.
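A preconfigured flow of batch commands can be sketched as plain data that is interpreted one command at a time. The sketch below is an assumption-laden illustration: the `Command` record, the `FakeBrowser` class, the page path `/invoices` and the element ids are invented for the example and do not come from the patent. Only the three command kinds (click, jump, collect) follow the description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Command:
    kind: str    # "click", "jump", or "collect"
    target: str  # element id for click/collect, page address for jump


class FakeBrowser:
    """In-memory stand-in for the embedded browser (hypothetical)."""

    def __init__(self, pages: Dict[str, Dict[str, str]]) -> None:
        self.pages = pages
        self.current = ""
        self.log: List[str] = []

    def jump(self, url: str) -> None:
        if url not in self.pages:
            raise KeyError(f"unknown page: {url}")
        self.current = url
        self.log.append(f"jump {url}")

    def click(self, element_id: str) -> None:
        self.log.append(f"click {element_id}")

    def collect(self, element_id: str) -> str:
        return self.pages[self.current][element_id]


def run_flow(browser: FakeBrowser, flow: List[Command]) -> Dict[str, str]:
    """Execute the commands in order and return the collected values."""
    collected: Dict[str, str] = {}
    for cmd in flow:
        if cmd.kind == "jump":
            browser.jump(cmd.target)
        elif cmd.kind == "click":
            browser.click(cmd.target)
        elif cmd.kind == "collect":
            collected[cmd.target] = browser.collect(cmd.target)
        else:
            raise ValueError(f"unsupported command: {cmd.kind}")
    return collected


browser = FakeBrowser({"/invoices": {"total": "1280.50", "period": "2019-07"}})
flow = [
    Command("jump", "/invoices"),
    Command("click", "btn-query"),
    Command("collect", "total"),
    Command("collect", "period"),
]
result = run_flow(browser, flow)
```

Keeping the flow as data rather than code is what makes it configurable per page: a different site only needs a different command list, not a different interpreter.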
The traditional automatic acquisition system cannot acquire data of a webpage with login limitation, and has great limitation. The script engine module 3 and the process control module 4 are combined together and are also used for simulating a user to input a user name and a password on a webpage with limited login, simulating the clicking behavior of the user and passing login verification.
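The simulated login can be pictured as three scripted actions: fill the username field, fill the password field, click the login button. The following sketch assumes a toy `LoginPage` with a single hard-coded account; the element ids `username`, `password` and `btn-login`, and the credentials, are hypothetical and serve only to make the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LoginPage:
    """Toy login page that accepts one hard-coded account (illustrative only)."""
    fields: Dict[str, str] = field(default_factory=dict)
    logged_in: bool = False

    def set_value(self, element_id: str, value: str) -> None:
        # Simulates typing into an input element.
        self.fields[element_id] = value

    def click(self, element_id: str) -> None:
        # Simulates a click; only the login button has an effect here.
        if element_id == "btn-login":
            self.logged_in = (
                self.fields.get("username") == "demo"
                and self.fields.get("password") == "secret"
            )


def simulate_login(page: LoginPage, username: str, password: str) -> bool:
    """Fill in the credentials, click the login button, report the outcome."""
    page.set_value("username", username)
    page.set_value("password", password)
    page.click("btn-login")
    return page.logged_in


ok = simulate_login(LoginPage(), "demo", "secret")
rejected = simulate_login(LoginPage(), "demo", "wrong")
```

Collection would continue only when the login step reports success; how a real site signals success (redirect, cookie, page element) is site-specific and not covered by this sketch.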
Example 2: according to another aspect of the present invention, a method for automatically collecting web page data is further provided. Referring to Fig. 2, which is a flowchart of the method according to embodiment 1 of the present invention, the method includes the following steps:
Step S10: a platform database issues a request to collect specified data;
Step S20: logging in to the website to be collected: the embedded browser 1 receives the request, accesses the specified website, receives a page-load event once access succeeds, and obtains the memory address of the page after loading completes;
Step S30: loading the JS script: the script engine module 3 loads a JS script for the current page and executes custom JS functions in the memory address of the current page;
Step S40: executing the preconfigured data collection flow: the process control module 4 executes the batch commands step by step according to the preconfigured flow and collects the specified data from the preconfigured pages;
Step S50: uploading the collection result: the collected data is uploaded to the platform database over the network.
Preferably, in step S20, when the specified website requires login, the script engine module 3 and the process control module 4 simulate a user entering a username and password, simulate the user's click behavior, and pass login verification.
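Steps S10 to S50 can be strung together in a short end-to-end sketch. Everything concrete in it is assumed for illustration: the request shape, the `collect_once` function, the in-memory `SITE` and `uploaded` dictionaries standing in for the website and the platform database, and the example URL. Login handling is omitted.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Dict[str, str]
ScriptFn = Callable[[Page, str], Optional[str]]


def collect_once(
    request: Dict[str, str],
    load_page: Callable[[str], Page],
    scripts: Dict[str, ScriptFn],
    flow: List[Tuple[str, str]],
    upload: Callable[[str, Dict[str, str]], None],
) -> Dict[str, str]:
    """Run one collection request through the S20 to S50 stages."""
    page = load_page(request["url"])        # S20: access the site, page is loaded
    functions = dict(scripts)               # S30: custom functions made available
    data: Dict[str, str] = {}
    for name, arg in flow:                  # S40: run the preconfigured commands
        outcome = functions[name](page, arg)
        if outcome is not None:
            data[arg] = outcome
    upload(request["id"], data)             # S50: send the result to the platform
    return data


# Hypothetical stand-ins for the website and the platform database.
SITE: Dict[str, Page] = {
    "https://tax.example.invalid/summary": {"taxpayer": "ACME Ltd", "vat_due": "310.00"}
}
uploaded: Dict[str, Dict[str, str]] = {}

scripts: Dict[str, ScriptFn] = {
    "click": lambda page, element_id: None,          # click has no collected value
    "read": lambda page, element_id: page[element_id],
}
flow = [("click", "btn-query"), ("read", "taxpayer"), ("read", "vat_due")]

request = {"id": "req-1", "url": "https://tax.example.invalid/summary"}  # S10
data = collect_once(request, SITE.__getitem__, scripts, flow, uploaded.__setitem__)
```

Passing the page loader and the uploader in as callables keeps the stages separable, so the same flow could be exercised against a test double or a real browser binding.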
Example 3: the system and method for automatically collecting webpage data have a wide range of application scenarios. For example, they can be used to collect webpage data from a tax website in order to provide intelligent finance and tax services to a client: the system logs in to the tax bureau website with account information provided by the client and collects the client's basic information and financial and tax data from the site. This data supports the intelligent finance and tax services and enables value-added services for the client such as automatic tax filing and risk assessment.
Taking the collection of tax website data as an example, the workflow of the application is as follows.
Step 1: the embedded browser accesses the tax website, receives a page-load event once access succeeds, and obtains the memory address of the page after loading completes.
Step 2: the script engine loads the JS script for the current page. The script engine provides the ability to execute custom JS functions in the memory address of the current page.
Step 3: the process control module executes the batch commands step by step according to the preconfigured flow and collects element data from the preconfigured (specified) pages, thereby realizing a user-defined flow.
Step 4: the collected data is uploaded over the network to the company's platform database.
Wherein:
Script engine: a program module that loads the JS script; actions on the webpage are carried out by interpreting and executing the JS script. The JS script contains the various custom JS functions for operating on the webpage. The script file is stored on the hard disk; after the webpage has been loaded into memory, the JS script file is loaded into the script engine module, where its custom JS functions are executed to support the collection process.
Process control module: mainly used to carry and execute batch commands; each command may be a click on a query button, a page jump, or a collection of data on the page.
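The hand-off described above, where a script file on disk is loaded once the page is in memory, can be sketched with a page-load callback. The `EmbeddedBrowser` class, its `on_page_loaded` and `execute_script` methods, the file name `collect.js` and the one-line JS body are all hypothetical; a real browser kernel exposes its own load event and script-execution call.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


class EmbeddedBrowser:
    """Minimal stand-in that fires page-load callbacks (hypothetical)."""

    def __init__(self) -> None:
        self._on_load: List[Callable[[str], None]] = []
        self.injected: List[str] = []

    def on_page_loaded(self, callback: Callable[[str], None]) -> None:
        self._on_load.append(callback)

    def navigate(self, url: str) -> None:
        # A real browser would fetch and render; here the load event fires at once.
        for callback in self._on_load:
            callback(url)

    def execute_script(self, source: str) -> None:
        # Records the script instead of evaluating it in a page context.
        self.injected.append(source)


def install_script_loader(browser: EmbeddedBrowser, script_path: Path) -> None:
    """Read the script from disk and hand it to the browser on every page load."""

    def handle_load(url: str) -> None:
        source = script_path.read_text(encoding="utf-8")
        browser.execute_script(source)

    browser.on_page_loaded(handle_load)


with tempfile.TemporaryDirectory() as tmp:
    script_file = Path(tmp) / "collect.js"
    script_file.write_text(
        "function readText(id) { return document.getElementById(id).textContent; }",
        encoding="utf-8",
    )
    browser = EmbeddedBrowser()
    install_script_loader(browser, script_file)
    browser.navigate("https://tax.example.invalid/home")
    injected = list(browser.injected)
```

Reading the file inside the callback, rather than once at start-up, means an updated script on disk takes effect on the next page load; whether the original system behaves this way is not stated in the source.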
The system for automatically collecting webpage data adds the script engine module 3 and the process control module 4 to the embedded browser 1. Together the two modules automate access to, and collection from, the specified webpage, and the process control module 4 allows the content collected from a specific page to be customized. The system is therefore suited to accurately processing the data of a specific webpage or handling a specific webpage specially, and in particular can accurately collect data from a tax website. Both the collection flow and the collected content can be customized. For webpages that require login, the script engine module 3 and the process control module 4 can simulate a user entering a username and password and simulate the user's click behavior, so that automatic data collection proceeds once login verification is passed.
The present invention has been described in relation to the above embodiments, which are only exemplary of the implementation of the present invention. It should be noted that the disclosed embodiments do not limit the scope of the invention. Rather, it is intended that all such modifications and variations be included within the spirit and scope of this invention.

Claims (2)

1. A system for automatically collecting webpage data is characterized in that: the system comprises an embedded browser, an API (application program interface), a script engine module and a process control module, wherein the API, the script engine module and the process control module are embedded into the embedded browser; the script engine module is used for loading the JS script; the JS script comprises a user-defined JS function for operating the webpage, and after webpage data are loaded into a memory of the computer, the JS script is loaded into the script engine module and used for executing the user-defined JS function in the memory address of the current webpage and supporting the webpage data acquisition process; the flow control module is used for bearing and executing a batch processing command and executing a pre-configured data acquisition flow; the batch processing command is the click of a query button, the jump of a page or the acquisition of webpage data; the script engine module and the process control module are combined together and are further used for simulating a user to input a user name and a password on a webpage with limited login, simulating a user clicking behavior and passing login verification.
2. A method for automatically collecting webpage data is characterized in that: the method comprises the following steps:
step S10: a platform database issues a specified data acquisition request;
step S20: logging in a website to be collected: the embedded browser receives a specified data acquisition request and accesses a specified website to be acquired, receives a page loading event after successful access, and simultaneously acquires a memory address after page loading is completed; when the specified website to be collected has login limitation, the script engine module and the process control module simulate a user to input a user name and a password, simulate a user clicking behavior and pass login verification;
step S30: loading the JS script: the script engine module loads a JS script for the current page and executes a custom JS function in the memory address of the current page;
step S40: executing a preconfigured data collection procedure: the flow control module executes a batch processing command according to a preconfigured flow, executes the batch processing command step by step according to the batch processing execution flow, and acquires specified data from a preconfigured page;
step S50: uploading an acquisition result: and uploading the acquired specified data to the platform database through a network.
CN201910757991.0A 2019-08-16 2019-08-16 System and method for automatically collecting webpage data Active CN110489698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910757991.0A CN110489698B (en) 2019-08-16 2019-08-16 System and method for automatically collecting webpage data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910757991.0A CN110489698B (en) 2019-08-16 2019-08-16 System and method for automatically collecting webpage data

Publications (2)

Publication Number Publication Date
CN110489698A CN110489698A (en) 2019-11-22
CN110489698B CN110489698B (en) 2023-03-21

Family

ID=68551390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910757991.0A Active CN110489698B (en) 2019-08-16 2019-08-16 System and method for automatically collecting webpage data

Country Status (1)

Country Link
CN (1) CN110489698B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538482A (en) * 2020-03-18 2020-08-14 中国平安人寿保险股份有限公司 Webpage making method and device, computer equipment and storage medium
CN112364267B (en) * 2020-10-21 2023-04-07 杭州大搜车汽车服务有限公司 Front-end data acquisition method and device
CN113342629B (en) * 2021-06-08 2023-03-07 微民保险代理有限公司 Operation track restoration method and device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8065410B1 (en) * 2004-03-31 2011-11-22 Compuware Corporation Methods and apparatus for collecting performance metrics from a web site
CN103226474A (en) * 2013-05-10 2013-07-31 北京奇虎科技有限公司 Method, device and system for interaction between webpage script and browser program
CN106649567A (en) * 2016-11-15 2017-05-10 杭州安恒信息技术有限公司 Web crawler system based on browser kernel
CN106897357A (en) * 2017-01-04 2017-06-27 北京京拍档科技股份有限公司 A kind of method for crawling the network information for band checking distributed intelligence
CN109213948A (en) * 2018-10-18 2019-01-15 网宿科技股份有限公司 Webpage loading method, intermediate server and webpage loading system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8462918B2 (en) * 2009-11-25 2013-06-11 Soundbite Communications, Inc. Method and system for managing interactive communications campaigns with text messaging
US10031971B2 (en) * 2013-01-09 2018-07-24 NetSuite Inc. System and methods for optimizing the response to a request for dynamic web content

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
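A data collection method for the deep web; Chen Xin et al.; Journal of Beijing Information Science and Technology University (Natural Science Edition); 2018-10-15 (No. 05); full text *
Design of an AJAX webpage information collection function with user-defined rules; Hu Yue et al.; Internet of Things Technologies; 2016-09-20 (No. 09); full text *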

Also Published As

Publication number Publication date
CN110489698A (en) 2019-11-22

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210000 10 / F, building D-1, Greenland window, Yuhuatai District, Nanjing City, Jiangsu Province

Applicant after: Cloud accounting room network technology Co.,Ltd.

Address before: 210000 10 / F, building D-1, Greenland window, Yuhuatai District, Nanjing City, Jiangsu Province

Applicant before: NANJING YUNZHANGFANG NETWORK TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Floor 5, Building H, Shuntian R&D Center, No. 21 Software Avenue, Yuhuatai District, Nanjing City, Jiangsu Province, 210000

Patentee after: Cloud accounting room network technology Co.,Ltd.

Address before: 210000 10 / F, building D-1, Greenland window, Yuhuatai District, Nanjing City, Jiangsu Province

Patentee before: Cloud accounting room network technology Co.,Ltd.

CP02 Change in the address of a patent holder