CN115883397B - Universal web-end data acquisition method - Google Patents

Publication number
CN115883397B
CN115883397B CN202211359874.7A CN202211359874A CN115883397B CN 115883397 B CN115883397 B CN 115883397B CN 202211359874 A CN202211359874 A CN 202211359874A CN 115883397 B CN115883397 B CN 115883397B
Authority
CN
China
Prior art keywords
task
web
xhr
interface
acquisition method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211359874.7A
Other languages
Chinese (zh)
Other versions
CN115883397A (en)
Inventor
李玺
康锐文
冯凯
王元卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Science And Technology Big Data Research Institute
Original Assignee
China Science And Technology Big Data Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Science And Technology Big Data Research Institute
Priority to CN202211359874.7A
Publication of CN115883397A
Application granted
Publication of CN115883397B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a universal method for acquiring web-side data, in the technical field of data mining. The method comprises five steps: first, configuring service information to complete the initialization configuration; second, creating a driver window and remotely starting a task instance; third, establishing a request connection with the server; fourth, forwarding the response content so that data is scheduled remotely; and fifth, disconnecting the session window once the information has been obtained. The method lowers the difficulty of web-side data acquisition, applies broadly, keeps the overall acquisition process simple, and improves the efficiency with which users acquire data.

Description

Universal web-end data acquisition method
Technical Field
The invention relates to the technical field of data mining, and in particular to a universal web-side data acquisition method.
Background
Web data collection is a software technique for extracting information from websites. It can extract any data that a browser can display, with the ultimate goal of turning unstructured information from a large number of web pages into structured, stored data. A web crawler is a program that automatically fetches web pages; crawlers are an important component of search engines, which use them to download pages from the World Wide Web.
Data acquisition is an important link in data mining, and mainstream data sources are mainly retrieved from the web side. As crawler and anti-crawler technologies have developed against each other, web-side data acquisition has become harder: in addition to the common asynchronous loading and interface signature verification, sites now add anti-crawler measures such as dynamic cookies, device fingerprints, custom data transmission protocols and user behavior verification. To acquire data successfully in the face of these different anti-crawler measures, the invention combines data mining and web crawler technologies to provide a universal web-side data acquisition method.
Disclosure of Invention
In view of the growing difficulty of web-side data acquisition and the steady strengthening of anti-crawler techniques, the invention provides a universal web-side data acquisition method that combines data mining and web crawler technologies, so that users can acquire data successfully under different anti-crawler measures.
The technical scheme adopted by the invention to solve this problem is a universal web-side data acquisition method comprising the following steps: step one: configure the environment information of the driver instance, create a SQLite database as the persistence container, create a task table to record acquisition tasks, start the web service, and expose three interfaces, create, xhr and close, thereby completing the service initialization configuration;
step two: create a driver window locally according to the three interfaces and remotely start a task instance;
step three: traverse the task queue to obtain the information of all subtasks in the task table, and pass the subtask, session_id and process_url through the xhr interface to establish the request connection; the driver accesses the target web page and initiates an xhr request for the subtask link from within the initial page, so that the request is made from the website's internal environment;
step four: the server asynchronously extracts the response content from the xhr request object, packages the response content (ResponseText) in JSON format, and returns it to the client through the xhr interface; on receiving the data, the client ends data acquisition for that subtask, parses and stores the successfully acquired data, and waits for new response content;
step five: when all tasks in the task table have been executed, the session window of the task instance is closed remotely through the server, and the task record in the database is deleted to release the driver's memory.
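The round trip in step four can be illustrated with a minimal Python sketch. The patent only states that the ResponseText is packaged as JSON and returned through the xhr interface; the envelope field names (`subtask`, `status`, `responseText`, `data`) and the two helper functions below are assumptions made for illustration.

```python
import json


def package_response(subtask: str, status: int, response_text: str) -> str:
    """Server side: wrap the raw XHR responseText in a JSON envelope.

    The envelope layout is hypothetical; the source does not specify it.
    """
    return json.dumps(
        {"subtask": subtask, "status": status, "responseText": response_text},
        ensure_ascii=False,
    )


def unpack_response(payload: str) -> dict:
    """Client side: decode the envelope and parse the body if it is JSON."""
    envelope = json.loads(payload)
    body = envelope["responseText"]
    try:
        envelope["data"] = json.loads(body)
    except (TypeError, ValueError):
        # The body is not JSON (for example HTML): keep the raw text.
        envelope["data"] = body
    return envelope
```

Keeping the original responseText as an opaque string inside the envelope means the server never needs to understand the target site's payload; parsing stays on the client, as the step describes.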
As a preferred technical scheme of the invention, the environment information of the driver instance configured in step one comprises the driver path, proxy settings, interface attributes and a high-anonymity environment.
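A possible shape for this environment information is sketched below in Python. The patent names only the four categories; the class name, the field names, the reading of "interface attributes" as window options, and the mapping onto Chromium command-line switches are all assumptions, since the source does not say which browser or driver is used.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DriverEnvironment:
    """Hypothetical container for the environment information of step one."""

    driver_path: str                          # path to the driver executable
    proxy: Optional[str] = None               # proxy setting, e.g. "http://host:port"
    window_size: Tuple[int, int] = (1280, 800)  # assumed "interface attribute"
    headless: bool = False                    # assumed "interface attribute"
    stealth: bool = True                      # assumed reading of the concealed environment

    def to_arguments(self) -> List[str]:
        """Translate the settings into Chromium-style switches (an assumption)."""
        width, height = self.window_size
        args = [f"--window-size={width},{height}"]
        if self.proxy:
            args.append(f"--proxy-server={self.proxy}")
        if self.headless:
            args.append("--headless=new")
        if self.stealth:
            # Hides one common automation marker; real setups usually need more.
            args.append("--disable-blink-features=AutomationControlled")
        return args
```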
As a preferred technical scheme of the invention, when the driver window is created in step two, the task name and the initial page address are first confirmed, and the task instance is then created through the create interface.
As a preferred technical scheme of the invention, the session_id, process_url and task name of the task instance in step two are stored in the task table.
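The task table could be kept in SQLite roughly as follows. This is a sketch using Python's standard `sqlite3` module; the table name `task`, the column types, the helper names, and the choice of the task name as primary key are assumptions. The sample `process_url` value treats it as the address of the driver process, which is an interpretation rather than something the patent states.

```python
import sqlite3

# Hypothetical schema: one row per task instance.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task (
    name        TEXT PRIMARY KEY,
    session_id  TEXT NOT NULL,
    process_url TEXT NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the persistence container and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_task(conn: sqlite3.Connection, name: str, session_id: str, process_url: str) -> None:
    """Record the session information of a task instance."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO task (name, session_id, process_url) VALUES (?, ?, ?)",
            (name, session_id, process_url),
        )


def load_task(conn: sqlite3.Connection, name: str):
    """Return (session_id, process_url) for a task name, or None if unknown."""
    return conn.execute(
        "SELECT session_id, process_url FROM task WHERE name = ?", (name,)
    ).fetchone()
```

Persisting the pair under the task name is what lets a later xhr or close call find the same driver window again after the create call has returned.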
As a preferred technical scheme of the invention, in step three the driver initiates the xhr request for the subtask link from within the initial page by loading a JS script, so that the request is made from the website's internal environment.
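One way to build such a script is sketched below in Python. The function name `build_xhr_script` is hypothetical, and the script assumes a Selenium-style asynchronous script call in which the driver passes a completion callback as the last argument; the patent does not name the driver library, so that convention is an assumption.

```python
import json


def build_xhr_script(subtask_url: str, method: str = "GET") -> str:
    """Return JavaScript that issues an XMLHttpRequest from inside the loaded page.

    Because the request runs in the page's own context, it carries the page's
    cookies and origin. The last script argument is assumed to be a completion
    callback supplied by the driver.
    """
    # json.dumps yields properly escaped JavaScript string literals.
    return (
        "var done = arguments[arguments.length - 1];\n"
        "var xhr = new XMLHttpRequest();\n"
        f"xhr.open({json.dumps(method)}, {json.dumps(subtask_url)}, true);\n"
        "xhr.onload = function () { done(xhr.responseText); };\n"
        "xhr.onerror = function () { done(null); };\n"
        "xhr.send();\n"
    )
```

With a Selenium-compatible driver, the returned string could be handed to `driver.execute_async_script(script)`, whose return value would then be the responseText or `None` on a network error.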
As a preferred technical scheme of the invention, in step five the task name, session_id and process_url are passed through the close interface, and the server uses the session identifier to connect to the driver window corresponding to the task name and close it remotely.
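The close handling might look like the following Python sketch. `FakeWindow`, the `windows` registry keyed by session_id, and the `task` table layout are hypothetical stand-ins; a real service would hold actual browser sessions and would be reached over HTTP.

```python
import sqlite3


class FakeWindow:
    """Stand-in for a driver window; a real one would wrap a browser session."""

    def __init__(self) -> None:
        self.open = True

    def quit(self) -> None:
        self.open = False


def close_task(conn: sqlite3.Connection, windows: dict, name: str,
               session_id: str, process_url: str) -> bool:
    """Close the window bound to a task and delete its record.

    Returns False when the supplied session information does not match
    the stored record, so a stale or mistaken call closes nothing.
    """
    row = conn.execute(
        "SELECT session_id, process_url FROM task WHERE name = ?", (name,)
    ).fetchone()
    if row != (session_id, process_url):
        return False
    window = windows.pop(session_id, None)
    if window is not None:
        window.quit()  # disconnect the session, releasing the driver's memory
    with conn:
        conn.execute("DELETE FROM task WHERE name = ?", (name,))
    return True
```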
Compared with the prior art, the invention has the following beneficial effects. The method first establishes a web service that schedules driver instances, then operates an instance remotely according to the client's task information so that data loading is completed inside the website's internal environment, and finally forwards the data back to the client for processing and parsing. On the one hand, this reduces the difficulty of web-side data acquisition: the information on any web page can be viewed and acquired quickly without attention to the underlying interfaces, which improves development efficiency. On the other hand, it turns remote data acquisition into a repeatable workflow: target information can be acquired with this method, which is highly general and not complex, and which further improves the efficiency with which users acquire data.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a flow chart of step one of the present invention;
FIG. 3 is a source file code diagram of step one of the present invention;
FIG. 4 is a source file code diagram of step two of the present invention;
FIG. 5 is a source file code diagram of a drive window created in step two of the present invention;
FIG. 6 is a flow chart of step four of the present invention;
FIG. 7 is a source file code diagram of step five of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and embodiments.
Referring to Figs. 1-7, the invention provides a universal web-side data acquisition method that combines web crawler technology to acquire web-side data in the face of increasingly strong anti-crawler techniques.
Embodiment one:
As shown in Figs. 1-7, a universal web-side data acquisition method comprises the following steps:
Step one: configure the service information to complete the service initialization configuration. Referring to Figs. 1-3, first configure the environment information of the driver instance, including the driver path, proxy settings, interface attributes, high-anonymity environment and the like. Create a SQLite database as the persistence container and create a task table to record acquisition tasks. Then start a web service through Flask that exposes three interfaces, create, xhr and close, for remote calls, thereby completing the service initialization configuration.
Step two: create a driver window and remotely start a task instance. Referring to Figs. 4-5, a driver window is created locally according to the three interfaces create, xhr and close. First the task name and the initial page address are confirmed. Note that in this embodiment the task name is "process", and the initial page address may be any website to be accessed. A task instance named "process" is created through the create interface according to the environment information configured in step one, and the initial page is then requested to keep the driver in a normal running state. After the web service is started, the server assigns each visitor an id, the session_id, which is used to retrieve or reconfigure the currently stored code. After the page is initialized, the session_id, the process_url and the instance information of the task named "process" are stored in the task table so that the task instance can be started remotely.
Step three: establish the request connection and make the in-site environment request. Using the task name created in step two, query the session information session_id and process_url in the database, and traverse all tasks to be acquired in the task table to obtain the information of all subtasks. Pass the subtask, session_id and process_url through the xhr interface and wait for the driver to finish loading the task, with the driver accessing the target web page confirmed in step two. After the server receives the subtask information passed through the xhr interface, it uses the session identifier to link to the driver window corresponding to the task name. The driver then initiates an XMLHttpRequest for the subtask link from within the initial page by loading a JS script, so that the request is made from the website's internal environment.
Step four: forward the response content and schedule the data remotely. Referring to Fig. 6, the server asynchronously extracts the response content from the XMLHttpRequest request object, packages the response content (ResponseText) in JSON format, and returns it to the client through the xhr interface. After receiving the data, the client ends the subtask, parses and stores the successfully acquired data, and then returns to waiting for new response content to be extracted from the XMLHttpRequest object. This process repeats until all subtasks in the task table have been executed and their results transferred to the client for processing.
Step five: disconnect the session window and release the driver's memory. Referring to Fig. 7, once all subtasks have been executed the whole acquisition flow is finished and the driver window can be closed automatically: the task name, session_id and process_url are passed through the close interface, the server uses the session identifier to link to the driver window corresponding to the task name, and the session of the driver instance is disconnected so that the driver window is closed remotely. At the same time, the task record in the database is deleted to release the driver's memory.
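The overall flow of the embodiment (create, repeated xhr calls, close) can be modelled in-process with the Python sketch below. Every name here (`StubDriver`, `AcquisitionService`, `run_collection`, the `stub://` process_url, the canned `pages` table) is hypothetical; the embodiment exposes the three interfaces over a web service and drives a real browser, neither of which is reproduced here.

```python
import json
import uuid


class StubDriver:
    """Stand-in for a browser driver: answers XHRs from a canned table."""

    def __init__(self, pages: dict) -> None:
        self.pages = pages

    def xhr(self, url: str) -> str:
        return self.pages.get(url, "")


class AcquisitionService:
    """In-process model of the create, xhr and close interfaces."""

    def __init__(self, pages: dict) -> None:
        self.pages = pages
        self.sessions = {}  # session_id -> driver window

    def create(self, name: str, start_url: str) -> dict:
        # A real service would open start_url in a new driver window here.
        session_id = uuid.uuid4().hex
        self.sessions[session_id] = StubDriver(self.pages)
        return {"name": name, "session_id": session_id,
                "process_url": "stub://" + session_id}

    def xhr(self, subtask: str, session_id: str, process_url: str) -> str:
        driver = self.sessions[session_id]  # link to the window by session id
        return json.dumps({"subtask": subtask, "responseText": driver.xhr(subtask)})

    def close(self, name: str, session_id: str, process_url: str) -> bool:
        return self.sessions.pop(session_id, None) is not None


def run_collection(service: AcquisitionService, name: str,
                   start_url: str, subtasks: list) -> dict:
    """Client loop: start the task, fetch each subtask, then close the window."""
    task = service.create(name, start_url)
    results = {}
    for subtask in subtasks:
        payload = service.xhr(subtask, task["session_id"], task["process_url"])
        results[subtask] = json.loads(payload)["responseText"]
    service.close(name, task["session_id"], task["process_url"])
    return results
```

For example, `run_collection(AcquisitionService({"/api/a": "{}"}), "process", "https://example.com/", ["/api/a"])` walks one subtask through all three interfaces and leaves no open session behind.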
Through the above steps the method completes the acquisition of target information. It is highly general and not complex, reduces the difficulty of web-side data acquisition, turns remote data acquisition into a repeatable workflow, and further improves development efficiency.
The foregoing describes preferred embodiments of the invention and is not intended to limit the invention to the particular embodiments disclosed; on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the invention.

Claims (6)

1. A universal web-side data acquisition method, characterized by comprising the following steps:
step one: configuring the environment information of a driver instance, creating a SQLite database as the persistence container, creating a task table to record acquisition tasks, starting a web service, and exposing three interfaces, create, xhr and close, thereby completing the service initialization configuration;
step two: creating a driver window locally according to the three interfaces and remotely starting a task instance;
step three: traversing the task queue to obtain the information of all subtasks in the task table, and passing the subtask, session_id and process_url through the xhr interface to establish the request connection, the driver accessing the target web page and initiating an xhr request for the subtask link from within the initial page, so that the request is made from the website's internal environment;
step four: the server asynchronously extracting the response content from the xhr request object, packaging the response content (ResponseText) in JSON format, and returning it to the client through the xhr interface; the client, on receiving the data, ending data acquisition for that subtask, parsing and storing the successfully acquired data, and waiting for new response content;
step five: when all tasks in the task table have been executed, remotely closing the session window of the task instance through the server, and deleting the task record in the database to release the driver's memory.
2. The universal web-side data acquisition method according to claim 1, wherein: the environment information of the driver instance configured in step one comprises the driver path, proxy settings, interface attributes and a high-anonymity environment.
3. The universal web-side data acquisition method according to claim 1, wherein: when the driver window is created in step two, the task name and the initial page address are first confirmed, and the task instance is created through the create interface.
4. The universal web-side data acquisition method according to claim 3, wherein: the session_id, process_url and task name of the task instance in step two are stored in the task table.
5. The universal web-side data acquisition method according to claim 1, wherein: in step three, the driver initiates the xhr request for the subtask link from within the initial page by loading a JS script, so that the request is made from the website's internal environment.
6. The universal web-side data acquisition method according to claim 1, wherein: in step five, the task name, session_id and process_url are passed through the close interface, and the server uses the session identifier to connect to the driver window corresponding to the task name, so that the driver window is closed remotely.
CN202211359874.7A 2022-11-02 2022-11-02 Universal web-end data acquisition method Active CN115883397B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211359874.7A CN115883397B (en) 2022-11-02 2022-11-02 Universal web-end data acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211359874.7A CN115883397B (en) 2022-11-02 2022-11-02 Universal web-end data acquisition method

Publications (2)

Publication Number Publication Date
CN115883397A (en) 2023-03-31
CN115883397B (en) 2024-04-05

Family

ID=85759330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211359874.7A Active CN115883397B (en) 2022-11-02 2022-11-02 Universal web-end data acquisition method

Country Status (1)

Country Link
CN (1) CN115883397B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017206668A1 (en) * 2016-05-30 2017-12-07 中兴通讯股份有限公司 Data analysis method, device, and system
CN110147475A (en) * 2019-03-29 2019-08-20 汇通达网络股份有限公司 A kind of network data acquisition system of distributed deployment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8402098B2 (en) * 2009-08-13 2013-03-19 Clark C. Dircz System and method for intelligence gathering and analysis
US20170124497A1 (en) * 2015-10-28 2017-05-04 Fractal Industries, Inc. System for automated capture and analysis of business information for reliable business venture outcome prediction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Web异步加载技术分析及信息爬取策略实现;杜润泽;梁英;方英兰;;电脑知识与技术;20180825(24);全文 *
建设项目施工现场实时监理影像信息系统;董玉友;张钢;;电子测量技术;20070815(08);全文 *
面向知识库问答的问句语义解析研究综述;仇韫琦等;《电子学报》;20220930;第50卷(第09期);2242-2264 *

Also Published As

Publication number Publication date
CN115883397A (en) 2023-03-31

Similar Documents

Publication Publication Date Title
CN104866383B (en) Interface calling method and device and terminal
US6446111B1 (en) Method and apparatus for client-server communication using a limited capability client over a low-speed communications link
CN103092581B (en) The building method of a kind of web front end this locality development environment and device
US8793347B2 (en) System and method for providing virtual web access
CN111344678A (en) Collaborative software development with heterogeneous development tools
CN102118442A (en) Method and device for accessing Web resources
WO2002029548A2 (en) Http transaction monitor with capacity to replay in debugging session
CN104516885B (en) The implementation method and device of browser dual core component
EP2590090A1 (en) Dynamic interface to read database through remote procedure call
US20090019151A1 (en) Method for media discovery
US20030135587A1 (en) Method and system of state management for data communications
US20030067480A1 (en) System and method of data transmission for computer networks utilizing HTTP
CN110598135A (en) Network request processing method and device, computer readable medium and electronic equipment
US20040210433A1 (en) System, method and apparatus for emulating a web server
CN105095220B (en) A kind of browser implementation method, terminal and virtualization agent device
CN104820680A (en) Universal distributed crawler scheduling system
CN110609714A (en) Data prefetching method, device and equipment and storage medium
RU2598988C2 (en) Methods and systems for searching for application software
CN115883397B (en) Universal web-end data acquisition method
US20230412694A1 (en) Communication system for micro-frontends of a web application
CN111273964B (en) Data loading method and device
CN112541136B (en) Network address information acquisition method and device, storage medium and electronic equipment
JP5043331B2 (en) Enhanced Internet session management protocol
CN117493720A (en) First screen performance optimization method, terminal device, electronic equipment and storage medium
CN106325895B (en) Method and system for starting preloading concerned webpage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant