CN115883397B

CN115883397B - Universal web-end data acquisition method

Info

Publication number: CN115883397B
Application number: CN202211359874.7A
Authority: CN
Inventors: 李玺; 康锐文; 冯凯; 王元卓
Original assignee: China Science And Technology Big Data Research Institute
Current assignee: China Science And Technology Big Data Research Institute
Priority date: 2022-11-02
Filing date: 2022-11-02
Publication date: 2024-04-05
Anticipated expiration: 2042-11-02
Also published as: CN115883397A

Abstract

The invention discloses a universal method for acquiring web-side data, which relates to the technical field of data mining and comprises five steps: first, configuring service information to complete the initialization configuration; second, creating a driver window and remotely starting a task instance; third, establishing a request connection with the server; fourth, forwarding the response content to schedule data remotely; and fifth, once the information has been obtained, disconnecting the session window. The method reduces the difficulty of web-side data acquisition, is broadly applicable, keeps the overall acquisition process simple, and improves the efficiency with which users acquire data.

Description

Universal web-end data acquisition method

Technical Field

The invention relates to the technical field of data mining, in particular to a universal web-side data acquisition method.

Background

Web data collection is a software technique for extracting information from websites. It can extract any data that a browser can display, with the ultimate goal of taking unstructured information from a large number of web pages and storing it in structured form. A web crawler is a program that automatically fetches web pages; crawlers are an important component of search engines, where they download pages from the World Wide Web.

Data acquisition is an important part of data mining, and mainstream data sources are mostly found on the web. As crawler and anti-crawler technologies have developed against each other, acquiring web-side data has become harder. Besides the common asynchronous loading and interface signature verification, anti-crawler measures now include dynamic cookies, device fingerprints, custom data transmission protocols, user behavior verification and the like, all intended to defeat crawlers. To acquire data successfully in the face of these different anti-crawler measures, the invention combines data mining and web crawler technology to provide a universal web-side data acquisition method.

Disclosure of Invention

Web-side data acquisition is becoming steadily more difficult as anti-crawler technology grows stronger. To address this problem, the invention provides a universal web-side data acquisition method that combines data mining and web crawler technology, so that users can acquire data successfully under different anti-crawler measures.

The invention solves the technical problem with the following scheme: a universal web-side data acquisition method comprising the following steps. Step one: configure the environment information of the driver instance, create an SQLite database as a persistence container, create a task table to record acquisition tasks, start the web service, and expose three interfaces, create, xhr and close, thereby completing the service initialization configuration;

Step two: create a driver window locally by means of the three interfaces and remotely start a task instance;

Step three: traverse the task queue to obtain the information of all subtasks in the task list, and transmit each subtask together with the session_id and process_url through the xhr interface to establish a request connection; the driver accesses the target web page and initiates an xhr request for the subtask link from within the initial page, so that the request is made inside the website's own environment;

Step four: the server asynchronously extracts the response content from the xhr request object, packages the ResponseText into JSON format, and returns it to the client through the xhr interface; on receiving the data, the client ends acquisition for that subtask, parses and stores the successfully acquired data, and waits for new response content;

Step five: when all tasks in the task list have been executed, the server remotely closes the session window of the task instance and deletes the task record from the database to release the memory held by the driver.
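The client side of steps three and four can be pictured as a loop over the subtask list. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `collect` and `post_xhr`, the `status` and `response_text` keys of the JSON reply, and the stand-in transport `fake_post_xhr` are all hypothetical. The transport is injected as a plain callable so that the example runs without a server.

```python
import json
from typing import Callable, Dict, List


def collect(
    subtasks: List[str],
    session_id: str,
    process_url: str,
    post_xhr: Callable[[Dict[str, str]], str],
) -> Dict[str, str]:
    """Send each subtask through the xhr interface and keep successful results.

    post_xhr stands in for an HTTP call to the service's xhr interface: it
    takes the request payload and returns the JSON text sent back by the server.
    """
    results: Dict[str, str] = {}
    for subtask in subtasks:  # step three: traverse the task queue
        payload = {
            "subtask": subtask,
            "session_id": session_id,
            "process_url": process_url,
        }
        reply = json.loads(post_xhr(payload))  # step four: JSON from the server
        if reply.get("status") == "ok":
            # Store only the data that was acquired successfully.
            results[subtask] = reply["response_text"]
    return results


def fake_post_xhr(payload: Dict[str, str]) -> str:
    """Stand-in transport (assumption): echoes the subtask link as the body."""
    if payload["subtask"].endswith("/bad"):
        return json.dumps({"status": "error", "response_text": ""})
    return json.dumps(
        {"status": "ok", "response_text": "data:" + payload["subtask"]}
    )
```

In a real deployment `post_xhr` would presumably issue an HTTP request to the exposed xhr interface; the patent does not specify the request format.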

As a preferred technical scheme of the invention, the environment information of the driver instance configured in step one comprises the driver path, proxy settings, interface attributes, and a high-anonymity environment.
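One way to picture this environment information is as a small configuration object that is translated into browser start-up switches. The sketch below assumes a Chromium-based browser, which the patent does not state; the class name `DriverEnvironment`, its field names, and the choice of a window size as the "interface attribute" are hypothetical. The three command-line switches used are existing Chromium switches.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DriverEnvironment:
    """Hypothetical container for the step-one environment information."""

    driver_path: str                            # path to the driver executable
    proxy: Optional[str] = None                 # e.g. "http://127.0.0.1:8080"
    window_size: Tuple[int, int] = (1280, 800)  # one possible interface attribute
    stealth: bool = True                        # hide an obvious automation hint

    def browser_arguments(self) -> List[str]:
        """Translate the settings into Chromium command-line switches."""
        width, height = self.window_size
        args = [f"--window-size={width},{height}"]
        if self.proxy:
            args.append(f"--proxy-server={self.proxy}")
        if self.stealth:
            # Assumption: this single switch stands in for the patent's
            # "high-anonymity environment", whose contents are not specified.
            args.append("--disable-blink-features=AutomationControlled")
        return args
```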

As a preferred technical scheme of the invention, when the driver window is created in step two, the task name and the initial page address are confirmed first, and the task instance is then created through the create interface.

As a preferred technical scheme of the invention, the session_id, process_url and task name of the task instance in step two are stored in the task table.
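The task table can be sketched with Python's standard `sqlite3` module. The patent names the stored fields (task name, session_id, process_url) but gives no schema, so the table name `task`, the column types, and the helper names below are assumptions made for illustration.

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical schema: only the three field names come from the patent.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task (
    task_name   TEXT PRIMARY KEY,
    session_id  TEXT NOT NULL,
    process_url TEXT NOT NULL
)
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Step one: create the persistence container and the task table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def save_task(conn: sqlite3.Connection, task_name: str,
              session_id: str, process_url: str) -> None:
    """Step two: record the session details of a started task instance."""
    conn.execute(
        "INSERT OR REPLACE INTO task (task_name, session_id, process_url) "
        "VALUES (?, ?, ?)",
        (task_name, session_id, process_url),
    )
    conn.commit()


def load_task(conn: sqlite3.Connection,
              task_name: str) -> Optional[Tuple[str, str]]:
    """Step three: look up (session_id, process_url) by task name."""
    return conn.execute(
        "SELECT session_id, process_url FROM task WHERE task_name = ?",
        (task_name,),
    ).fetchone()


def delete_task(conn: sqlite3.Connection, task_name: str) -> None:
    """Step five: remove the task record once the window is closed."""
    conn.execute("DELETE FROM task WHERE task_name = ?", (task_name,))
    conn.commit()
```

The default `":memory:"` path keeps the sketch self-contained; a file path would be needed for the record to persist between runs.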

As a preferred technical scheme of the invention, in step three the driver initiates the xhr request for the subtask link in the initial page by loading a JS script, so that the request is made inside the website's own environment.
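A script of this kind might look like the JavaScript generated below. This is an illustrative sketch: the helper name `build_xhr_script` is hypothetical, and the convention of treating the last element of `arguments` as a completion callback is the one used by Selenium's `execute_async_script`, a library the patent does not name.

```python
import json


def build_xhr_script(subtask_url: str, method: str = "GET") -> str:
    """Return JavaScript that issues an XMLHttpRequest from inside the page.

    Because the script runs in the already loaded initial page, the request
    carries that page's cookies and origin.
    """
    # json.dumps produces quoted, escaped string literals, so the URL and
    # method cannot break out of the generated script.
    return (
        "var done = arguments[arguments.length - 1];\n"
        "var xhr = new XMLHttpRequest();\n"
        f"xhr.open({json.dumps(method)}, {json.dumps(subtask_url)}, true);\n"
        "xhr.onload = function () {\n"
        "  done({status: xhr.status, responseText: xhr.responseText});\n"
        "};\n"
        "xhr.onerror = function () {\n"
        "  done({status: 0, responseText: ''});\n"
        "};\n"
        "xhr.send();\n"
    )
```

With Selenium, for example, the returned string could be passed to `driver.execute_async_script(script)`, which waits until the `done` callback is invoked and returns its argument.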

In step five, the task name, session_id and process_url are transmitted through the close interface, and the server links to the driver window corresponding to the task name by means of the session identifier so as to close that window remotely.

Compared with the prior art, the invention has the following beneficial effects. The invention provides a universal web-side data acquisition method that first sets up a web service to schedule driver instances, then remotely operates an instance according to the client's task information so that data loading is completed inside the website's own environment, and finally forwards the data back to the client for processing and analysis. On the one hand, this reduces the difficulty of web-side data acquisition: the information on any web page can be viewed and acquired quickly without inspecting an interface, which improves development efficiency. On the other hand, it establishes a workflow for remote data acquisition: target information can be acquired with this method, which is broadly applicable and uncomplicated, and the efficiency with which users acquire data is further improved.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a flow chart of step one of the present invention;

FIG. 3 is a source file code diagram of step one of the present invention;

FIG. 4 is a source file code diagram of step two of the present invention;

FIG. 5 is a source file code diagram of a drive window created in step two of the present invention;

FIG. 6 is a flow chart of step four of the present invention;

FIG. 7 is a source file code diagram of step five of the present invention.

Detailed Description

The invention will be further described with reference to the drawings and examples.

Referring to FIGS. 1-5, the invention provides a universal web-side data acquisition method, which applies web crawler technology to acquire web-side data in the face of steadily strengthening anti-crawler technology.

Embodiment one:

As shown in FIGS. 1-7, a universal web-side data acquisition method comprises the following steps.

Step one: configure the service information to complete the service initialization configuration. Specifically, referring to FIGS. 1-3, first configure the environment information of the driver instance, including the driver path, proxy settings, interface attributes, high-anonymity environment and the like; create an SQLite database as a persistence container, and create a task table to record acquisition tasks; then start the web service through Flask and expose the three interfaces create, xhr and close for remote calls, thereby completing the service initialization configuration.

Step two: create a driver window and remotely start a task instance. Specifically, referring to FIGS. 4-5, a driver window is created locally by means of the three interfaces create, xhr and close. First, the task name and the initial page address are confirmed. Note that in this embodiment the task is named "process", and the initial page address may be that of any website to be accessed. A task instance named "process" is created through the create interface using the environment information configured in step one, and the initial page is then requested so that the driver stays in its normal running state. Once the web service has started, the server allocates each visitor an id, namely the session_id, which is used to retrieve or reconfigure the currently stored code. After the page has been initialized, the session_id and process_url of the task instance and the instance information for the task name "process" are stored in the task table created earlier, so that the task instance can be started remotely.

Step three: establish the request connection and make the request inside the site's environment. The session information, namely the session_id and process_url, is queried from the database using the task name created in step two, and all tasks to be acquired in the task table are traversed to obtain the information of all subtasks. The subtask, session_id and process_url information is transmitted through the xhr interface, and the client waits for the driver to finish loading the task while the driver accesses the target web page confirmed in step two. After the server receives the subtask information transmitted through the xhr interface, it links to the driver window corresponding to the task name by means of the session identifier. The driver then initiates an XMLHttpRequest for the subtask link from within the initial page by loading a JS script, so that the request is made inside the website's own environment.

Step four: forward the response content and schedule the data remotely. Referring to FIG. 6, the server asynchronously extracts the response content from the XMLHttpRequest request object, packages the ResponseText into JSON format, and returns it to the client through the xhr interface. After the client receives the data, the subtask ends, and the client parses and stores the successfully acquired data. The server then returns to waiting to extract new response content from the XMLHttpRequest object, and this process repeats until all subtasks in the task list have been passed to the client for processing.

Step five: disconnect the session window and release the driver memory. Referring to FIG. 7, once all subtasks have been executed, the whole acquisition flow is finished and the driver window can be closed automatically: the task name, session_id and process_url are transmitted through the close interface, the server links to the driver window corresponding to the task name by means of the session identifier, and the session of the driver instance is disconnected to close the driver window remotely. At the same time, the task record in the database is deleted to release the memory held by the driver.

Through the above steps the method can complete the acquisition of target information. It is broadly applicable and uncomplicated, reduces the difficulty of web-side data acquisition, establishes a workflow for remote data acquisition, and further improves development efficiency.
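The server-side behaviour of the create, xhr and close interfaces described in this embodiment can be modelled in-process, without HTTP or a real browser. The sketch below is an illustration only: `FakeDriver`, `AcquisitionService`, the in-memory dictionaries standing in for the task table, and the placeholder driver address are hypothetical, and the patent does not define what process_url contains (it is treated here as an opaque address of the driver process).

```python
import json
import uuid
from typing import Dict


class FakeDriver:
    """Stand-in for a browser driver window (assumption: no real browser)."""

    def __init__(self, start_url: str) -> None:
        self.session_id = uuid.uuid4().hex
        self.process_url = "http://127.0.0.1:9515"  # placeholder address
        self.start_url = start_url
        self.closed = False

    def xhr(self, link: str) -> str:
        # A real driver would run an in-page XMLHttpRequest here.
        return f"response for {self.start_url}{link}"

    def quit(self) -> None:
        self.closed = True


class AcquisitionService:
    """In-process model of the create, xhr and close interfaces."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Dict[str, str]] = {}  # plays the task table
        self.windows: Dict[str, FakeDriver] = {}    # session_id -> driver

    def create(self, task_name: str, start_url: str) -> Dict[str, str]:
        """Step two: open a driver window and record its session details."""
        driver = FakeDriver(start_url)
        self.windows[driver.session_id] = driver
        record = {
            "session_id": driver.session_id,
            "process_url": driver.process_url,
        }
        self.tasks[task_name] = record
        return record

    def xhr(self, session_id: str, subtask: str) -> str:
        """Steps three and four: run the request in the window, return JSON."""
        driver = self.windows[session_id]  # link to the window by session id
        return json.dumps(
            {"subtask": subtask, "response_text": driver.xhr(subtask)}
        )

    def close(self, task_name: str) -> None:
        """Step five: delete the task record and release the driver."""
        record = self.tasks.pop(task_name)
        self.windows.pop(record["session_id"]).quit()
```

In the embodiment these three methods would be exposed as web endpoints by the service started in step one; keeping them as plain methods here only serves to make the control flow easy to follow.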

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims

1. The universal web-side data acquisition method is characterized by comprising the following steps of:

step one: configuring the environment information of a driver instance, creating an SQLite database as a persistence container, creating a task table to record acquisition tasks, starting a web service, and exposing three interfaces, create, xhr and close, thereby completing the service initialization configuration;

2. The universal web-side data acquisition method according to claim 1, wherein the environment information of the driver instance configured in step one comprises a driver path, proxy settings, interface attributes, and a high-anonymity environment.

3. The universal web-side data acquisition method according to claim 1, wherein, when the driver window is created in step two, a task name and an initial page address are confirmed first, and a task instance is then created through the create interface.

4. The universal web-side data acquisition method according to claim 3, wherein the session_id, process_url and task name of the task instance in step two are stored in the task table.

5. The universal web-side data acquisition method according to claim 1, wherein, in step three, the driver initiates the xhr request for the subtask link in the initial page by loading a JS script, so that the request is made inside the website's own environment.

6. The universal web-side data acquisition method according to claim 1, wherein, in step five, the task name, session_id and process_url are transmitted through the close interface, and the server links to the driver window corresponding to the task name by means of the session identifier, so that the driver window is closed remotely.