CN112800307A

CN112800307A - Configurable webpage information crawling method based on Java Web

Info

Publication number: CN112800307A
Application number: CN202110094952.4A
Authority: CN
Inventors: 肖培玉; 魏金雷; 徐士强; 梁圣奇; 赵子恒
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2021-01-25
Filing date: 2021-01-25
Publication date: 2021-05-14

Abstract

The invention discloses a configurable webpage information crawling method based on Java Web, and belongs to the technical field of software development. The method is organized in the form of task flows; each task flow comprises crawling information configuration, webpage information parsing, data storage configuration and timed data update. Specific webpage information is crawled, the parsed data is stored in a MySQL database, and the timed update of the data is completed. The method can automatically update webpage data, keeps the crawled data synchronized with its source, ensures the validity and timeliness of the data, and has good popularization and application value.

Description

Configurable webpage information crawling method based on Java Web

Technical Field

The invention relates to the technical field of software development, and particularly provides a configurable webpage information crawling method based on Java Web.

Background

Java is a widely used computer programming language that is cross-platform and object-oriented and supports generic programming; it is widely applied in enterprise-level Web application development and mobile application development. MySQL is an open-source relational database management system (RDBMS) that uses Structured Query Language (SQL), the most common database management language, for database management.

In some working scenarios, data must be acquired from a specific website. However, the website may update its data at irregular times, so data crawled once cannot stay synchronized with the latest data, and the validity and timeliness of the data cannot be guaranteed.

Disclosure of Invention

The technical task of the invention is to provide a configurable webpage information crawling method based on Java Web, which can automatically update webpage data, keep the crawled data synchronized, and ensure the validity and timeliness of the data.

In order to achieve the purpose, the invention provides the following technical scheme:

A configurable webpage information crawling method based on Java Web is organized in the form of task flows, wherein each task flow comprises crawling information configuration, webpage information parsing, data storage configuration and timed data update; specific webpage information is crawled, the parsed data is stored in a MySQL database, and the timed update of the data is completed.

The method is built around task flows; that is, each crawler is an independent task flow.

Preferably, the crawling information configuration supports a general mode and a custom mode. In the general mode, the configuration is completed by entering the correct configuration parameters; in the custom mode, parameters in YAML format are preset, and the user selects the parameters to use as required.

Preferably, the configuration items of the general mode include a request address, a request parameter and a request Header for acquiring webpage information data.
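By way of illustration only, the three general-mode configuration items can be sketched as a small Java class that assembles an HTTP request with the JDK's `java.net.http` API. The class name `GeneralModeConfig`, its field names, and the choice of the GET method are assumptions made for this sketch and are not specified by the invention.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical holder for the three general-mode configuration items.
class GeneralModeConfig {
    final String requestAddress;
    final Map<String, String> requestParameters;
    final Map<String, String> requestHeaders;

    GeneralModeConfig(String requestAddress,
                      Map<String, String> requestParameters,
                      Map<String, String> requestHeaders) {
        this.requestAddress = requestAddress;
        this.requestParameters = requestParameters;
        this.requestHeaders = requestHeaders;
    }

    // Appends the request parameters to the request address as a URL-encoded query string.
    URI toUri() {
        if (requestParameters.isEmpty()) {
            return URI.create(requestAddress);
        }
        String query = requestParameters.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        String separator = requestAddress.contains("?") ? "&" : "?";
        return URI.create(requestAddress + separator + query);
    }

    // Builds a GET request carrying the configured headers (GET is an assumption of this sketch).
    HttpRequest toRequest() {
        HttpRequest.Builder builder = HttpRequest.newBuilder(toUri()).GET();
        requestHeaders.forEach(builder::header);
        return builder.build();
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

A request built this way could then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` to obtain the response message for parsing.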

Preferably, the preset parameters include configuration for token acquisition, cookies, the password encryption mode and verification code parameters.

Preferably, in the webpage information parsing, the configuration is tested after the crawling information configuration is completed; if the test passes, the webpage is crawled to acquire data, a response message is obtained, and the message is parsed.

Preferably, the webpage data parsing supports data in HTML, XML and JSON formats, wherein messages in HTML and JSON formats are parsed with the Jsoup component, and messages in XML format are parsed with Dom4j.

The method of the invention supports persisting crawled data into a database. It is noted that the configuration of the data store needs to be completed before storage.

Preferably, the data storage configuration supports the setting of the corresponding relation between the result of the webpage information analysis and the MySQL data table, and forms a configuration file to be persisted in the database.

The data storage configuration contains the following information: the MySQL table to which the data crawling task corresponds; and the correspondence between the body of the task's response message and the MySQL fields, namely the correspondence between the fields (or content nodes) of the message and the fields of the MySQL table. Based on the data storage configuration, database tables and table fields can be created and updated automatically; meanwhile, the data returned by each crawl of each task can be stored in the corresponding (preset) data table of the MySQL database.
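By way of illustration only, a correspondence between message fields and table fields can be turned into SQL statements as sketched below. The class name `StorageConfigSketch`, the use of `VARCHAR(255)` for every column, and the restriction of identifiers to letters, digits and underscores are assumptions of this sketch; the invention does not specify how the statements are generated.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper that derives SQL from a "message field -> MySQL column" mapping.
class StorageConfigSketch {

    // Builds a CREATE TABLE statement with one column per mapped message field.
    // Every column is typed VARCHAR(255) purely for simplicity (assumption).
    static String createTableSql(String table, Map<String, String> fieldToColumn) {
        String columns = fieldToColumn.values().stream()
                .map(column -> quote(column) + " VARCHAR(255)")
                .collect(Collectors.joining(", "));
        return "CREATE TABLE IF NOT EXISTS " + quote(table) + " (" + columns + ")";
    }

    // Builds a parameterized INSERT statement; values are bound later via PreparedStatement.
    static String insertSql(String table, Map<String, String> fieldToColumn) {
        String columns = fieldToColumn.values().stream()
                .map(StorageConfigSketch::quote)
                .collect(Collectors.joining(", "));
        String placeholders = fieldToColumn.values().stream()
                .map(column -> "?")
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + quote(table) + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    // Rejects anything but simple identifiers before wrapping them in backticks.
    static String quote(String identifier) {
        if (!identifier.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("unsupported identifier: " + identifier);
        }
        return "`" + identifier + "`";
    }
}
```

Passing the mapping as a `LinkedHashMap` keeps the column order stable between the two statements.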

The crawler task in the invention supports two modes: independent operation and timed execution. In independent operation, the task is executed once and is not executed again. In timed execution, an execution interval is set, a Cron expression is finally generated, and the execution of the task is triggered each time the interval elapses.
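By way of illustration only, the conversion of an execution interval into a Cron expression can be sketched as follows. The sketch assumes the six-field Quartz-style format (seconds, minutes, hours, day of month, month, day of week); the invention does not name a scheduler, and the class name `CronSketch` is hypothetical. Only intervals that divide an hour or a day evenly are accepted, because other intervals cannot be expressed as a uniform period in a single Cron expression.

```java
import java.time.Duration;

// Hypothetical helper that turns an execution interval into a Quartz-style Cron expression.
class CronSketch {

    static String toCron(Duration interval) {
        long minutes = interval.toMinutes();
        // Reject zero, negative, and sub-minute-precision intervals.
        if (interval.isZero() || interval.isNegative()
                || !interval.equals(Duration.ofMinutes(minutes))) {
            throw new IllegalArgumentException("interval must be a positive number of whole minutes");
        }
        // Every N minutes, where N divides 60 evenly.
        if (minutes < 60 && 60 % minutes == 0) {
            return "0 0/" + minutes + " * * * ?";
        }
        if (minutes % 60 == 0) {
            long hours = minutes / 60;
            // Every N hours, where N divides 24 evenly.
            if (hours < 24 && 24 % hours == 0) {
                return "0 0 0/" + hours + " * * ?";
            }
            // Once a day at midnight.
            if (hours == 24) {
                return "0 0 0 * * ?";
            }
        }
        throw new IllegalArgumentException("interval does not divide an hour or a day evenly");
    }
}
```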

Preferably, for the timed data update, a unique key is set for the crawled data, and the data is updated with the unique key serving as the update identifier. For a task whose data changes, a timed task can be selected for updating; the execution period of the timed task can be customized with a Cron expression so as to obtain the latest data, which is written to the database after comparison.

Preferably, the timed data update supports full update and difference update: the full update empties the table and then inserts the data in batches, and the difference update updates only the changed part.
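By way of illustration only, the comparison step of a difference update can be sketched as follows, with stored rows and freshly crawled rows both indexed by the unique key. The class name `DifferenceUpdateSketch`, the representation of a row as a `Map<String, String>`, and the handling of rows that have disappeared from the source (listed for deletion) are assumptions of this sketch; the invention states only that the changed part is updated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical comparison of stored rows and crawled rows, both keyed by the unique key.
class DifferenceUpdateSketch {

    // Unique keys grouped by the action a difference update would take.
    static final class Plan {
        final List<String> toInsert = new ArrayList<>();
        final List<String> toUpdate = new ArrayList<>();
        final List<String> toDelete = new ArrayList<>();
    }

    static Plan plan(Map<String, Map<String, String>> stored,
                     Map<String, Map<String, String>> crawled) {
        Plan plan = new Plan();
        for (Map.Entry<String, Map<String, String>> entry : crawled.entrySet()) {
            Map<String, String> existing = stored.get(entry.getKey());
            if (existing == null) {
                plan.toInsert.add(entry.getKey());      // new unique key
            } else if (!existing.equals(entry.getValue())) {
                plan.toUpdate.add(entry.getKey());      // same key, changed content
            }
            // Unchanged rows are left untouched.
        }
        for (String key : stored.keySet()) {
            if (!crawled.containsKey(key)) {
                plan.toDelete.add(key);                 // assumption: vanished rows are removed
            }
        }
        return plan;
    }
}
```

A full update needs no such comparison: the table is emptied and all crawled rows are inserted in batches.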

Compared with the prior art, the configurable webpage information crawling method based on Java Web has the following outstanding beneficial effects: specific webpage information can be crawled through configuration, the information content is stored in the MySQL database, a reasonable update time can be set, the crawled data can be kept synchronized, and the method has good popularization and application value.

Drawings

FIG. 1 is a flowchart of a configurable method for crawling Web page information based on Java Web according to the present invention.

Detailed Description

The following describes the configurable Web page information crawling method based on Java Web in detail with reference to the accompanying drawings and embodiments.

Examples

As shown in FIG. 1, the configurable webpage information crawling method based on Java Web of the present invention is organized in the form of task flows. Each task flow includes crawling information configuration, webpage information parsing, data storage configuration and timed data update; specific webpage information is crawled, and the parsed data is stored in a MySQL database, so as to complete the timed update of the data. The method is built around task flows; that is, each crawler is an independent task flow.

The crawling information configuration supports a general mode and a custom mode.

General mode: the configuration is completed by entering the correct configuration parameters. The configuration items include the request address, the request parameters and the request Header used to acquire the webpage information data.

Custom mode: this mode supports a high degree of customization. Various parameters in YAML format are preset, including parameters for commonly used scenarios; the user selects the parameters to use as required. The preset parameters include the configuration of parameters such as token acquisition, cookie, password encryption mode and verification code.

In the webpage information parsing, the configuration is tested after the crawling information configuration is completed; if the test passes, the webpage is crawled to acquire data, a response message is obtained, and the message is parsed. The webpage data parsing supports data in HTML, XML and JSON formats, wherein messages in HTML and JSON formats are parsed with the Jsoup component, and messages in XML format are parsed with Dom4j. The method of the invention supports persisting crawled data into a database. It is noted that the configuration of the data store needs to be completed before storage.

The data storage configuration supports setting the correspondence between the result of the webpage information parsing and the MySQL data table, and forms a configuration file that is persisted in the database.

For the timed data update, a unique key is set for the crawled data, and the data is updated with the unique key serving as the update identifier. For a task whose data changes, a timed task can be selected for updating; the execution period of the timed task can be customized with a Cron expression so as to obtain the latest data, which is written to the database after comparison.

The timed data update supports full update and difference update: the full update empties the table and then inserts the data in batches, and the difference update updates only the changed part.

By using multiple task flows, the configurable webpage information crawling method based on Java Web can connect a plurality of nodes in series, crawl specific webpage information, store the parsed data in a MySQL database, and update the data on a schedule. The crawling information can be set in the general mode or the custom mode (using YAML) and can be tested. Webpage information data in a variety of formats can be parsed by different components. Through configuration, tables and fields can be created and updated automatically, and the parsed webpage information can be stored in the MySQL database automatically. The webpage information data in the MySQL database can be updated on a schedule through several strategies.

The above-described embodiments are merely preferred embodiments of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims

1. A configurable webpage information crawling method based on Java Web, characterized in that: the method is organized in the form of task flows, each task flow comprises crawling information configuration, webpage information parsing, data storage configuration and timed data update, specific webpage information is crawled, the parsed data is stored in a MySQL database, and the timed update of the data is completed.

2. The configurable webpage information crawling method based on Java Web according to claim 1, wherein: the crawling information configuration supports a general mode and a custom mode; in the general mode, the configuration is completed by entering the correct configuration parameters; in the custom mode, parameters in YAML format are preset, and the user selects the parameters to use as required.

3. The configurable webpage information crawling method based on Java Web according to claim 2, wherein: the configuration items of the general mode comprise a request address, a request parameter and a request Header for acquiring webpage information data.

4. The configurable webpage information crawling method based on Java Web according to claim 3, wherein: the preset parameters comprise configuration for token acquisition, cookies, the password encryption mode and verification code parameters.

5. The configurable webpage information crawling method based on Java Web according to claim 4, wherein: in the webpage information parsing, the configuration is tested after the crawling information configuration is completed, and if the test passes, the webpage is crawled to acquire data, a response message is obtained, and the message is parsed.

6. The configurable webpage information crawling method based on Java Web according to claim 5, wherein: the webpage data parsing supports data in HTML, XML and JSON formats, wherein messages in HTML and JSON formats are parsed with the Jsoup component, and messages in XML format are parsed with Dom4j.

7. The configurable webpage information crawling method based on Java Web according to claim 6, wherein: the data storage configuration supports setting the correspondence between the result of the webpage information parsing and the MySQL data table, and forms a configuration file that is persisted in the database.

8. The configurable webpage information crawling method based on Java Web according to claim 7, wherein: for the timed data update, a unique key is set for the crawled data, and the data is updated with the unique key serving as the update identifier.

9. The configurable webpage information crawling method based on Java Web according to claim 8, wherein: the timed data update supports full update and difference update, the full update empties the table and then inserts the data in batches, and the difference update updates only the changed part.