CN112800307A - Configurable webpage information crawling method based on Java Web - Google Patents
Configurable webpage information crawling method based on Java Web
- Publication number
- CN112800307A (application CN202110094952.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- configurable
- configuration
- information
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/42—Syntactic analysis
- G06F8/427—Parsing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a configurable webpage information crawling method based on Java Web and belongs to the technical field of software development. The method is presented in the form of task flows. Each task flow comprises crawling information configuration, webpage information parsing, data storage configuration, and scheduled data update: specific webpage information is crawled, the parsed data is stored in a MySQL database, and the data is then updated on a schedule. The method can update webpage data automatically, keeps the crawled data synchronized with its source, ensures the validity and timeliness of the data, and has good value for popularization and application.
Description
Technical Field
The invention relates to the technical field of software development, and particularly provides a configurable webpage information crawling method based on Java Web.
Background
Java is a widely used programming language. It is cross-platform, object-oriented, and supports generic programming, and it is widely applied in enterprise-level Web application development and mobile application development. MySQL is an open source relational database management system (RDBMS) that is managed with Structured Query Language (SQL), the most common database management language.
In some working scenarios, data must be acquired from a specific website. Because that data may be updated at irregular intervals, a one-off crawl cannot stay synchronized with the latest data, and the validity and timeliness of the data cannot be guaranteed.
Disclosure of Invention
The technical task of the invention is to provide a configurable webpage information crawling method based on Java Web that can update webpage data automatically, keep the crawled data synchronized with its source, and ensure the validity and timeliness of the data.
To achieve this purpose, the invention provides the following technical solution:
A configurable webpage information crawling method based on Java Web is presented in the form of task flows. Each task flow comprises crawling information configuration, webpage information parsing, data storage configuration, and scheduled data update: specific webpage information is crawled, the parsed data is stored in a MySQL database, and the data is then updated on a schedule.
The method is organized as task flows; that is, each crawler is an independent task flow.
Preferably, the crawling information configuration supports a general mode and a custom mode. In the general mode, the user enters the required configuration parameters directly. In the custom mode, parameters in YAML format are preset, and the user selects the parameters to use as required.
Preferably, the configuration items of the general mode include the request address, the request parameters, and the request header used to acquire webpage information data.
Preferably, the preset parameters include configuration for token acquisition, cookies, the password encryption mode, and verification code parameters.
Preferably, webpage information parsing begins with a test of the configuration once the crawling information configuration is complete. If the test passes, the web crawler runs to acquire data, obtains a response message, and parses the message.
Preferably, webpage data parsing supports data in HTML, XML, and JSON formats: messages in HTML and JSON formats are parsed with the Jsoup component, and messages in XML format are parsed with dom4j.
The method supports persisting crawled data into a database. Note that the data storage configuration must be completed before any data is stored.
Preferably, the data storage configuration supports setting the correspondence between the results of webpage information parsing and a MySQL data table, and the resulting configuration file is persisted in the database.
The data storage configuration contains the following information: the table in the MySQL database that corresponds to the data crawling task; and the correspondence between the body of the task's response message and the MySQL fields, that is, between each field (or content node) of the message and a field of the MySQL table. Based on the data storage configuration, database tables and table fields can be created and updated automatically, and the data returned by each crawl of each task can be stored in the corresponding (preset) data table of the MySQL database.
The crawler task in the invention supports two modes: independent operation and timed execution. In independent operation, the task runs once and is not executed again. In timed execution, an execution interval is set, a Cron expression is generated from it, and the task is triggered each time the interval elapses.
Preferably, for scheduled data update, a unique key is set for the crawled data, and the data is updated using that unique key as the update marker. For a task whose data changes, a scheduled task can be selected for updating. The execution period of the scheduled task can be customized with a Cron expression; on each run the latest data is fetched, compared with the stored data, and then written to the database.
Preferably, scheduled data update supports full update and difference update: a full update empties the table and then inserts the data in batches, while a difference update writes only the changed part.
Compared with the prior art, the configurable webpage information crawling method based on Java Web has the following notable advantages: it can crawl specific webpage information through configuration alone, stores the information content in a MySQL database, allows a reasonable update schedule to be set, keeps the crawled data synchronized with its source, and has good value for popularization and application.
Drawings
FIG. 1 is a flowchart of a configurable method for crawling Web page information based on Java Web according to the present invention.
Detailed Description
The following describes the configurable Web page information crawling method based on Java Web in detail with reference to the accompanying drawings and embodiments.
Examples
As shown in FIG. 1, the configurable webpage information crawling method based on Java Web is presented in the form of task flows. Each task flow comprises crawling information configuration, webpage information parsing, data storage configuration, and scheduled data update: specific webpage information is crawled, the parsed data is stored in a MySQL database, and the data is then updated on a schedule. The method is organized as task flows; that is, each crawler is an independent task flow.
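The idea that each crawler is an independent task flow with four parts can be sketched roughly as follows. This is only an illustration: the class name `TaskFlow`, the `Stage` names, and the strict stage ordering are assumptions made for the sketch and are not taken from the patent.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one crawler as an independent task flow with four stages. */
final class TaskFlow {
    /** The four parts named in the text; enforcing this order is an assumption of the sketch. */
    enum Stage { CRAWL_CONFIG, PAGE_PARSING, STORAGE_CONFIG, SCHEDULED_UPDATE }

    private final String name;
    private final List<Stage> completed = new ArrayList<>();

    TaskFlow(String name) {
        this.name = name;
    }

    /** Records a finished stage, rejecting any stage that is attempted out of order. */
    TaskFlow complete(Stage stage) {
        if (isFinished()) {
            throw new IllegalStateException(name + ": all stages are already complete");
        }
        Stage expected = Stage.values()[completed.size()];
        if (stage != expected) {
            throw new IllegalStateException(name + ": expected " + expected + " but got " + stage);
        }
        completed.add(stage);
        return this;
    }

    /** True once all four stages have been recorded. */
    boolean isFinished() {
        return completed.size() == Stage.values().length;
    }
}
```

Because every `TaskFlow` instance keeps its own state, several crawlers can be configured side by side without affecting one another.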
The crawling information configuration supports a general mode and a custom mode.
General mode: the user enters the required configuration parameters. The configuration items are the request address, the request parameters, and the request header used to acquire webpage information data.
Custom mode: this mode supports a high degree of customization. A range of parameters in YAML format, covering commonly used scenarios, is preset, and the user selects the parameters to use as required. The preset parameters include configuration for token acquisition, cookies, the password encryption mode, and verification codes.
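The three general-mode items (request address, request parameters, request header) map naturally onto an HTTP request. The sketch below is a minimal illustration using the JDK's `java.net.http` API; the class name `CrawlRequestConfig`, its fields, and the choice of a GET request are assumptions, since the patent does not describe how the request is built.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical holder for the general-mode items: address, parameters, headers. */
final class CrawlRequestConfig {
    final String address;
    final Map<String, String> params = new LinkedHashMap<>();
    final Map<String, String> headers = new LinkedHashMap<>();

    CrawlRequestConfig(String address) {
        this.address = address;
    }

    /** Builds a GET request: parameters become the query string, headers are copied over. */
    HttpRequest toRequest() {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            query.add(URLEncoder.encode(p.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
        }
        String url = params.isEmpty() ? address : address + "?" + query;
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        headers.forEach(builder::header);
        return builder.build();
    }
}
```

The resulting `HttpRequest` could then be sent with `java.net.http.HttpClient`; custom-mode items such as tokens or cookies would most likely end up as additional headers, although the source does not say so explicitly.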
Webpage information parsing begins with a test of the configuration once the crawling information configuration is complete. If the test passes, the web crawler runs to acquire data, obtains a response message, and parses the message. Webpage data parsing supports data in HTML, XML, and JSON formats: messages in HTML and JSON formats are parsed with the Jsoup component, and messages in XML format are parsed with dom4j. The method supports persisting crawled data into a database. Note that the data storage configuration must be completed before any data is stored.
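As an illustration of the XML branch of the parsing step, the sketch below pulls the text of every element with a given tag name out of a response message. To stay free of dependencies it uses the JDK's built-in DOM parser as a stand-in for dom4j, which is the component the patent actually names; the class name `XmlMessageParser` and the method `extract` are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical XML message parser built on the JDK DOM API (a stand-in for dom4j). */
final class XmlMessageParser {
    /** Returns the trimmed text of every element named tagName, in document order. */
    static List<String> extract(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so crawled pages cannot trigger external-entity loading.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML message", e);
        }
    }
}
```

A failed parse surfaces as an exception, which fits the described flow in which the configuration is tested before the crawler runs.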
The data storage configuration supports setting the correspondence between the results of webpage information parsing and a MySQL data table, and the resulting configuration file is persisted in the database.
The data storage configuration contains the following information: the table in the MySQL database that corresponds to the data crawling task; and the correspondence between the body of the task's response message and the MySQL fields, that is, between each field (or content node) of the message and a field of the MySQL table. Based on the data storage configuration, database tables and table fields can be created and updated automatically, and the data returned by each crawl of each task can be stored in the corresponding (preset) data table of the MySQL database.
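One way to picture the data storage configuration is as a table name plus a field-to-column mapping from which SQL statements are generated. The sketch below is an assumption-laden illustration: the class `StorageConfig`, the blanket `TEXT` column type, and the identifier check are all hypothetical, and the patent does not state how tables are actually created.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical storage configuration: target table plus message-field to column mapping. */
final class StorageConfig {
    final String table;
    /** Message field (or content node) name mapped to a MySQL column name, in insertion order. */
    final Map<String, String> fieldToColumn = new LinkedHashMap<>();

    StorageConfig(String table) {
        this.table = table;
    }

    /** Backtick-quotes an identifier after checking it contains only safe characters. */
    private static String quote(String identifier) {
        if (!identifier.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("Unsafe identifier: " + identifier);
        }
        return "`" + identifier + "`";
    }

    /** CREATE TABLE statement; every column is TEXT purely to keep the sketch short. */
    String createTableSql() {
        StringJoiner cols = new StringJoiner(", ");
        for (String column : fieldToColumn.values()) {
            cols.add(quote(column) + " TEXT");
        }
        return "CREATE TABLE IF NOT EXISTS " + quote(table) + " (" + cols + ")";
    }

    /** Parameterized INSERT whose placeholders follow the mapping order. */
    String insertSql() {
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        for (String column : fieldToColumn.values()) {
            cols.add(quote(column));
            marks.add("?");
        }
        return "INSERT INTO " + quote(table) + " (" + cols + ") VALUES (" + marks + ")";
    }
}
```

The generated strings would be executed through JDBC against MySQL; binding the parsed values to the `?` placeholders keeps crawled content out of the SQL text itself.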
The crawler task in the invention supports two modes: independent operation and timed execution. In independent operation, the task runs once and is not executed again. In timed execution, an execution interval is set, a Cron expression is generated from it, and the task is triggered each time the interval elapses.
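Turning an execution interval into a Cron expression might look like the sketch below. It assumes the six-field Quartz format (seconds, minutes, hours, day of month, month, day of week); the patent does not name a scheduler or a Cron dialect, and the class `CronBuilder` is hypothetical. Intervals are restricted to values that divide the hour or the day evenly, because a step such as `0/7` restarts at the top of each hour and would not produce evenly spaced runs.

```java
/** Hypothetical helper that turns an execution interval into a Quartz-style Cron expression. */
final class CronBuilder {
    /** Fires at second 0 of every n-th minute; n must divide 60 so runs stay evenly spaced. */
    static String everyMinutes(int n) {
        if (n < 1 || n > 59 || 60 % n != 0) {
            throw new IllegalArgumentException("Minute interval must divide 60: " + n);
        }
        return "0 0/" + n + " * * * ?";
    }

    /** Fires at minute 0 of every n-th hour; n must divide 24 so runs stay evenly spaced. */
    static String everyHours(int n) {
        if (n < 1 || n > 23 || 24 % n != 0) {
            throw new IllegalArgumentException("Hour interval must divide 24: " + n);
        }
        return "0 0 0/" + n + " * * ?";
    }
}
```

For example, `CronBuilder.everyMinutes(5)` yields `0 0/5 * * * ?`, which a Quartz-compatible scheduler reads as "every five minutes".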
For scheduled data update, a unique key is set for the crawled data, and the data is updated using that unique key as the update marker. For a task whose data changes, a scheduled task can be selected for updating. The execution period of the scheduled task can be customized with a Cron expression; on each run the latest data is fetched, compared with the stored data, and then written to the database.
Scheduled data update supports full update and difference update: a full update empties the table and then inserts the data in batches, while a difference update writes only the changed part.
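One possible way to update rows by unique key in MySQL is the `INSERT ... ON DUPLICATE KEY UPDATE` statement, which inserts a new row or overwrites the existing one with the same key. The sketch below only builds such a statement; it is not necessarily how the patented method performs its comparison, the class `UpsertSql` is hypothetical, and it assumes that the key column carries a UNIQUE or PRIMARY KEY index and that the table and column names are trusted.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical builder for a MySQL upsert keyed on the crawled data's unique key. */
final class UpsertSql {
    /** Builds a parameterized upsert; uniqueKey must be one of columns and is never overwritten. */
    static String build(String table, List<String> columns, String uniqueKey) {
        if (!columns.contains(uniqueKey) || columns.size() < 2) {
            throw new IllegalArgumentException("Need the unique key plus at least one other column");
        }
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        StringJoiner updates = new StringJoiner(", ");
        for (String column : columns) {
            cols.add(column);
            marks.add("?");
            if (!column.equals(uniqueKey)) {
                // VALUES(col) refers to the value this INSERT tried to write for that column.
                updates.add(column + " = VALUES(" + column + ")");
            }
        }
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")"
                + " ON DUPLICATE KEY UPDATE " + updates;
    }
}
```

Recent MySQL 8.0 releases deprecate the `VALUES()` function in this position in favour of a row alias, so the exact syntax may need adjusting for the server version in use.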
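The contrast between the two update strategies can be sketched in memory, with each row reduced to a unique key and a value. The class `UpdatePlan` and its methods are hypothetical. The source does not say what happens to stored rows that no longer appear on the page, so this sketch leaves them untouched in the difference case.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical plan of database writes, keyed by the crawled data's unique key. */
final class UpdatePlan {
    final Map<String, String> inserts = new LinkedHashMap<>();
    final Map<String, String> updates = new LinkedHashMap<>();

    /** Full update: the table is emptied first, so every crawled row is inserted again. */
    static UpdatePlan full(Map<String, String> crawled) {
        UpdatePlan plan = new UpdatePlan();
        plan.inserts.putAll(crawled);
        return plan;
    }

    /** Difference update: only rows that are new or whose value changed are written. */
    static UpdatePlan difference(Map<String, String> stored, Map<String, String> crawled) {
        UpdatePlan plan = new UpdatePlan();
        for (Map.Entry<String, String> row : crawled.entrySet()) {
            if (!stored.containsKey(row.getKey())) {
                plan.inserts.put(row.getKey(), row.getValue());
            } else if (!Objects.equals(stored.get(row.getKey()), row.getValue())) {
                plan.updates.put(row.getKey(), row.getValue());
            }
        }
        return plan;
    }
}
```

A full update is simpler but rewrites every row on each run, while a difference update touches fewer rows at the cost of reading the stored data for comparison.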
The configurable webpage information crawling method based on Java Web uses task flows to chain multiple nodes in series. It can crawl specific webpage information, store the parsed data in a MySQL database, and update that data on a schedule. The crawling information can be set in general mode or in custom mode (using YAML), and the configuration can be tested. Webpage information data in several formats can be parsed by different components. Through configuration, tables and fields can be created and updated automatically, and the parsed webpage information is stored in the MySQL database automatically. The webpage information data in the MySQL database can be updated periodically using several strategies.
The above-described embodiments are merely preferred embodiments of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.
Claims (9)
1. A configurable webpage information crawling method based on Java Web, characterized in that: the method is presented in the form of task flows; each task flow comprises crawling information configuration, webpage information parsing, data storage configuration, and scheduled data update; specific webpage information is crawled, the parsed data is stored in a MySQL database, and scheduled update of the data is completed.
2. The configurable webpage information crawling method based on Java Web according to claim 1, wherein: the crawling information configuration supports a general mode and a custom mode; in the general mode, the required configuration parameters are entered directly; in the custom mode, parameters in YAML format are preset, and a user selects the parameters to use as required.
3. The configurable webpage information crawling method based on Java Web according to claim 2, wherein: the configuration items of the general mode comprise a request address, request parameters, and a request header for acquiring webpage information data.
4. The configurable webpage information crawling method based on Java Web according to claim 3, wherein: the preset parameters comprise configuration for token acquisition, cookies, the password encryption mode, and verification code parameters.
5. The configurable webpage information crawling method based on Java Web according to claim 4, wherein: the webpage information parsing begins with a test of the configuration after the crawling information configuration is completed; if the test passes, the web crawler runs to acquire data, obtains a response message, and parses the message.
6. The configurable webpage information crawling method based on Java Web according to claim 5, wherein: the webpage data parsing supports data in HTML, XML, and JSON formats, wherein messages in HTML and JSON formats are parsed with the Jsoup component, and messages in XML format are parsed with dom4j.
7. The configurable webpage information crawling method based on Java Web according to claim 6, wherein: the data storage configuration supports setting the correspondence between the results of the webpage information parsing and a MySQL data table, and the resulting configuration file is persisted in the database.
8. The configurable webpage information crawling method based on Java Web according to claim 7, wherein: for the scheduled data update, a unique key is set for the crawled data, and the data is updated using the unique key as the update marker.
9. The configurable webpage information crawling method based on Java Web according to claim 8, wherein: the scheduled data update supports full update and difference update; the full update empties the data in the table and then inserts the data in batches, and the difference update writes only the changed part.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110094952.4A CN112800307A (en) | 2021-01-25 | 2021-01-25 | Configurable webpage information crawling method based on Java Web |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110094952.4A CN112800307A (en) | 2021-01-25 | 2021-01-25 | Configurable webpage information crawling method based on Java Web |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112800307A true CN112800307A (en) | 2021-05-14 |
Family
ID=75811478
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110094952.4A Pending CN112800307A (en) | 2021-01-25 | 2021-01-25 | Configurable webpage information crawling method based on Java Web |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112800307A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100138485A1 (en) * | 2008-12-03 | 2010-06-03 | William Weiyeh Chow | System and method for providing virtual web access |
CN109597952A (en) * | 2018-12-10 | 2019-04-09 | 江苏满运软件科技有限公司 | Web information processing method, system, electronic equipment and storage medium |
CN110245278A (en) * | 2018-09-05 | 2019-09-17 | 爱信诺征信有限公司 | Acquisition method, device, electronic equipment and the storage medium of web data |
Non-Patent Citations (1)
Title |
---|
程序O人生: "java定时爬取数据" (Scheduled data crawling with Java), CSDN * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210514 |