CN112800307A - Configurable webpage information crawling method based on Java Web - Google Patents

Configurable webpage information crawling method based on Java Web

Info

Publication number
CN112800307A
CN112800307A CN202110094952.4A CN202110094952A CN112800307A CN 112800307 A CN112800307 A CN 112800307A CN 202110094952 A CN202110094952 A CN 202110094952A CN 112800307 A CN112800307 A CN 112800307A
Authority
CN
China
Prior art keywords
data
configurable
configuration
information
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110094952.4A
Other languages
Chinese (zh)
Inventor
肖培玉
魏金雷
徐士强
梁圣奇
赵子恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202110094952.4A
Publication of CN112800307A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a configurable webpage information crawling method based on Java Web and belongs to the technical field of software development. The method is organized as task flows; each task flow comprises crawling information configuration, webpage information parsing, data storage configuration and timed data update. Specific webpage information is crawled, the parsed data is stored in a MySQL database, and the data is updated on a schedule. The method can update webpage data automatically, keeps the crawled data synchronized with its source, ensures the validity and timeliness of the data, and has good popularization and application value.

Description

Configurable webpage information crawling method based on Java Web
Technical Field
The invention relates to the technical field of software development, and particularly provides a configurable webpage information crawling method based on Java Web.
Background
Java is a widely used computer programming language. It is cross-platform and object-oriented, supports generic programming, and is widely applied to enterprise-level Web application development and mobile application development. MySQL is an open-source relational database management system (RDBMS) that uses the most common database management language, Structured Query Language (SQL), for database management.
In some working scenarios, data must be acquired from a specific website. Because the website may update its data at irregular times, the acquired copy cannot stay synchronized with the latest data, and its validity and timeliness cannot be guaranteed.
Disclosure of Invention
The technical task of the invention is to provide a configurable webpage information crawling method based on Java Web that can update webpage data automatically, keep the crawled data synchronized with its source, and ensure the validity and timeliness of the data.
In order to achieve this purpose, the invention provides the following technical solution:
A configurable webpage information crawling method based on Java Web, organized as task flows, wherein each task flow comprises crawling information configuration, webpage information parsing, data storage configuration and timed data update; specific webpage information is crawled, the parsed data is stored in a MySQL database, and the timed update of the data is completed.
The method is organized in task-flow form; that is, each crawler is an independent task flow.
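The task-flow structure can be pictured with a minimal Java sketch. The class name `CrawlTaskFlow`, its methods, and the use of a plain string as the value passed between stages are all hypothetical; the patent does not specify a class design, so this only illustrates the idea of one crawler being one ordered flow of stages.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: one crawler is one independent task flow of ordered stages. */
final class CrawlTaskFlow {
    private final List<String> stageNames = new ArrayList<>();
    private final List<UnaryOperator<String>> stageActions = new ArrayList<>();

    /** Appends a named stage; stages run in the order they were added. */
    CrawlTaskFlow then(String name, UnaryOperator<String> action) {
        stageNames.add(name);
        stageActions.add(action);
        return this;
    }

    /** Runs every stage in order, feeding each stage's output to the next one. */
    String run(String input) {
        String current = input;
        for (UnaryOperator<String> action : stageActions) {
            current = action.apply(current);
        }
        return current;
    }

    /** Read-only view of the stage names, in execution order. */
    List<String> stageNames() {
        return Collections.unmodifiableList(stageNames);
    }
}
```

Under these assumptions, a flow would be assembled by chaining four `then(...)` calls, one each for crawling information configuration, webpage information parsing, data storage configuration and timed data update.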
Preferably, the crawling information configuration supports a general mode and a custom mode. The general mode is configured by entering the correct configuration parameters; in the custom mode, parameters in YAML format are preset and the user selects the parameters to use as required.
Preferably, the configuration items of the general mode include a request address, request parameters and a request header used to acquire webpage information data.
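As an illustration of the three general-mode items, the following Java sketch holds a request address, request parameters and request headers and turns them into an HTTP GET request with the JDK's `java.net.http` API. The class name `GeneralCrawlConfig`, its methods, and the choice of GET are assumptions made for this example; the patent does not state how the request is built or sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical holder for the general-mode items: address, parameters, headers. */
final class GeneralCrawlConfig {
    private final String requestAddress;
    private final Map<String, String> requestParams = new LinkedHashMap<>();
    private final Map<String, String> requestHeaders = new LinkedHashMap<>();

    GeneralCrawlConfig(String requestAddress) {
        this.requestAddress = requestAddress;
    }

    GeneralCrawlConfig param(String name, String value) {
        requestParams.put(name, value);
        return this;
    }

    GeneralCrawlConfig header(String name, String value) {
        requestHeaders.put(name, value);
        return this;
    }

    /** Appends the URL-encoded parameters to the address as a query string. */
    URI toUri() {
        if (requestParams.isEmpty()) {
            return URI.create(requestAddress);
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : requestParams.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return URI.create(requestAddress + "?" + query);
    }

    /** Builds a GET request carrying the configured headers; sending it is left to the caller. */
    HttpRequest toRequest() {
        HttpRequest.Builder builder = HttpRequest.newBuilder(toUri()).GET();
        requestHeaders.forEach(builder::header);
        return builder.build();
    }
}
```

The resulting `HttpRequest` could then be passed to `HttpClient.send(...)`; that step is omitted here so the sketch stays free of network access.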
Preferably, the preset parameters include configuration for token acquisition, cookies, the password encryption mode and verification code parameters.
Preferably, the webpage information parsing begins with a test of the configuration once the crawling information configuration is complete; if the test reports no error, the webpage is crawled to acquire data, a response message is obtained, and the message is parsed.
Preferably, the webpage data parsing supports data in HTML, XML and JSON formats; messages in HTML and JSON formats are parsed with the Jsoup component, and messages in XML format are parsed with Dom4j.
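The format-dependent parsing step can be sketched as a small dispatcher. The patent names Jsoup and Dom4j; to keep the example runnable without third-party libraries, this sketch swaps in the JDK's built-in DOM parser for the XML branch and only classifies HTML and JSON without parsing them. The class `ResponseFormats` and the rule of detecting the format from the first characters of the body are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper that classifies a response body and reads a value from XML. */
final class ResponseFormats {
    enum Format { HTML, XML, JSON }

    /** Guesses the format from the first non-whitespace characters of the body. */
    static Format detect(String body) {
        String trimmed = body.trim();
        if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
            return Format.JSON;
        }
        String lower = trimmed.toLowerCase(Locale.ROOT);
        if (lower.startsWith("<!doctype html") || lower.startsWith("<html")) {
            return Format.HTML;
        }
        if (trimmed.startsWith("<")) {
            return Format.XML;
        }
        throw new IllegalArgumentException("Unrecognized response format");
    }

    /** Returns the text of the first element with the given tag name, or null if absent. */
    static String firstXmlText(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Crawled XML is untrusted input, so DOCTYPE declarations are rejected.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Response is not well-formed XML", e);
        }
    }
}
```

In a real deployment the `Content-Type` response header would usually be a more reliable signal than the body prefix; the prefix check is used here only because it needs no request context.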
The method of the invention supports persisting crawled data into a database. Note that the data storage configuration must be completed before data is stored.
Preferably, the data storage configuration supports setting the correspondence between the result of the webpage information parsing and the MySQL data table, and forms a configuration file that is persisted in the database.
The data storage configuration contains the following information: the table in the MySQL database that corresponds to the data crawling task; and the correspondence between the body of the task's response message and the MySQL fields, that is, between the fields (or content nodes) of the message and the fields of the MySQL table. Based on the data storage configuration, database tables and table fields can be created and updated automatically; in addition, the data returned by each crawl of each task is stored in the corresponding (preset) data table of the MySQL database.
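A minimal sketch of such a data storage configuration, assuming it can be modeled as a table name plus an ordered mapping from message fields to MySQL columns. The class `StorageMapping`, the column types, and the generated statements are hypothetical; the patent does not disclose the configuration file layout or the SQL it issues.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical storage configuration: one task, one table, message field to column. */
final class StorageMapping {
    private final String tableName;
    private final Map<String, String> fieldToColumn = new LinkedHashMap<>();
    private final Map<String, String> columnTypes = new LinkedHashMap<>();

    StorageMapping(String tableName) {
        this.tableName = tableName;
    }

    /** Maps a message field (or content node) to a MySQL column of the given type. */
    StorageMapping map(String messageField, String column, String sqlType) {
        fieldToColumn.put(messageField, column);
        columnTypes.put(column, sqlType);
        return this;
    }

    /** Looks up the column that a parsed message field should be written to. */
    String columnFor(String messageField) {
        return fieldToColumn.get(messageField);
    }

    /** DDL for automatic table creation. Names come from trusted configuration only. */
    String createTableSql() {
        StringJoiner cols = new StringJoiner(", ");
        columnTypes.forEach((column, type) -> cols.add("`" + column + "` " + type));
        return "CREATE TABLE IF NOT EXISTS `" + tableName + "` (" + cols + ")";
    }

    /** Parameterized INSERT; values are bound later through a PreparedStatement. */
    String insertSql() {
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        for (String column : columnTypes.keySet()) {
            cols.add("`" + column + "`");
            marks.add("?");
        }
        return "INSERT INTO `" + tableName + "` (" + cols + ") VALUES (" + marks + ")";
    }
}
```

Table and column names are concatenated into the SQL text, so in practice they would need to be validated before use; only the row values go through `?` placeholders.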
The crawler task in the invention supports two modes: standalone execution and timed execution. In standalone execution, the task runs once and is not executed again. In timed execution, an execution interval is set, a Cron expression is generated from it, and the task is triggered each time the interval elapses.
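Turning an execution interval into a Cron expression can be sketched as follows. The patent does not say which Cron dialect is used; this example assumes the six-field Quartz style (seconds, minutes, hours, day of month, month, day of week), and the class and method names are hypothetical.

```java
/** Hypothetical generator of Quartz-style Cron expressions from a simple interval. */
final class CronExpressions {
    private CronExpressions() {
    }

    /** Fires at second 0 of every n-th minute, counted from minute 0 of each hour. */
    static String everyMinutes(int minutes) {
        if (minutes < 1 || minutes > 59) {
            throw new IllegalArgumentException("minutes must be between 1 and 59");
        }
        return "0 0/" + minutes + " * * * ?";
    }

    /** Fires on the hour every n-th hour, counted from midnight of each day. */
    static String everyHours(int hours) {
        if (hours < 1 || hours > 23) {
            throw new IllegalArgumentException("hours must be between 1 and 23");
        }
        return "0 0 0/" + hours + " * * ?";
    }
}
```

Because the step restarts at each hour or day boundary, intervals that do not divide 60 (or 24) evenly produce one shorter gap per cycle; a scheduler with a fixed-delay trigger would avoid that at the cost of not using Cron.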
Preferably, for the timed data update, a unique key is set for the crawled data, and the data is updated using the unique key as the marker for the timed update. For a task whose data changes, a timed task can be selected for updating; its execution period can be customized with a Cron expression so that the latest data is fetched, compared with the stored data, and then written to the database.
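One way to update rows by unique key in MySQL is an `INSERT ... ON DUPLICATE KEY UPDATE` statement, which requires a UNIQUE or PRIMARY KEY index on the key column. The sketch below only builds the statement text; the class `UpsertSql` is hypothetical, and the patent does not state that this particular statement is what the method uses.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical builder for a MySQL upsert keyed on one unique column. */
final class UpsertSql {
    private UpsertSql() {
    }

    /** Inserts a row, or updates the non-key columns when the unique key already exists. */
    static String build(String table, List<String> columns, String uniqueKey) {
        if (!columns.contains(uniqueKey)) {
            throw new IllegalArgumentException("uniqueKey must be one of the columns");
        }
        if (columns.size() < 2) {
            throw new IllegalArgumentException("at least one non-key column is required");
        }
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        StringJoiner updates = new StringJoiner(", ");
        for (String column : columns) {
            cols.add("`" + column + "`");
            marks.add("?");
            if (!column.equals(uniqueKey)) {
                // VALUES(col) refers to the value that the INSERT part tried to write.
                updates.add("`" + column + "` = VALUES(`" + column + "`)");
            }
        }
        return "INSERT INTO `" + table + "` (" + cols + ") VALUES (" + marks
                + ") ON DUPLICATE KEY UPDATE " + updates;
    }
}
```

Newer MySQL 8.0 releases deprecate `VALUES(col)` in this position in favor of a row alias, so the statement text may need adjusting for the server version in use.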
Preferably, the timed data update supports full update and difference update: a full update empties the table and then inserts the data in batches, whereas a difference update updates only the changed part.
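The difference update can be illustrated by comparing stored rows with freshly crawled rows, both indexed by the unique key. In this sketch each row is reduced to a single string value for brevity, and the class `DifferenceUpdate` is hypothetical. Whether rows that disappear from the source are deleted is not stated in the patent, so deletions are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical planner that finds which crawled rows are new or changed. */
final class DifferenceUpdate {
    private DifferenceUpdate() {
    }

    /** Rows to insert (unknown key) and rows to update (known key, changed content). */
    static final class Plan {
        final Map<String, String> toInsert = new LinkedHashMap<>();
        final Map<String, String> toUpdate = new LinkedHashMap<>();
    }

    /** Both maps go from unique key to row content; unchanged rows are skipped. */
    static Plan plan(Map<String, String> stored, Map<String, String> crawled) {
        Plan plan = new Plan();
        for (Map.Entry<String, String> row : crawled.entrySet()) {
            String key = row.getKey();
            if (!stored.containsKey(key)) {
                plan.toInsert.put(key, row.getValue());
            } else if (!Objects.equals(stored.get(key), row.getValue())) {
                plan.toUpdate.put(key, row.getValue());
            }
        }
        return plan;
    }
}
```

A full update, by contrast, needs no comparison: the table is emptied and every crawled row is inserted in batches, which is simpler but rewrites unchanged rows as well.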
Compared with the prior art, the configurable webpage information crawling method based on Java Web has the following outstanding beneficial effects: specific webpage information can be crawled through configuration, the information content is stored in a MySQL database, a reasonable update time can be set, the crawled data can be kept synchronized with its source, and the method has good popularization and application value.
Drawings
FIG. 1 is a flowchart of the configurable webpage information crawling method based on Java Web according to the present invention.
Detailed Description
The configurable webpage information crawling method based on Java Web is described in detail below with reference to the accompanying drawing and an embodiment.
Examples
As shown in FIG. 1, the configurable webpage information crawling method based on Java Web of the present invention is organized as task flows. Each task flow includes crawling information configuration, webpage information parsing, data storage configuration and timed data update; specific webpage information is crawled, and the parsed data is stored in a MySQL database, completing the timed update of the data. The method is organized in task-flow form; that is, each crawler is an independent task flow.
The crawling information configuration supports a general mode and a custom mode.
General mode: configured by entering the correct configuration parameters. The configuration items include the request address, request parameters and request header used to acquire webpage information data.
Custom mode: this mode supports a high degree of customization. Various parameters in YAML format, including parameters for commonly used scenarios, are preset; the user selects the parameters to use as required. The preset parameters include configuration for token acquisition, cookies, the password encryption mode, verification codes and the like.
The webpage information parsing begins with a test of the configuration once the crawling information configuration is complete; if the test reports no error, the webpage is crawled to acquire data, a response message is obtained, and the message is parsed. The webpage data parsing supports data in HTML, XML and JSON formats; messages in HTML and JSON formats are parsed with the Jsoup component, and messages in XML format are parsed with Dom4j. The method of the invention supports persisting crawled data into a database. Note that the data storage configuration must be completed before data is stored.
The data storage configuration supports setting the correspondence between the result of the webpage information parsing and the MySQL data table, and forms a configuration file that is persisted in the database.
The data storage configuration contains the following information: the table in the MySQL database that corresponds to the data crawling task; and the correspondence between the body of the task's response message and the MySQL fields, that is, between the fields (or content nodes) of the message and the fields of the MySQL table. Based on the data storage configuration, database tables and table fields can be created and updated automatically; in addition, the data returned by each crawl of each task is stored in the corresponding (preset) data table of the MySQL database.
The crawler task in the invention supports two modes: standalone execution and timed execution. In standalone execution, the task runs once and is not executed again. In timed execution, an execution interval is set, a Cron expression is generated from it, and the task is triggered each time the interval elapses.
For the timed data update, a unique key is set for the crawled data, and the data is updated using the unique key as the marker for the timed update. For a task whose data changes, a timed task can be selected for updating; its execution period can be customized with a Cron expression so that the latest data is fetched, compared with the stored data, and then written to the database.
The timed data update supports full update and difference update: a full update empties the table and then inserts the data in batches, whereas a difference update updates only the changed part.
The configurable webpage information crawling method based on Java Web uses task flows to chain multiple nodes in series; it can crawl specific webpage information, store the parsed data in a MySQL database, and update the data on a schedule. The crawling information can be configured in the general mode or the custom mode (using YAML) and can be tested. Webpage data in several formats can be parsed by different components. Through configuration, tables and fields can be created and updated automatically, and the parsed webpage information is stored in the MySQL database automatically. The webpage data in the MySQL database can be updated on a schedule using several strategies.
The above-described embodiments are merely preferred embodiments of the present invention; ordinary changes and substitutions made by those skilled in the art within the technical scope of the present invention fall within the protection scope of the present invention.

Claims (9)

1. A configurable webpage information crawling method based on Java Web, characterized in that: the method is organized as task flows, each task flow comprises crawling information configuration, webpage information parsing, data storage configuration and timed data update, specific webpage information is crawled, the parsed data is stored in a MySQL database, and the timed update of the data is completed.
2. The configurable webpage information crawling method based on Java Web according to claim 1, wherein: the crawling information configuration supports a general mode and a custom mode, the general mode is configured by entering the correct configuration parameters, parameters in YAML format are preset in the custom mode, and the user selects the parameters to use as required.
3. The configurable webpage information crawling method based on Java Web according to claim 2, wherein: the configuration items of the general mode comprise a request address, request parameters and a request header used to acquire webpage information data.
4. The configurable webpage information crawling method based on Java Web according to claim 3, wherein: the preset parameters comprise configuration for token acquisition, cookies, the password encryption mode and verification code parameters.
5. The configurable webpage information crawling method based on Java Web according to claim 4, wherein: the webpage information parsing begins with a test of the configuration once the crawling information configuration is complete, and if the test reports no error, the webpage is crawled to acquire data, a response message is obtained, and the message is parsed.
6. The configurable webpage information crawling method based on Java Web according to claim 5, wherein: the webpage data parsing supports data in HTML, XML and JSON formats, wherein messages in HTML and JSON formats are parsed with the Jsoup component, and messages in XML format are parsed with Dom4j.
7. The configurable webpage information crawling method based on Java Web according to claim 6, wherein: the data storage configuration supports setting the correspondence between the result of the webpage information parsing and the MySQL data table, and forms a configuration file that is persisted in the database.
8. The configurable webpage information crawling method based on Java Web according to claim 7, wherein: for the timed data update, a unique key is set for the crawled data, and the data is updated using the unique key as the marker for the timed update.
9. The configurable webpage information crawling method based on Java Web according to claim 8, wherein: the timed data update supports full update and difference update, the full update empties the table and then inserts the data in batches, and the difference update updates only the changed part.
CN202110094952.4A 2021-01-25 2021-01-25 Configurable webpage information crawling method based on Java Web Pending CN112800307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110094952.4A CN112800307A (en) 2021-01-25 2021-01-25 Configurable webpage information crawling method based on Java Web

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110094952.4A CN112800307A (en) 2021-01-25 2021-01-25 Configurable webpage information crawling method based on Java Web

Publications (1)

Publication Number Publication Date
CN112800307A true CN112800307A (en) 2021-05-14

Family

ID=75811478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110094952.4A Pending CN112800307A (en) 2021-01-25 2021-01-25 Configurable webpage information crawling method based on Java Web

Country Status (1)

Country Link
CN (1) CN112800307A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100138485A1 (en) * 2008-12-03 2010-06-03 William Weiyeh Chow System and method for providing virtual web access
CN109597952A (en) * 2018-12-10 2019-04-09 江苏满运软件科技有限公司 Web information processing method, system, electronic equipment and storage medium
CN110245278A (en) * 2018-09-05 2019-09-17 爱信诺征信有限公司 Acquisition method, device, electronic equipment and the storage medium of web data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程序O人生: "Scheduled data crawling with Java" (java定时爬取数据), CSDN *

Similar Documents

Publication Publication Date Title
CN102567539B (en) Intelligent WEB report implementation method and intelligent WEB report implementation system
US7216340B1 (en) Analysis data validation tool for use in enterprise architecture modeling with result based model updating
JP5065056B2 (en) Method, computer program, and system for processing a workflow (integrating data management operations into a workflow system)
US7707544B2 (en) System and method for generating and reusing software application code with source definition files
US7424702B1 (en) Data integration techniques for use in enterprise architecture modeling
CN112650766B (en) Database data operation method, system and server
US10324929B2 (en) Provision of position data for query runtime errors
EP1585036A2 (en) Management of parameterized database queries
US20080250394A1 (en) Synchronizing external documentation with code development
US9361398B1 (en) Maintaining a relational database and its schema in response to a stream of XML messages based on one or more arbitrary and evolving XML schemas
US20170031661A1 (en) Systems and methods for transactional applications in an unreliable wireless network
WO2004086222A2 (en) Development of software systems
US6567819B1 (en) Run time objects
US20140244680A1 (en) Sql query parsing and translation
US7831614B2 (en) System and method for generating SQL using templates
CN107315764B (en) Method and system for updating non-relational database associated data
CN109426725A (en) Data desensitization method, equipment and computer readable storage medium
CN104679500B (en) Method and device for realizing automatic generation of entity class
CN112434059A (en) Data processing method, data processing device, computer equipment and storage medium
CN105975489A (en) Metadata-based online SQL code completion method
US7650276B2 (en) System and method for dynamic data binding in distributed applications
CN103927168A (en) Object-oriented data model persistence method and device
CN111881043B (en) Page testing method and device, storage medium and processor
CN112800307A (en) Configurable webpage information crawling method based on Java Web
CN101662843A (en) Fast establishing method of WAP website

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210514