CN114491206A - General low-code crawler method and system for news blog websites - Google Patents

General low-code crawler method and system for news blog websites

Info

Publication number
CN114491206A
CN114491206A (application CN202210001246.5A)
Authority
CN
China
Prior art keywords
article
website
data
configuration
crawling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210001246.5A
Other languages
Chinese (zh)
Inventor
杨国武
谈振伟
杜佩佩
孙相鹏
董广县
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202210001246.5A
Publication of CN114491206A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/951: Indexing; Web crawling techniques
    • G06F 16/954: Navigation, e.g. using categorised browsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a universal low-code crawler method and system for news blog websites, belonging to the technical field of web crawlers. The method comprises the following core steps: creating a configuration file for each website to be crawled; selecting an operation mode and loading the configurations of a target website; extracting all links and category names from the category navigation in the target website's navigation bar; for each category navigation link, extracting all article links from the article list page and adding them to a to-be-crawled list; for each article in the to-be-crawled list, extracting the various information items from the article page resource and persisting them; and repeating until all crawling tasks are completed. The crawler system mainly provides custom function extension, multi-task management, multiple persistent storage modes, compatibility with different types of websites, and log and progress management. With the method, article crawling meeting basic requirements can be completed merely by adding the configurations of the website to be crawled, which greatly improves the development and maintenance efficiency of crawler programs.

Description

General low-code crawler method and system for news blog websites
Technical Field
The invention belongs to the technical field of web crawlers, and particularly relates to a universal low-code crawler method and system for news blog websites.
Background
In the Internet era of information explosion, collecting data manually is no longer feasible, and web crawler programs have become an important means of acquiring network data resources. Crawling news/article data (note: "news" and "article" refer to the same concept throughout this patent description) from news blog websites, particularly news websites, is a primary data collection method, and the collected data can generally be used for database index construction, news resource integration, data mining, AI model training, and the like.
Currently, mainstream crawling techniques fall into two categories.
The first is to develop a dedicated crawler for each website.
Different websites organize their content differently and have different webpage structures, so when the required data must be acquired, each website needs manual element analysis, and specific code must be written for everything from acquiring webpage links to cleaning the data.
The disadvantages of this approach stand out when developing multi-data-source crawlers (crawlers covering multiple websites). First, every website requires specific analysis and specially written processing code, which consumes a great deal of time and labor. Second, the code written for each website differs, making maintenance difficult when the number of websites is large. Finally, the overall structure of any website may change at any time, and even a small structural change can disable a crawler adapted only to that specific structure.
The second is development based on a mainstream crawler framework.
With the development of crawler technology, mature crawler frameworks such as Scrapy integrate the general functions of webpage collection into modules; when developing a crawler, programmers only need to focus on their own crawling requirements and call the modules directly, following the framework's usage specifications, to complete the crawler program.
However, this approach is not ideal in practice. A mature framework means higher development and learning costs, and although a crawler framework accommodates many crawling requirements, many functions still have to be written by hand for specific domains and needs, so the development cost remains high when building crawlers for multiple websites. In addition, framework-based development may bring greater memory consumption and higher debugging difficulty.
Since most news blog websites share a common page layout, usually organized as "category navigation" → "article list under each category" → "specific article content", a crawler development framework targeting such websites can be designed to implement crawler development and maintenance efficiently, overcoming the shortcomings of both mainstream technical routes at once.
Disclosure of Invention
To address the problems in the prior art, the invention provides a universal low-code crawler method and system for news blog websites, with which news crawling meeting basic requirements can be completed merely by manually adding the configurations of the websites to be crawled, greatly improving the development and maintenance efficiency of crawler programs for news blog websites.
The technical scheme adopted by the invention is as follows:
a universal low code crawler method for news blog websites is characterized by comprising the following steps:
step 1: manually creating configuration files of all websites to be crawled and filling the configuration files into configuration items of all websites to be crawled;
and 2, step: starting to operate, selecting an operation mode, determining a target website in the websites to be crawled, loading each configuration item of the target website and verifying the configuration;
and step 3: requesting and acquiring a home page of a target website, and extracting all classified navigation links and corresponding classified names of navigation bars;
and 4, step 4: for each classified navigation link, requesting and acquiring an article list page corresponding to the classified navigation link, extracting all article links, and adding the article links into a list to be crawled;
and 5: requesting and acquiring webpage resources corresponding to the chapter links in the list to be crawled, and extracting various information in the webpage resources according to the corresponding configuration files to serve as crawled data;
step 6: intelligently checking the crawled data, filtering abnormal data and storing the abnormal data into a database or a local file;
and 7: and repeating the step 5 and the step 6 until the crawling of all article links in the list to be crawled is completed, so that the crawling of the target website is completed.
Further, the configuration file created in step 1 includes basic configuration, storage configuration, configuration of each data item selector, and other configurations. The basic configuration includes the website's name, type, encoding format, and home page address, where the website type is either dynamic or static. The storage configuration includes the data storage mode, local storage path, database connection information, and the storage directories for logs and progress. The configuration of each data item selector specifies the data extraction expression type, either a css data extraction function or an xpath data extraction function, for the category navigation links, article links, article content, article abstract, article title, article date, and article author. The other configurations include the network request frequency, the maximum amount of data crawled in a single run, the multi-category-navigation concurrent crawling configuration item, and the paging parameter configuration, page turning configuration, and starting page number of the article list pages.
Further, the operation mode in step 2 comprises single-website crawling, multi-website serial crawling, and multi-website concurrent crawling.
Further, when the selected operation mode is single-website crawling, step 2 determines the target website through command line interaction.
Further, when the selected operation mode is multi-website serial crawling, step 2 takes each website to be crawled as the target website in turn.
Further, when the selected operation mode is multi-website concurrent crawling, step 2 automatically determines the number of simultaneous task branches according to the system resource load; each task branch crawls its next target website only after finishing its previous one, until every website to be crawled has served as a target website.
Further, the specific process of step 3 is:
Step 3.1: according to the home page address and website type of the target website filled in the configuration file, sending a request to the target website and verifying the status code and webpage text of the response; if the status code is 200, the home page of the target website is acquired successfully; if the request fails or the status code is not 200, the request is resent at most 3 times, and if it still fails, the crawling of the target website is stopped while a corresponding error message is output and recorded in the log;
Step 3.2: converting the home page of the target website into a target-website structured object containing each element node of the home page;
Step 3.3: extracting all category navigation links and corresponding category names of the navigation bar from the target-website structured object, based on the data extraction expression type of the category navigation links filled in the configuration file.
Further, the specific process of step 4 is:
Step 4.1: for each category navigation link, according to the link and the website type, sending a request to the article list page it points to and verifying the status code and webpage text of the response; if the status code is 200, the article list page is acquired successfully; if the request fails or the status code is not 200, the request is resent at most 3 times, and if it still fails, the crawling of that article list page ends while a corresponding error message is output and recorded in the log;
Step 4.2: converting the article list page into an article-list-page structured object containing each element node of the page;
Step 4.3: based on the data extraction expression type of the article links filled in the configuration file, extracting all article links from the article-list-page structured object, judging whether each extracted link is complete, performing intelligent splicing completion if it is not, and adding the completed article links to the to-be-crawled list.
Further, the configuration file created in step 1 further includes a multi-category-navigation concurrent crawling configuration item; if the target website enables this item, step 4 crawls a plurality of category navigation links concurrently.
Further, the specific process of step 5 is:
Step 5.1: according to the article link and the website type, sending a request to the article webpage resource the link points to and verifying the status code and webpage text of the response; if the status code is 200, the article webpage resource is acquired successfully; if the request fails or the status code is not 200, the request is resent at most 3 times, and if it still fails, the current article link is skipped and crawling continues with the next article link in the to-be-crawled list, while a corresponding error message is output and recorded in the log;
Step 5.2: converting the article webpage resource into an article-webpage-resource structured object containing each element node of the resource;
Step 5.3: based on the data extraction expression types of the article content, article abstract, article title, article date, and article author filled in the configuration file, extracting the corresponding article content, abstract, title, date, and author from the structured object as the crawled data; if information is missing from the webpage resource, default values are filled in, and a missing article date is filled with the time at which the current article webpage resource was acquired.
Further, the specific process of step 6 is:
Step 6.1: intelligently checking each field of the crawled data; if a field is missing or abnormal, the data is judged invalid and not stored, an invalidity message is output, and the corresponding article link is recorded in the log;
Step 6.2: taking the article content and article title of the crawled data that passed the intelligent check as the duplication criterion, judging whether identical data already exist in the database or local file; if so, the article link is recorded in the log; otherwise, the crawled data is stored into the database or local file.
Further, in step 1, a custom processing code segment may be added, according to actual requirements, to a designated code file in the crawler code file directory; the segment is executed after the crawled data is extracted in step 5. The functions it performs include additional information extraction, format conversion, secondary processing, or custom filtering and screening.
A universal low-code crawler system for news blog websites is characterized by comprising a configuration loading module, a page resource loading module, a data extraction module, a data storage module, an asynchronous multi-task management module, and a log and progress management module;
the configuration loading module is used for determining a target website according to the selected operation mode, loading the created configuration file of the target website, verifying the configuration, initializing based on each configuration item after verification succeeds, and starting the other modules of the system;
the page resource loading module is used for selecting the corresponding page loading function based on the configured website type, according to the input target-website home page address, category navigation link, or article link, and loading the corresponding page resource;
the data extraction module is used for identifying the configured data extraction expression type of the category navigation links, article content, article abstract, article title, article date, or article author, selecting the css data extraction function or the xpath data extraction function according to the identification result, and extracting the corresponding data information from each page resource;
the data storage module is used for intelligently verifying the data information extracted from the page resource corresponding to each article link and, after verification succeeds, checking the article for duplication and storing the data information into a database or a local file according to the configured data storage mode, local storage path, or database connection information;
the log and progress management module records, in a local log file under the configured log and progress storage directory, error messages for failed page resource requests and for article links that fail the intelligent verification or are duplicates; it also records the crawling progress of the target website in a local progress file in real time, so that the previous crawling progress can be resumed on the next run or when rerunning after an abnormal exit;
the asynchronous multi-task management module is used for scheduling based on node.js's asynchronous processing capability and cooperating with the other modules to complete multi-website concurrent crawling or multi-category-navigation concurrent crawling.
Further, the data extraction module runs the custom processing code segment after extracting the data information of the page resource corresponding to an article link.
The invention has the following beneficial effects:
1. The invention provides a universal low-code crawler method and system for news blog websites; news crawling meeting basic requirements can be completed merely by manually adding the configurations of the websites to be crawled. The method suits most news blog websites and, compared with conventional hand-written development, greatly improves the development and maintenance efficiency of crawler programs for such websites.
2. The method has log and progress management functions: if the crawling process is interrupted by external factors such as a system fault or network outage, the last interruption record is looked up in the local log file or local progress file and crawling resumes from that point, improving crawling efficiency and ensuring data integrity, while letting developers inspect abnormal webpages without affecting the crawling progress.
3. The method has an asynchronous multi-task management function, realizing multi-website concurrent crawling or multi-category-navigation concurrent crawling and greatly improving crawling efficiency for large-scale crawling requirements.
4. Preferably, a custom processing code segment can be added to a designated code file in the crawler code file directory according to actual requirements, so that the crawling functionality can be extended flexibly and conveniently on top of a universal low-code crawler system that meets basic requirements, giving the method wide applicability.
Drawings
Fig. 1 is a flowchart of the universal low-code crawler method for news blog websites according to embodiment 1 of the present invention;
Fig. 2 is an architecture diagram of the universal low-code crawler system for news blog websites according to embodiment 1 of the present invention;
Fig. 3 is a structural diagram of the page resource loading module in the universal low-code crawler system for news blog websites according to embodiment 1 of the present invention;
Fig. 4 is a structural diagram of the data extraction module in the universal low-code crawler system for news blog websites according to embodiment 1 of the present invention;
Fig. 5 is a structural diagram of the data storage module in the universal low-code crawler system for news blog websites according to embodiment 1 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
This embodiment provides a universal low-code crawler system for news blog websites developed on node.js, whose structure is shown in fig. 2; it comprises a configuration loading module, a page resource loading module, a data extraction module, a data storage module, an asynchronous multi-task management module, and a log and progress management module, wherein:
as shown in fig. 3, the page resource loading module includes a URL intelligent splicing processing module, a puppeteer-based dynamic page loading module, and an axios-based static page loading module;
as shown in fig. 4, the data extraction module includes a selector type identification module, a data object generation module, a css-based data extraction module, and an xpath-based data extraction module;
as shown in fig. 5, the data storage module includes a data object checking module, a json file storage module, a csv file storage module, a MySQL storage module, and a MongoDB storage module.
A flowchart of the crawler method based on the above universal low-code crawler system is shown in fig. 1; it includes the following steps:
step 1: manually creating json configuration files of all websites to be crawled and filling in all configuration items, wherein the configuration files comprise basic configuration, storage configuration, configuration of all data item selectors and other configuration; the basic configuration comprises information configuration such as name, type, coding format and home page address of a website, and the website type comprises a dynamic website and a static website; the storage configuration comprises a data storage mode, a local storage path, configuration of various information of a database and configuration of a storage directory of a log and a progress; each data item selector is configured with a classified navigation link, an article link, article content, an article abstract, an article title, an article date and a data extraction expression type of an article author, and specifically has a cs data extraction function and an xpath data extraction function, wherein the cs data extraction function is concise in writing and faster in speed, but the functions are not strong enough, the xpath data extraction function is stronger and can be used for positioning complex content, but the writing method is more complex and the performance is slightly slower; the other configurations comprise network request frequency, maximum data crawling amount of single-time operation, multi-classification navigation concurrent crawling configuration items, plain text time format conversion configuration, and paging parameter configuration, page turning operation configuration and starting page numbers of article list pages, wherein the multi-classification navigation concurrent crawling configuration items comprise whether the maximum task concurrency number of the multi-classification navigation concurrent crawling and the classification navigation is started or not;
the specific description of each configuration item in the configuration file refers to the following table:
Figure BDA0003454278490000061
Figure BDA0003454278490000071
Figure BDA0003454278490000081
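Since the configuration table is reproduced only as images, the configuration sketch below is a purely hypothetical illustration assembled from the items described above, shown as a JavaScript object so it can be annotated; every field name is an assumption, not the patent's actual schema. The type field selects the static (axios) or dynamic (puppeteer) loading path, and the selector values show that css and xpath expressions can be mixed:

```javascript
// Hypothetical per-website configuration (the patent stores it as a json file).
const exampleConfig = {
  name: 'example-news',
  type: 'static',                      // 'static' (axios) or 'dynamic' (puppeteer)
  encoding: 'utf-8',
  homeUrl: 'https://news.example.com/',
  storage: {
    mode: 'mysql',                     // json | csv | mysql | mongodb
    localPath: './output/',
    db: { host: '127.0.0.1', port: 3306, database: 'news' },
    logDir: './logs/',
    progressDir: './progress/',
  },
  selectors: {
    navLink: '.nav a',                 // css expression
    articleLink: "//div[@class='list']//a/@href", // xpath expression
    title: 'h1.title',
    content: '.article-body',
    abstract: '.abstract',
    date: '.publish-time',
    author: '.author',
  },
  requestIntervalMs: 1000,             // network request frequency
  maxArticlesPerRun: 500,              // maximum data crawled in a single run
  concurrentNav: { enabled: true, maxTasks: 3 }, // multi-category-navigation concurrent crawling
  pagination: { param: 'page', startPage: 1 },   // article list paging
};
```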
In addition, according to actual requirements, a custom processing code segment written in JavaScript can be added to a designated code file in the crawler code file directory, following the system naming convention. Specifically, a custom processing function is inserted whose parameters are the current webpage html text and the currently extracted data object; here a developer can add custom functions such as picture downloading, data item filtering and screening, and data formatting, and the function returns the custom-processed data object at its end.
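Following that description, a custom processing code segment might look like the sketch below; only the (html text, data object) parameters and the returned data object come from the description, while the file placement, function name, and the two example operations are assumptions:

```javascript
// Hypothetical custom processing code segment. Per the description, the
// function receives the current page's html text and the extracted data
// object, and returns the (possibly modified) data object at its end.
module.exports = function customProcess(html, data) {
  // example: custom filtering, dropping very short articles
  if (data.content && data.content.length < 100) {
    data.invalid = true; // hypothetical flag consumed by the validation step
  }
  // example: data formatting, trimming whitespace from the title
  if (data.title) {
    data.title = data.title.trim();
  }
  return data; // return the custom-processed data object
};
```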
Step 2: starting operation, selecting an operation mode of single-website crawling, multi-website serial crawling or multi-website concurrent crawling, determining a target website in the websites to be crawled, loading a json configuration file of the created target website by a configuration loading module, checking whether each configuration item in the json configuration file is complete and meets the requirement, and prompting error information if the configuration items in the json configuration file do not meet the requirement; initializing and starting other modules of the system to operate based on each configuration item after the verification is successful, and completing the I/O initialization operations such as database connection or local file read-write flow creation, crawling progress reading and the like;
when the selected operation mode is single-website crawling, determining a target website through command line interaction; when the selected operation mode is multi-website serial crawling, all websites to be crawled are sequentially used as target websites; when the selected operation mode is multi-website concurrent crawling, automatically determining the number of task branches which are performed simultaneously according to the system resource load condition, independently performing target website crawling on each task branch simultaneously under the scheduling and cooperation of the asynchronous multi-task management module, and crawling a next target website after the crawling of a previous target website of each task branch is completed until all websites to be crawled are used as target websites;
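The multi-website concurrent mode can be pictured as a simple promise pool over node.js's asynchronous scheduling. In this sketch, deriving the branch count from the CPU count stands in for the system resource load measurement, which the patent does not specify:

```javascript
const os = require('os');

// Sketch of multi-website concurrent crawling: a fixed number of task
// branches drain a shared queue of target websites.
async function crawlAllConcurrently(siteNames, crawlSite) {
  const branches = Math.max(1, Math.min(siteNames.length, os.cpus().length));
  const queue = [...siteNames];
  // each branch crawls its next target website only after finishing the previous one
  const workers = Array.from({ length: branches }, async () => {
    while (queue.length > 0) {
      const site = queue.shift(); // safe: node.js runs this code single-threaded
      await crawlSite(site);
    }
  });
  await Promise.all(workers);
}
```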
Step 3: requesting and acquiring the home page of the target website, and extracting all category navigation links and corresponding category names from the navigation bar. The specific process is:
Step 3.1: the home page address of the target website filled in the configuration file is given as input to the page resource loading module, which, according to the configured website type, selects the puppeteer-based dynamic page loading module or the axios-based static page loading module to send a request to the target website; the status code and webpage text of the response are verified, and if the status code is 200, the html of the target website home page is acquired successfully. If the request fails or the status code is not 200, the request is resent at most 3 times; if it still fails, the crawling of the target website is stopped, and a corresponding error message is output and recorded into the local log file by the log and progress management module;
step 3.2: converting the html target website home page into a target website structured object containing each element node of the target website home page;
step 3.3: and a selector type identification module in the data extraction module identifies the data extraction expression type of the configured classified navigation links by a regular matching method, selects a cs-based data extraction module or an xpath-based data extraction module according to the identification result of the selector type, and extracts all the classified navigation links and the corresponding category names of the navigation bar from the structured object of the target website.
Step 4: for each category navigation link, requesting and acquiring the corresponding article list page, extracting all article links, and adding them to the to-be-crawled list. The specific process is:
Step 4.1: each category navigation link is given as input to the page resource loading module, which, according to the configured website type, selects the puppeteer-based dynamic page loading module or the axios-based static page loading module to send a request to the article list page the link points to; the status code and webpage text of the response are verified, and if the status code is 200, the article list page is acquired successfully. If the request fails or the status code is not 200, the request is resent at most 3 times; if it still fails, the crawling of that article list page ends, and a corresponding error message is output and recorded into the local log file by the log and progress management module;
Step 4.2: converting the article list page into an article-list-page structured object containing each of its element nodes;
Step 4.3: the selector type identification module identifies the data extraction expression type configured for the article links by regular-expression matching, selects the css-based or xpath-based data extraction module according to the identified selector type, and extracts all article URL links from the article-list-page structured object. The URL intelligent splicing processing module then judges whether each extracted article URL is complete, performs intelligent splicing completion if it is not, and adds the completed article URLs to the to-be-crawled list.
Step 5: requesting and acquiring the article webpage resource corresponding to each article URL in the to-be-crawled list, and extracting the various information items from it as the crawled data. The specific process is:
Step 5.1: the article URL is given as input to the page resource loading module, which, according to the configured website type, selects the puppeteer-based dynamic page loading module or the axios-based static page loading module to send a request to the article webpage resource the link points to; the status code and webpage text of the response are verified, and if the status code is 200, the article webpage resource is acquired successfully. If the request fails or the status code is not 200, the request is resent at most 3 times; if it still fails, the current article link is skipped and crawling continues with the article webpage resource pointed to by the next article link in the to-be-crawled list, while a corresponding error message is output and recorded into the local log file by the log and progress management module;
Step 5.2: converting the article webpage resource into an article-webpage-resource structured object containing each of its element nodes;
Step 5.3: the selector type identification module identifies the data extraction expression types configured for the article content, article abstract, article title, article date, and article author by regular-expression matching, selects the css-based or xpath-based data extraction module according to the identified selector type, and extracts the corresponding article content, abstract, title, date, and author from the article webpage resource; the data object generation module then fuses them into a news data object to be stored as the crawled data.
During this fusion, if information is missing from the webpage resource, the data object generation module fills in default values; when the article date is missing, it is filled with the time at which the current article webpage resource was acquired, and the article date field is formatted according to the date format specified in the configuration file.
If a custom processing code segment was added in step 1, the selected data extraction module executes it here; the functions it performs include additional information extraction, format conversion, secondary processing, or custom filtering and screening.
Step 6: the method comprises the following steps of intelligently checking the crawled data, filtering abnormal data and storing the abnormal data into a database or a local file, wherein the specific process comprises the following steps:
step 6.1: a data object checking module in the data storage module intelligently checks each field of the crawled data, if the field is missing or abnormal, the data object checking module judges that invalid data is not stored, outputs an invalid information prompt and records a corresponding article link into a local log file through a log and progress management module;
step 6.2: the data object verification module also takes article contents and article titles in the crawled data successfully verified intelligently as a duplication judgment condition, judges whether the same data exist in a database or a local file, and records article links into the local log file through the log and progress management module if the same data exist in the database or the local file; otherwise, storing the crawled data into a json file storage module, a csv file storage module, a MySQL file storage module or a MongoDB file storage module, wherein the json file storage module and the csv file storage module are used for storing local files, and the MySQL file storage module and the MongoDB file storage module are used for storing databases.
Step 7: repeating step 5 and step 6 until all article links in the to-be-crawled list have been crawled, thereby completing the crawling of the target website.
Further, if the target website enables the multi-category-navigation concurrent crawling configuration item, step 4 crawls a plurality of category navigation links concurrently under the scheduling and coordination of the asynchronous multi-task management module.
Furthermore, the universal low-code crawler system records the crawling progress of the target website in a local progress file in real time through the log and progress management module; on the next run, or when rerunning after an abnormal exit, the system reads this progress and resumes the previous crawl.
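The progress file can be pictured as a small json document rewritten after each article, as in the sketch below; the file layout and field names are hypothetical:

```javascript
const fs = require('fs');

// Sketch of progress management: after each article the current position is
// written out, and on startup any existing progress is read back so an
// interrupted crawl resumes where it stopped.
function saveProgress(file, progress) {
  fs.writeFileSync(file, JSON.stringify(progress)); // e.g. { site, navIndex, articleIndex }
}

function loadProgress(file) {
  return fs.existsSync(file)
    ? JSON.parse(fs.readFileSync(file, 'utf-8'))
    : { navIndex: 0, articleIndex: 0 }; // no previous run: start from the beginning
}
```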

Claims (10)

1. A universal low-code crawler method for news blog websites, characterized by comprising the following steps:
Step 1: creating a configuration file for each website to be crawled and filling in its configuration items;
Step 2: starting operation, selecting an operation mode, determining a target website among the websites to be crawled, loading each configuration item of the target website, and verifying the configuration;
Step 3: requesting and acquiring the home page of the target website, and extracting all category navigation links and corresponding category names from the navigation bar;
Step 4: for each category navigation link, requesting and acquiring the corresponding article list page, extracting all article links, and adding them to the to-be-crawled list;
Step 5: requesting and acquiring the webpage resource corresponding to each article link in the to-be-crawled list, and extracting the various information items from the webpage resource according to the configuration file as the crawled data;
Step 6: intelligently checking the crawled data, filtering out abnormal data, and storing the valid data into a database or a local file;
Step 7: repeating step 5 and step 6 until all article links in the to-be-crawled list have been crawled, thereby completing the crawling of the target website.
2. The universal low-code crawler method for news blog websites according to claim 1, wherein the configuration file created in step 1 comprises basic configuration, storage configuration, configuration of each data item selector, and other configurations; the basic configuration comprises the website's name, type, encoding format, and home page address, the website type being dynamic or static; the storage configuration comprises the data storage mode, local storage path, database connection information, and the storage directories for logs and progress; each data item selector specifies the data extraction expression type, either a css data extraction function or an xpath data extraction function, for the category navigation links, article links, article content, article abstract, article title, article date, and article author; the other configurations comprise the network request frequency, the maximum amount of data crawled in a single run, the multi-category-navigation concurrent crawling configuration item, and the paging parameter configuration, page turning configuration, and starting page number of the article list pages.
3. The universal low-code crawler method for news blog websites according to claim 1, wherein the operation mode in step 2 comprises single-website crawling, multi-website serial crawling, and multi-website concurrent crawling; when the selected operation mode is single-website crawling, step 2 determines the target website through command line interaction; when it is multi-website serial crawling, step 2 takes each website to be crawled as the target website in turn; when it is multi-website concurrent crawling, step 2 automatically determines the number of simultaneous task branches according to the system resource load, each task branch crawling its next target website only after finishing its previous one, until every website to be crawled has served as a target website.
4. The universal low-code crawler method for news blog websites according to claim 2, wherein the specific process of step 3 is:
Step 3.1: according to the home page address and website type of the target website, sending a request to the target website and verifying the status code and webpage text of the response; if the status code is 200, the home page of the target website is acquired successfully; if the request fails or the status code is not 200, the request is resent at most 3 times, and if it still fails, the crawling of the target website is stopped while a corresponding error message is output and recorded in the log;
Step 3.2: converting the home page of the target website into a target-website structured object containing each element node of the home page;
Step 3.3: extracting all category navigation links and corresponding category names of the navigation bar from the target-website structured object, based on the data extraction expression type of the category navigation links.
5. The universal low-code crawler method for news blog websites according to claim 2, wherein the specific process of step 4 is:
Step 4.1: for each category navigation link, according to the link and the website type, sending a request to the article list page it points to and verifying the status code and webpage text of the response; if the status code is 200, the article list page is acquired successfully; if the request fails or the status code is not 200, the request is resent at most 3 times, and if it still fails, the crawling of that article list page ends while a corresponding error message is output and recorded in the log;
Step 4.2: converting the article list page into an article-list-page structured object containing each element node of the page;
Step 4.3: based on the data extraction expression type of the article links, extracting all article links from the article-list-page structured object, judging whether each extracted link is complete, performing intelligent splicing completion if it is not, and adding the completed article links to the to-be-crawled list.
6. The universal low-code crawler method for news blog websites according to claim 1, wherein the configuration file created in step 1 further comprises a multi-category-navigation concurrent crawling configuration item, and if the target website enables this item, step 4 crawls a plurality of category navigation links concurrently.
7. The universal low-code crawler method for news blog websites according to claim 2, wherein the specific process of step 5 is:
Step 5.1: according to the article link and the website type, sending a request to the article webpage resource the link points to and verifying the status code and webpage text of the response; if the status code is 200, the article webpage resource is acquired successfully; if the request fails or the status code is not 200, the request is resent at most 3 times, and if it still fails, the current article link is skipped and crawling continues with the article webpage resource pointed to by the next article link in the to-be-crawled list, while a corresponding error message is output and recorded in the log;
Step 5.2: converting the article webpage resource into an article-webpage-resource structured object containing each element node of the resource;
Step 5.3: based on the data extraction expression types of the article content, article abstract, article title, article date, and article author, extracting the corresponding article content, abstract, title, date, and author from the structured object as the crawled data; if information is missing from the webpage resource, default values are filled in, and a missing article date is filled with the time at which the current article webpage resource was acquired.
8. The universal low-code crawler method for news blog websites according to claim 2, wherein the specific process of step 6 is:
Step 6.1: intelligently checking each field of the crawled data; if a field is missing or abnormal, the data is judged invalid and not stored, an invalidity message is output, and the corresponding article link is recorded in the log;
Step 6.2: taking the article content and article title of the crawled data that passed the intelligent check as the duplication criterion, judging whether identical data already exist in the database or local file; if so, the article link is recorded in the log; otherwise, the crawled data is stored into the database or local file.
9. The universal low-code crawler method for news blog websites according to claim 1, wherein in step 1 a custom processing code segment is added, according to actual requirements, to a designated code file in the crawler code file directory, and the custom processing code segment is executed after the crawled data is extracted in step 5; the functions executed by the custom processing code segment comprise additional information extraction, format conversion, secondary processing, or custom filtering and screening.
10. A universal low-code crawler system for news blog websites, characterized by comprising a configuration loading module, a page resource loading module, a data extraction module, a data storage module, an asynchronous multi-task management module, and a log and progress management module;
the configuration loading module is used for determining a target website according to the selected operation mode, loading the created configuration file of the target website, verifying the configuration, initializing based on each configuration item after verification succeeds, and starting the other modules of the system;
the page resource loading module is used for selecting the corresponding page loading function based on the configured website type, according to the input target-website home page address, category navigation link, or article link, and loading the corresponding page resource;
the data extraction module is used for identifying the configured data extraction expression type of the category navigation links, article content, article abstract, article title, article date, or article author, selecting the css data extraction function or the xpath data extraction function according to the identification result, and extracting the corresponding data information from each page resource;
the data storage module is used for intelligently verifying the data information extracted from the page resource corresponding to each article link and, after verification succeeds, checking the article for duplication and storing the data information into a database or a local file according to the configured data storage mode, local storage path, or database connection information;
the log and progress management module records, in a local log file under the configured log and progress storage directory, error messages for failed page resource requests and for article links that fail the intelligent verification or are duplicates, and records the crawling progress of the target website in a local progress file in real time so that the previous crawling progress can be resumed on the next run or when rerunning after an abnormal exit;
the asynchronous multi-task management module is used for scheduling based on node.js's asynchronous processing capability and cooperating with the other modules to complete multi-website concurrent crawling or multi-category-navigation concurrent crawling.
CN202210001246.5A 2022-01-04 2022-01-04 General low-code crawler method and system for news blog websites Pending CN114491206A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210001246.5A CN114491206A (en) 2022-01-04 2022-01-04 General low-code crawler method and system for news blog websites

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210001246.5A CN114491206A (en) 2022-01-04 2022-01-04 General low-code crawler method and system for news blog websites

Publications (1)

Publication Number Publication Date
CN114491206A true CN114491206A (en) 2022-05-13

Family

ID=81509242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210001246.5A Pending CN114491206A (en) 2022-01-04 2022-01-04 General low-code crawler method and system for news blog websites

Country Status (1)

Country Link
CN (1) CN114491206A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902220A (en) * 2019-02-27 2019-06-18 腾讯科技(深圳)有限公司 Webpage information acquisition methods, device and computer readable storage medium
CN109829096A (en) * 2019-03-15 2019-05-31 北京金山数字娱乐科技有限公司 A kind of collecting method, device, electronic equipment and storage medium
CN110597981A (en) * 2019-09-16 2019-12-20 西华大学 Network news summary system for automatically generating summary by adopting multiple strategies
CN111723265A (en) * 2020-07-01 2020-09-29 杭州叙简科技股份有限公司 Extensible news website universal crawler method and system
CN113626674A (en) * 2021-08-03 2021-11-09 杭州隆埠科技有限公司 News collecting system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114816373A (en) * 2022-06-30 2022-07-29 金现代信息产业股份有限公司 Instant error prompt method and system for low-code development platform
CN114860673A (en) * 2022-07-06 2022-08-05 南京聚铭网络科技有限公司 Log feature identification method and device based on dynamic and static combination
CN117573959A (en) * 2023-10-17 2024-02-20 北京国科众安科技有限公司 General method for obtaining news text based on web page xpath
CN117573959B (en) * 2023-10-17 2024-04-05 北京国科众安科技有限公司 General method for obtaining news text based on web page xpath

Similar Documents

Publication Publication Date Title
CN114491206A (en) General low-code crawler method and system for news blog websites
CN111832236B (en) Chip regression testing method and system, electronic equipment and storage medium
US9298680B2 (en) Display of hypertext documents grouped according to their affinity
CN106126648B A distributed commodity news crawler method based on redo logs
CN105243159A (en) Visual script editor-based distributed web crawler system
CN114077534B (en) Test case generation method, device and computer readable storage medium
CN111813443B (en) Method and tool for automatically filling code sample by using Java FX
Chasins et al. Skip blocks: reusing execution history to accelerate web scripts
CN113901169A (en) Information processing method, information processing device, electronic equipment and storage medium
Jahanbin et al. Intelligent run-time partitioning of low-code system models
CN116088846A (en) Processing method, related device and equipment for continuous integrated code format
CN112328246A (en) Page component generation method and device, computer equipment and storage medium
CN110297960A (en) A kind of distributed DOC DATA acquisition system based on configuration
CN113918460A (en) Page testing method, device, equipment and medium
Bowie Applications of graph theory in computer systems
CN115454382A (en) Demand processing method and device, electronic equipment and storage medium
CN114596070A (en) Product optimization design platform construction method based on knowledge graph
CN113094122A (en) Method and device for executing data processing script
CN116010452A (en) Industrial data processing system and method based on stream type calculation engine and medium
CN112422707A (en) Domain name data mining method and device and Redis server
Bernardi et al. The re-uwa approach to recover user centered conceptual models from web applications
Zhu et al. A supporting tool for syntactic analysis of SOFL formal specifications and automatic generation of functional scenarios
Di Lucca et al. Web pages classification using concept analysis
JP3452960B2 (en) Object type model creation system and fault diagnosis system creation system
Nagarajan et al. VISTA--a visual interface for software reuse in TROMLAB environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination