CN112487268A - Data crawling implementation method based on distributed crawler technology

Info

Publication number: CN112487268A
Application number: CN202011474920.9A
Authority: CN
Inventors: 陈绪龙; 王军凯; 卢文琳
Original assignee: Anhui Cnbisoft Software Technology Co ltd
Current assignee: Anhui Cnbisoft Software Technology Co ltd
Priority date: 2020-12-14
Filing date: 2020-12-14
Publication date: 2021-03-12

Abstract

The invention discloses a data crawling implementation method based on distributed crawler technology, which relates to the technical field of data crawling and comprises the following steps: S1, specifying a URL: starting from the given address, finding the address that ultimately holds the data to be captured, and obtaining the corresponding index code; S2, initiating a request: splicing the URL from the codes obtained in S1 and judging whether the resulting URL is a page containing the required data; if it is a data page, calling back to the detail page, and if not, continuing to search for the data URL in a loop. The method adopts focused crawling with Scrapy, and uses regular expressions and JSON data in the crawling process. The data come at several frequencies, such as year, month and quarter, are updated incrementally, and are inserted into an Oracle database in batches. Content is processed and screened while web pages are crawled, so that as far as possible only web page information relevant to the requirements is captured, and regular expressions make it convenient to check whether a character string matches a given pattern.

Description

Data crawling implementation method based on distributed crawler technology

Technical Field

The invention relates to the technical field of data crawling, in particular to a data crawling implementation method based on a distributed crawler technology.

Background

Several types of crawlers are common in the prior art: general, focused and incremental crawlers. Several data parsing methods are commonly used with them, such as bs4 and XPath.

However, a general crawler only provides text-related content (HTML, Word, PDF) and cannot provide multimedia files (music, pictures, video) or binary files (programs, scripts). The results it provides are uniform: it cannot provide different search results for people from different backgrounds and fields, and it cannot understand human semantic retrieval. In addition, bs4 can only parse data in HTML format.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a data crawling implementation method based on a distributed crawler technology, and solves the problems mentioned in the background technology.

In order to achieve the purpose, the invention is realized by the following technical scheme: a data crawling implementation method based on a distributed crawler technology comprises the following steps:

S1, specifying a URL: starting from the given address, finding the address that ultimately holds the data to be captured, and obtaining the corresponding index code;

S2, initiating a request: splicing the URL from the codes obtained in S1 and judging whether the resulting URL is a page containing the required data; if it is a data page, calling back to the detail page, and if not, continuing to search for the data URL in a loop;

S3, acquiring data: splicing the corresponding parameters into the URL; for data extracted at regular intervals, adding a date parameter to the URL, and for regional data, adding a region parameter; inferring the corresponding data type from start_urls and deciding which crawling process to follow;

S4, parsing data: converting the character string into JSON-format data using json.loads, obtaining the units and regions in a loop, and storing them in dictionary form, which solves the problem that the data, units and regions are not delivered together and cannot be matched up normally;

S5, persistent storage: crawling with the Scrapy framework, and defining the fields to be used and the corresponding classes in item.

In the above data crawling implementation method based on distributed crawler technology, preferably, in S1, clicking options such as month, quarter and year on the nation website shows that the data change accordingly. In the browser's network preview, the index tree is iterated level by level according to the industry code: when isParent is true, the node only carries index codes loaded on the page, not real data; when isParent is false, the node is a leaf node holding the final data of each industry. The data are found at addresses beginning with the above address, and the corresponding index code is obtained from the address, which facilitates URL splicing for the next data capture.

In the above data crawling implementation method based on distributed crawler technology, preferably, in S3, the data are all in JSON format, but the extraction rules are not exactly the same. The URLs of the different data types are each passed in at the start, the type of data to be captured is inferred with a regular expression, and the corresponding code branch is taken. Meanwhile, the data for the different regions are captured in advance, stored in dictionary form, and passed into the URL as parameters. In this way, different types of data take different branches, and the parts they share are not written again.

In the above data crawling implementation method based on distributed crawler technology, preferably, in S4, the data obtained in each pass of the loop are stored as a tuple and appended to a list, so that the batch of data can be yielded outside the loop. When there is pagination, the next page cannot be captured normally unless duplicate filtering is disabled, so the parameter dont_filter needs to be set to True.

Compared with the prior art, the invention has the following beneficial effects: it adopts focused crawling with Scrapy, and uses regular expressions and JSON data in the crawling process; the data come at several frequencies, such as year, month and quarter, are updated incrementally, and are inserted into an Oracle database in batches; content can be processed and screened while web pages are crawled, so that as far as possible only web page information relevant to the requirements is captured; and regular expressions make it convenient to check whether a character string matches a given pattern.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a technical scheme: a data crawling implementation method based on distributed crawler technology, which adopts focused crawling (a focused crawler is a web crawler program oriented to the needs of a specific topic; it processes and screens content while crawling web pages, so that as far as possible only web page information relevant to the requirements is captured) with Scrapy, and uses regular expressions (a regular expression is a special character sequence that makes it convenient to check whether a character string matches a given pattern) and JSON data in the crawling process. The data come at several frequencies, such as year, month and quarter, are updated incrementally, and are inserted into an Oracle database in batches. The method comprises the following steps:

S5, persistent storage: crawling with the Scrapy framework, and defining the fields to be used and the corresponding classes in item.
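The batch insertion and incremental update associated with persistent storage can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: Python's built-in sqlite3 module stands in for the Oracle database, the table name indicator_data, its columns and the helper insert_batch are hypothetical, and the INSERT OR IGNORE statement is SQLite syntax (an Oracle deployment would need its own equivalent).

```python
import sqlite3


def make_demo_connection():
    """Create an in-memory database with a hypothetical target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE indicator_data ("
        " code TEXT, region TEXT, period TEXT, value REAL, unit TEXT,"
        " PRIMARY KEY (code, region, period))"
    )
    return conn


def insert_batch(conn, rows):
    """Insert a batch of (code, region, period, value, unit) tuples in one call.

    Rows whose primary key already exists are skipped, so re-running the
    crawl only adds new periods (a simple form of incremental update).
    """
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO indicator_data "
            "(code, region, period, value, unit) VALUES (?, ?, ?, ?, ?)",
            rows,
        )


conn = make_demo_connection()
insert_batch(conn, [("A0101", "110000", "2020Q3", 1.5, "%"),
                    ("A0101", "110000", "2020Q4", 1.7, "%")])
# A later run re-sends one old row plus one new row; only the new row is added.
insert_batch(conn, [("A0101", "110000", "2020Q4", 1.7, "%"),
                    ("A0101", "110000", "2021Q1", 1.9, "%")])
row_count = conn.execute("SELECT COUNT(*) FROM indicator_data").fetchone()[0]
```

Passing the whole list to executemany sends the batch in one call instead of issuing one insert per row, which is the point of inserting in batches.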

In S1, clicking options such as month, quarter and year on the nation website shows that the data change accordingly. In the browser's network preview, the index tree is iterated level by level according to the industry code: when isParent is true, the node only carries index codes loaded on the page, not real data; when isParent is false, the node is a leaf node holding the final data of each industry. The data are found at addresses beginning with the above address, and the corresponding index code is obtained from the address, which facilitates URL splicing for the next data capture.
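The level-by-level search for leaf nodes in S1 can be illustrated with a short sketch. It assumes that each node is a dictionary with an isParent flag (named in the text) and an "id" field holding the index code (the field name "id" is an assumption); the function collect_leaf_codes, the fetch_children callback and the sample codes are hypothetical and stand in for the real requests to the site.

```python
def collect_leaf_codes(fetch_children, root_code):
    """Walk the index tree and return the codes of all leaf nodes.

    fetch_children(code) is a caller-supplied function returning a list of
    dicts, each with an "id" (index code) and an "isParent" flag.
    """
    leaves = []
    stack = [root_code]
    while stack:
        code = stack.pop()
        for node in fetch_children(code):
            if node["isParent"]:
                stack.append(node["id"])   # category node: descend further
            else:
                leaves.append(node["id"])  # leaf node: real data lives here
    return leaves


# In-memory stand-in for the site's responses (made-up codes).
FAKE_TREE = {
    "root": [{"id": "A01", "isParent": True}, {"id": "A02", "isParent": False}],
    "A01": [{"id": "A0101", "isParent": False}, {"id": "A0102", "isParent": False}],
}

leaf_codes = sorted(collect_leaf_codes(lambda code: FAKE_TREE.get(code, []), "root"))
```

Only the codes collected from nodes with isParent equal to false would then be spliced into the data URLs.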

In S3, the data are all in JSON format, but the extraction rules are not exactly the same. The URLs of the different data types are each passed in at the start, the type of data to be captured is inferred with a regular expression, and the corresponding code branch is taken. Meanwhile, the data for the different regions are captured in advance, stored in dictionary form, and passed into the URL as parameters. In this way, different types of data take different branches, and the parts they share are not repeated.
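A sketch of how the data type might be inferred from a start URL with a regular expression, and how a pre-captured region dictionary might be passed into the URL as a parameter, is given below. The URL layout, the parameter names freq, code, period and region, and the helper names are all assumptions made for illustration; the source does not specify them.

```python
import re
from urllib.parse import urlencode

# Hypothetical convention: the start URL carries the frequency as "freq=...".
TYPE_PATTERN = re.compile(r"[?&]freq=(month|quarter|year)\b")


def infer_data_type(start_url):
    """Return "month", "quarter" or "year" based on the start URL."""
    match = TYPE_PATTERN.search(start_url)
    if match is None:
        raise ValueError(f"cannot infer data type from {start_url!r}")
    return match.group(1)


def build_urls(base_url, index_code, period, regions=None):
    """Splice the index code, date and optional region into request URLs."""
    params = {"code": index_code, "period": period}
    if not regions:
        return [f"{base_url}?{urlencode(params)}"]
    return [f"{base_url}?{urlencode({**params, 'region': region_code})}"
            for region_code in regions]


# Region codes captured in advance and kept in dictionary form (sample values).
REGIONS = {"110000": "Beijing", "310000": "Shanghai"}

data_type = infer_data_type("https://example.com/query?freq=quarter")
urls = build_urls("https://example.com/data", "A0101", "2020Q4", REGIONS)
```

The value returned by infer_data_type would select the code branch, while build_urls is shared by every branch, so the common part is written once.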

In S4, the data obtained in each pass of the loop are stored as a tuple and appended to a list, so that the batch of data can be yielded outside the loop. When there is pagination, the next page cannot be captured normally unless duplicate filtering is disabled, so the parameter dont_filter needs to be set to True.
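The parsing described for S4 can be sketched with the standard library alone. The JSON layout used here (the keys indicators, records, code, unit, region, period and value) and the function parse_response are assumptions for illustration; only the use of json.loads, the dictionary of units, the tuples appended to a list, and the single yield outside the loop follow the text.

```python
import json


def parse_response(body, region_names):
    """Yield one batch (a list of tuples) parsed from a JSON response body."""
    payload = json.loads(body)  # character string -> Python objects
    # Keep units in a dictionary keyed by index code so that each value can
    # be matched with its unit even though they arrive in separate places.
    units = {item["code"]: item["unit"] for item in payload["indicators"]}
    rows = []
    for record in payload["records"]:
        rows.append((
            record["code"],
            region_names.get(record["region"], record["region"]),
            record["period"],
            record["value"],
            units.get(record["code"], ""),
        ))
    yield rows  # a single yield outside the loop hands over the whole batch


SAMPLE = json.dumps({
    "indicators": [{"code": "A0101", "unit": "%"}],
    "records": [
        {"code": "A0101", "region": "110000", "period": "2020Q4", "value": 1.7},
        {"code": "A0101", "region": "310000", "period": "2020Q4", "value": 2.1},
    ],
})

batches = list(parse_response(SAMPLE, {"110000": "Beijing"}))
```

In a Scrapy spider, the request for the next page would typically be built as scrapy.Request(next_url, callback=self.parse, dont_filter=True); that call is not exercised here, since the sketch uses only the standard library.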

When a return statement appears inside a def, execution leaves the def; with yield, by contrast, the statements below it continue to be executed. If access is too fast during crawling, the pressure on the other party's server becomes too high, or the program is identified as a crawler, and the crawl ultimately fails. Browser access therefore needs to be simulated: UA (User-Agent) camouflage is used, request header information is added in settings, time.sleep(2) and similar calls are used to reduce the request speed, and proxies and the like are used to simulate access by multiple users. During a full crawl, when no further parameters can be passed in the URL, each captured URL is written to a corresponding txt file; the next time a request is initiated, the txt file is checked to judge whether the URL is a repeat; if it is, it is skipped, and if not, it is crawled.
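The measures in the preceding paragraph (a browser-like User-Agent header, a pause between requests, and a txt file of already captured URLs) can be sketched as follows. The function names, the User-Agent string and the file name seen.txt are hypothetical, and the fetch step is passed in as a function so that the demonstration performs no network access.

```python
import os
import tempfile
import time
import urllib.request

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"  # example browser-like value


def load_seen(path):
    """Return the set of URLs recorded in the txt file (empty if missing)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def build_request(url):
    """Attach a browser-like User-Agent header to the request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def crawl(urls, seen_path, fetch, delay=2.0):
    """Fetch each URL not yet recorded, pausing `delay` seconds after each."""
    seen = load_seen(seen_path)
    fetched = []
    for url in urls:
        if url in seen:
            continue  # already captured in an earlier run: skip
        fetch(build_request(url))
        with open(seen_path, "a", encoding="utf-8") as handle:
            handle.write(url + "\n")
        seen.add(url)
        fetched.append(url)
        time.sleep(delay)  # slow down to reduce load on the server
    return fetched


# Demonstration with a fake fetch function, so no network access occurs.
seen_file = os.path.join(tempfile.mkdtemp(), "seen.txt")
first = crawl(["https://example.com/a", "https://example.com/b"], seen_file,
              fetch=lambda request: None, delay=0)
second = crawl(["https://example.com/b", "https://example.com/c"], seen_file,
               fetch=lambda request: None, delay=0)
```

On the second run only the URL that is absent from the txt file is requested, which is the skip-if-repeated behaviour described above.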

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A data crawling implementation method based on distributed crawler technology, characterized in that the method comprises the following steps:

2. The data crawling implementation method based on distributed crawler technology as claimed in claim 1, wherein: in S1, clicking options such as month, quarter and year on the nation website shows that the data change accordingly; in the browser's network preview, the index tree is iterated level by level according to the industry code; when isParent is true, the node only carries index codes loaded on the page, not real data; when isParent is false, the node is a leaf node holding the final data of each industry; the data are found at addresses beginning with the above address, and the corresponding index code is obtained from the address, which facilitates URL splicing for the next data capture.

3. The data crawling implementation method based on distributed crawler technology as claimed in claim 1, wherein: in S3, the data are all in JSON format, but the extraction rules are not exactly the same; the URLs of the different data types are each passed in at the start, the type of data to be captured is inferred with a regular expression, and the corresponding code branch is taken; meanwhile, the data for the different regions are captured in advance, stored in dictionary form, and passed into the URL as parameters, so that different types of data take different branches and the parts they share are not written again.

4. The data crawling implementation method based on distributed crawler technology as claimed in claim 3, wherein: in S4, the data obtained in each pass of the loop are stored as a tuple and appended to a list, so that the batch of data can be yielded outside the loop; when there is pagination, the next page cannot be captured normally unless duplicate filtering is disabled, so the parameter dont_filter needs to be set to True.