CN112487268A - Data crawling implementation method based on distributed crawler technology - Google Patents

Info

Publication number
CN112487268A
Authority
CN
China
Prior art keywords
data
website
captured
page
crawling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011474920.9A
Other languages
Chinese (zh)
Inventor
陈绪龙
王军凯
卢文琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cnbisoft Software Technology Co ltd
Original Assignee
Anhui Cnbisoft Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cnbisoft Software Technology Co ltd filed Critical Anhui Cnbisoft Software Technology Co ltd
Priority to CN202011474920.9A priority Critical patent/CN112487268A/en
Publication of CN112487268A publication Critical patent/CN112487268A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a data crawling implementation method based on distributed crawler technology, relating to the technical field of data crawling, and comprising the following steps: S1, specify a URL, and starting from the given address find the address at which the data to be crawled is finally located and obtain the corresponding index code; S2, initiate a request, splice the URL from the code obtained in S1, and judge whether the resulting URL points to the required data; if it is a data page, call back to the detail page, and if not, continue searching for the data URL in a loop. The method uses focused crawling with Scrapy, applies regular expressions and JSON data during crawling, handles data at several frequencies such as year, quarter and month, updates incrementally, and inserts into the Oracle database in batches. Content is processed and filtered while pages are crawled, so that as far as possible only web page information relevant to the requirement is crawled, and regular expressions make it easy to check whether a string matches a given pattern.

Description

Data crawling implementation method based on distributed crawler technology
Technical Field
The invention relates to the technical field of data crawling, in particular to a data crawling implementation method based on a distributed crawler technology.
Background
Several types of crawler are common in the prior art: general, focused and incremental crawlers. Several data parsing methods are commonly used with them, such as bs4 and XPath.
However, a general crawler only provides text-related content (HTML, Word, PDF) and cannot provide multimedia files (music, pictures, video) or binary files (programs, scripts). The results it provides are uniform: it cannot give different search results to people from different backgrounds and fields, and it cannot understand human semantic queries. In addition, bs4 can only parse data in HTML format.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a data crawling implementation method based on distributed crawler technology, which solves the problems described in the Background.
To achieve this purpose, the invention is realized through the following technical scheme: a data crawling implementation method based on distributed crawler technology, comprising the following steps:
S1, specify a URL: starting from the given address, find the address at which the data to be crawled is finally located, and obtain the corresponding index code;
S2, initiate a request: splice the URL from the code obtained in S1 and judge whether the resulting URL points to the required data; if it is a data page, call back to the detail page; if not, continue searching for the data URL in a loop;
S3, acquire data: splice the corresponding parameters into the URL and extract the data with regular expressions; add a date parameter to the URL, add a region parameter for regional data, infer the corresponding data type from start_urls, and decide which crawling process to follow;
S4, parse data: convert the string into JSON-format data with json.loads, obtain the units and regions in a loop, and store them in dictionary form, which solves the problem that the data, units and regions are not delivered together and cannot be matched up normally;
S5, persistent storage: crawl with the Scrapy framework, and define the fields to be used and the corresponding classes in item.
In the above data crawling implementation method based on distributed crawler technology, preferably, in S1, clicking options such as month, quarter and year on the national website shows that the data changes accordingly. In the browser's Network Preview panel, the first-level iteration by industry code shows that when isParent is true the entry is only an index code loaded on the page, not real data; when isParent is false the entry is a leaf node, that is, the final data of each industry. The address above is the starting address, from which the corresponding index code is obtained to facilitate URL splicing for the next data crawl.
In the above data crawling implementation method based on distributed crawler technology, preferably, in S3, the data is all in JSON format, but the extraction rules are not completely the same. URLs for the different types of data are passed in at the start, the type of data being crawled is inferred with a regular expression, and a different code branch is taken accordingly. The data for the different regions is crawled in advance, stored in dictionary form, and passed into the URL as parameters. In this way different types of data take different branches while the shared parts are not rewritten.
In the above data crawling implementation method based on distributed crawler technology, preferably, in S4, the data obtained in each pass of the loop is stored as a tuple and appended to a list, and the batch is yielded outside the loop. When there is pagination, the next page cannot be crawled normally unless duplicate filtering is disabled, so the parameter dont_filter needs to be set to True.
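The leaf-node search described for S1 can be sketched as a recursive walk. This is a minimal illustration, not the patented implementation: only the isParent flag comes from the text, while the `id` and `children` keys, the sample codes, and the in-memory tree are assumptions (in a real crawl the children would more likely be fetched by a further request).

```python
from typing import Iterator


def iter_leaf_codes(node: dict) -> Iterator[str]:
    """Yield the index code of every leaf node beneath `node`.

    Assumed node layout (hypothetical): {"id": str, "isParent": bool, "children": [...]}.
    """
    if not node.get("isParent", False):
        # isParent false: a leaf node, i.e. the final data of an industry.
        yield node["id"]
        return
    # isParent true: only an index code loaded on the page, so keep descending.
    for child in node.get("children", []):
        yield from iter_leaf_codes(child)


# Sample tree with made-up codes.
tree = {
    "id": "A01", "isParent": True,
    "children": [
        {"id": "A0101", "isParent": False},
        {"id": "A0102", "isParent": True,
         "children": [{"id": "A010201", "isParent": False}]},
    ],
}
leaf_codes = list(iter_leaf_codes(tree))
```

The collected leaf codes are the values that would then be spliced into the data URL in S2 and S3.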
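The branching described for S3 might look like the following sketch. The URL layout is an assumption made for illustration: the `kind`, `code`, `date` and `region` query parameters, the example.com address, and the sample region codes are hypothetical and do not come from the source.

```python
import re
from urllib.parse import urlencode

# Hypothetical URL layout: the data type is carried in a "kind" query parameter.
TYPE_PATTERN = re.compile(r"[?&]kind=(\w+)")

# Region codes collected in advance and kept in a dictionary (sample values).
REGIONS = {"110000": "Beijing", "340000": "Anhui"}


def infer_data_type(start_url: str) -> str:
    """Infer the data type from a start URL with a regular expression."""
    match = TYPE_PATTERN.search(start_url)
    if match is None:
        raise ValueError(f"no data type found in {start_url!r}")
    return match.group(1)


def build_urls(start_url: str, index_code: str, date: str) -> list:
    """Splice index code, date and, for regional data, a region parameter."""
    base = start_url.split("?", 1)[0]
    data_type = infer_data_type(start_url)
    params = {"kind": data_type, "code": index_code, "date": date}
    if data_type == "regional":
        # Regional branch: one URL per region from the pre-collected dictionary.
        return [f"{base}?{urlencode({**params, 'region': r})}" for r in REGIONS]
    # National branch: a single URL built by the same splicing code.
    return [f"{base}?{urlencode(params)}"]


regional_urls = build_urls("https://example.com/query?kind=regional", "A0101", "2020")
national_urls = build_urls("https://example.com/query?kind=national", "A0101", "2020")
```

Both branches share the parameter-splicing code, which is the point the text makes about not rewriting the common parts.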
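The tuple-and-list batching in S4 can be illustrated with a plain generator. The record keys (`code`, `period`, `value`) are assumptions chosen for the example; the source does not name the fields.

```python
from typing import Iterator


def parse_rows(records: list) -> Iterator[list]:
    """Collect one tuple per record, then yield the whole batch once."""
    batch = []
    for record in records:
        # Each pass of the loop stores one row as a tuple and appends it.
        batch.append((record["code"], record["period"], record["value"]))
    # Yielding outside the loop hands downstream one batch instead of single rows.
    if batch:
        yield batch


sample = [
    {"code": "A", "period": "2020", "value": 1.0},
    {"code": "B", "period": "2020", "value": 2.0},
]
batches = list(parse_rows(sample))
```

For the pagination case, Scrapy's `Request` accepts a `dont_filter=True` argument, which tells the scheduler not to drop the request as a duplicate; a next-page request built with it is scheduled even when its URL has been seen before.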
Compared with the prior art, the invention has the following beneficial effects: it uses focused crawling with Scrapy, applies regular expressions and JSON data during crawling, handles data at several frequencies such as year, quarter and month, updates incrementally, and inserts into the Oracle database in batches. Content is processed and filtered while pages are crawled, so that as far as possible only web page information relevant to the requirement is crawled, and regular expressions make it easy to check whether a string matches a given pattern.
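Batch insertion with incremental updating can be sketched as below. To keep the example runnable without a database server, it swaps in Python's built-in sqlite3 module as a stand-in for Oracle; the table name, columns and sample rows are hypothetical, and the SQL dialect differs from what an Oracle deployment would use.

```python
import sqlite3

rows = [("A0101", "2020", 1.5), ("A0102", "2020", 2.5)]

conn = sqlite3.connect(":memory:")  # stand-in for the Oracle connection
conn.execute(
    "CREATE TABLE indicator (code TEXT, period TEXT, value REAL, "
    "PRIMARY KEY (code, period))"
)


def insert_batch(connection, batch):
    """Insert a batch in one call; rows already stored are skipped."""
    # INSERT OR IGNORE is SQLite syntax; Oracle would typically use MERGE.
    connection.executemany(
        "INSERT OR IGNORE INTO indicator (code, period, value) VALUES (?, ?, ?)",
        batch,
    )
    connection.commit()


insert_batch(conn, rows)
# A later run re-sends an old row plus one new row; only the new one is added.
insert_batch(conn, [("A0101", "2020", 1.5), ("A0101", "2021", 1.7)])
row_count = conn.execute("SELECT COUNT(*) FROM indicator").fetchone()[0]
```

Sending the rows through a single `executemany` call avoids one round trip per row, and the primary key on code and period is what makes a repeated run incremental.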
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides the following technical scheme: a data crawling implementation method based on distributed crawler technology. It uses focused crawling (a web crawler program oriented to the needs of a specific topic; a focused crawler processes and filters content while crawling pages, so that as far as possible only web page information relevant to the requirement is crawled), Scrapy, regular expressions during crawling (a regular expression is a special character sequence that makes it easy to check whether a string matches a given pattern) and JSON data; the data frequencies include year, quarter and month, updates are incremental, and data is inserted into the Oracle database in batches. The method comprises the following steps:
S1, specify a URL: starting from the given address, find the address at which the data to be crawled is finally located, and obtain the corresponding index code;
S2, initiate a request: splice the URL from the code obtained in S1 and judge whether the resulting URL points to the required data; if it is a data page, call back to the detail page; if not, continue searching for the data URL in a loop;
S3, acquire data: splice the corresponding parameters into the URL and extract the data with regular expressions; add a date parameter to the URL, add a region parameter for regional data, infer the corresponding data type from start_urls, and decide which crawling process to follow;
S4, parse data: convert the string into JSON-format data with json.loads, obtain the units and regions in a loop, and store them in dictionary form, which solves the problem that the data, units and regions are not delivered together and cannot be matched up normally;
S5, persistent storage: crawl with the Scrapy framework, and define the fields to be used and the corresponding classes in item.
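The parsing step S4 can be sketched as follows. The response layout is an assumption: the source only says that values, units and regions do not arrive together, so the key names and sample figures here are hypothetical.

```python
import json

# Hypothetical response layout: values, units and regions arrive in separate lists.
raw = """
{
  "units":   [{"code": "A0101", "unit": "10,000 tons"}],
  "regions": [{"code": "340000", "name": "Anhui"}],
  "values":  [{"code": "A0101", "region": "340000", "period": "2020", "value": 12.5}]
}
"""

payload = json.loads(raw)  # string -> Python dict/list structure

# Loop once over units and regions and keep them as lookup dictionaries.
unit_by_code = {u["code"]: u["unit"] for u in payload["units"]}
region_by_code = {r["code"]: r["name"] for r in payload["regions"]}

# Join each value with its unit and region through the dictionaries.
rows = [
    {
        "code": v["code"],
        "region": region_by_code.get(v["region"]),
        "unit": unit_by_code.get(v["code"]),
        "period": v["period"],
        "value": v["value"],
    }
    for v in payload["values"]
]
```

Keeping units and regions in dictionaries means each value row is matched by key lookup rather than by position, which is one way to avoid the mismatch the step describes.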
In S1, clicking options such as month, quarter and year on the national website shows that the data changes accordingly. In the browser's Network Preview panel, the first-level iteration by industry code shows that when isParent is true the entry is only an index code loaded on the page, not real data; when isParent is false the entry is a leaf node, that is, the final data of each industry. The address above is the starting address, from which the corresponding index code is obtained to facilitate URL splicing for the next data crawl.
In S3, the data is all in JSON format, but the extraction rules are not completely the same. URLs for the different types of data are passed in at the start, the type of data being crawled is inferred with a regular expression, and a different code branch is taken accordingly. The data for the different regions is crawled in advance, stored in dictionary form, and passed into the URL as parameters. In this way different types of data take different branches while the shared parts are not repeated.
In S4, the data obtained in each pass of the loop is stored as a tuple and appended to a list, and the batch is yielded outside the loop. When there is pagination, the next page cannot be crawled normally unless duplicate filtering is disabled, so the parameter dont_filter needs to be set to True.
If a return appears in a def, execution jumps out of the def; even where a return is present, the statements under finish continue to be executed. If access is too fast during crawling, the pressure on the other party's server becomes too high, or the program is identified as a crawler, and the crawl ultimately fails. Browser access therefore needs to be simulated: use UA spoofing, add request header information in settings, reduce the request rate with time.sleep(2) and similar measures, and use proxies and the like to simulate access by multiple users. For a full crawl, where the URL cannot carry further parameters, each crawled URL is written to a corresponding txt file; the next time a request is initiated, the txt file is checked to see whether the URL is a repeat: if it is, it is skipped, and if not, it is crawled.
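The throttling and txt-based de-duplication described above can be sketched with the standard library alone. The `fetch` callable, the file name and the User-Agent string are hypothetical; `fetch` stands in for whatever actually issues the request, so the sketch runs without network access.

```python
import time
from pathlib import Path

# Hypothetical header value; any realistic browser User-Agent string would do.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def load_seen(path: Path) -> set:
    """Read previously crawled URLs from the txt file, one per line."""
    if not path.exists():
        return set()
    return set(path.read_text(encoding="utf-8").split())


def crawl_new(urls, path: Path, fetch, delay: float = 2.0) -> list:
    """Fetch only URLs not yet recorded in `path`, pausing between requests."""
    seen = load_seen(path)
    fetched = []
    with path.open("a", encoding="utf-8") as log:
        for url in urls:
            if url in seen:
                continue  # repeated URL: skip it
            fetch(url, HEADERS)  # caller-supplied request function
            log.write(url + "\n")  # record the URL for the next run
            seen.add(url)
            fetched.append(url)
            time.sleep(delay)  # slow the request rate
    return fetched
```

A call such as `crawl_new(urls, Path("seen.txt"), fetch)` uses the 2-second pause mentioned in the text; passing `delay=0` is convenient when trying the function out locally.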
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (4)

1. A data crawling implementation method based on distributed crawler technology, characterized by comprising the following steps:
S1, specify a URL: starting from the given address, find the address at which the data to be crawled is finally located, and obtain the corresponding index code;
S2, initiate a request: splice the URL from the code obtained in S1 and judge whether the resulting URL points to the required data; if it is a data page, call back to the detail page; if not, continue searching for the data URL in a loop;
S3, acquire data: splice the corresponding parameters into the URL and extract the data with regular expressions; add a date parameter to the URL, add a region parameter for regional data, infer the corresponding data type from start_urls, and decide which crawling process to follow;
S4, parse data: convert the string into JSON-format data with json.loads, obtain the units and regions in a loop, and store them in dictionary form, which solves the problem that the data, units and regions are not delivered together and cannot be matched up normally;
S5, persistent storage: crawl with the Scrapy framework, and define the fields to be used and the corresponding classes in item.
2. The data crawling implementation method based on distributed crawler technology as claimed in claim 1, wherein: in S1, clicking options such as month, quarter and year on the national website shows that the data changes accordingly; in the browser's Network Preview panel, the first-level iteration by industry code shows that when isParent is true the entry is only an index code loaded on the page, not real data, and when isParent is false the entry is a leaf node, that is, the final data of each industry; the address above is the starting address, from which the corresponding index code is obtained to facilitate URL splicing for the next data crawl.
3. The data crawling implementation method based on distributed crawler technology as claimed in claim 1, wherein: in S3, the data is all in JSON format, but the extraction rules are not completely the same; URLs for the different types of data are passed in at the start, the type of data being crawled is inferred with a regular expression, and a different code branch is taken accordingly; the data for the different regions is crawled in advance, stored in dictionary form, and passed into the URL as parameters, so that different types of data take different branches while the shared parts are not rewritten.
4. The data crawling implementation method based on distributed crawler technology as claimed in claim 3, wherein: in S4, the data obtained in each pass of the loop is stored as a tuple and appended to a list, and the batch is yielded outside the loop; when there is pagination, the next page cannot be crawled normally unless duplicate filtering is disabled, so the parameter dont_filter needs to be set to True.
CN202011474920.9A 2020-12-14 2020-12-14 Data crawling implementation method based on distributed crawler technology Pending CN112487268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011474920.9A CN112487268A (en) 2020-12-14 2020-12-14 Data crawling implementation method based on distributed crawler technology

Publications (1)

Publication Number Publication Date
CN112487268A true CN112487268A (en) 2021-03-12

Family

ID=74917079

Country Status (1)

Country Link
CN (1) CN112487268A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040143787A1 (en) * 2002-06-19 2004-07-22 Constantine Grancharov Method and system for resolving universal resource locators (URLs) from script code
CN104978408A (en) * 2015-08-05 2015-10-14 许昌学院 Berkeley DB database based topic crawler system
CN105608134A (en) * 2015-12-18 2016-05-25 盐城工学院 Multithreading-based web crawler system and web crawling method thereof
CN109359231A (en) * 2017-12-29 2019-02-19 广州Tcl智能家居科技有限公司 A kind of information crawler method, server and the storage medium of distributed network crawler
CN109710830A (en) * 2018-12-28 2019-05-03 四川新网银行股份有限公司 A kind of distributed network crawler method and system based on browser plug-in

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
OSC_LU3HKB21: ""爬虫基础"", 《开源中国》, 19 March 2019 (2019-03-19), pages 1 - 38 *
中国税务学会: "《大数据技术与税收应用》", 30 June 2018, 北京:中国税务出版社, pages: 98 - 108 *
梦中泪: ""Python聚焦爬虫数据解析"", 《博客园》, 26 April 2020 (2020-04-26), pages 1 - 22 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113485607A (en) * 2021-07-07 2021-10-08 中国搜索信息科技股份有限公司 Public general video capturing method
CN113485607B (en) * 2021-07-07 2022-02-15 中国搜索信息科技股份有限公司 Public general video capturing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210312)