CN111339388B - Information crawling system - Google Patents


Publication number
CN111339388B
Authority
CN
China
Prior art keywords
crawling
crawled
links
information
intelligent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910510474.3A
Other languages
Chinese (zh)
Other versions
CN111339388A (en)
Inventor
胡崇海
熊友根
王洪涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haitong Securities Co ltd
Original Assignee
Haitong Securities Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haitong Securities Co ltd
Priority to CN201910510474.3A
Publication of CN111339388A
Application granted
Publication of CN111339388B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An information crawling system, comprising: a plurality of intelligent crawling nodes deployed on a plurality of dial-up virtual private servers, the intelligent crawling nodes dynamically switching the IP addresses of the dial-up virtual private servers based on information crawling time and information crawling state; and a resource scheduling system that distributes links to be crawled to the intelligent crawling nodes based on a scheduling algorithm so that the intelligent crawling nodes perform crawling operations, and that receives crawling results from the intelligent crawling nodes. With the technical solution provided by the invention, a stable crawler system can be established, anti-crawling strategies can be handled effectively, and continuous acquisition of data is achieved.

Description

Information crawling system
Technical Field
The invention relates to the technical field of networks, in particular to an information crawling system.
Background
Information crawling is a primary means of acquiring network information, and a large number of business requirements depend on it. In general, to acquire information continuously and stably, a crawler system must be established that can cope effectively with anti-crawling strategies and operate efficiently and stably.
Existing crawler systems are limited by factors such as computing platform and region, and have difficulty breaking through existing anti-crawling defenses. For example, traditional open-source crawler tools such as Crawler4j and WebMagic have difficulty evading anti-crawling detection based on Internet Protocol (IP) address and region, and are easily and quickly blocked by the crawled party. Crawler tools from closed-source vendors are expensive and are poorly suited to the high stability, high customization, and strong monitoring that industrial-grade crawling requires in a complex network environment.
To address this, some crawler systems add an anti-anti-crawling module that crawls information through proxy IPs. However, crawling remains inefficient because it is difficult to obtain a large number of genuinely high-quality IP addresses, and the crawled party (e.g., a website) can easily trace and block the originating IP. In addition, although the crawler proxy service of the existing Scrapy crawling framework provides high-quality IP addresses, it is expensive, its IP addresses are located abroad so access to domestic websites is slow, and it cannot be effectively combined with Scrapy-splash, the component that supports JavaScript (JS) rendering, so dynamically rendered pages cannot be crawled.
How to realize continuous and stable acquisition of data in a complex network environment needs further research.
Disclosure of Invention
The technical problem solved by the invention is how to crawl data so as to cope effectively with anti-crawling strategies and achieve stable acquisition of data.
To solve the above technical problem, an embodiment of the present invention provides an information crawling system, including: a plurality of intelligent crawling nodes deployed on a plurality of dial-up virtual private servers, which dynamically switch the IP addresses of the dial-up virtual private servers based on information crawling time and information crawling state; and a resource scheduling system, which distributes links to be crawled to the intelligent crawling nodes based on a scheduling algorithm so that the intelligent crawling nodes perform crawling operations, and receives crawling results from the intelligent crawling nodes.
Optionally, the intelligent crawling node dynamically opens or closes a web page engine based on the memory capacity of the dial-up virtual private server on which it is deployed and the amount of information crawled.
Optionally, the intelligent crawling node accesses the link to be crawled based on the web page engine, and the web page source code associated with the link to be crawled is dynamically rendered in the web page engine.
Optionally, the crawling result contains a plurality of links to be crawled, and the information crawling system further includes: an information parsing and management system adapted to extract target information from the crawling result and to parse the plurality of links to be crawled from the crawling result.
Optionally, the information crawling system further includes: a URL distribution system adapted to receive the links to be crawled sent by the information parsing and management system and distribute them to the resource scheduling system.
Optionally, the information parsing and management system extracts the target information from the crawling result based on a preconfigured extraction rule for the target information.
Optionally, the links to be crawled include links to be crawled without embedded links and links to be crawled with embedded links, and the resource scheduling system is adapted to preferentially select the links to be crawled without embedded links and distribute them to the intelligent crawling nodes.
Optionally, the resource scheduling system is adapted to add the links to be crawled into a candidate link set, and perform deduplication processing on each link in the candidate link set.
Optionally, the resource scheduling system performs deduplication processing on each link in the candidate link set based on a BerkeleyDB persistent storage technology.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
An embodiment of the present invention provides an information crawling system, including: a plurality of intelligent crawling nodes deployed on a plurality of dial-up virtual private servers, which dynamically switch the IP addresses of the dial-up virtual private servers based on information crawling time and information crawling state; and a resource scheduling system, which distributes links to be crawled to the intelligent crawling nodes based on a scheduling algorithm so that the intelligent crawling nodes perform crawling operations, and receives crawling results from the intelligent crawling nodes. By having the intelligent crawling nodes dynamically switch the IP addresses of the dial-up virtual private servers, the embodiment of the invention simulates access by real users, copes effectively with the identification strategies used in anti-crawling, and avoids being identified by the crawled party, thereby achieving stable data acquisition in a complex network environment. Furthermore, the intelligent crawling nodes are flexibly scheduled by the scheduling algorithm to perform crawling operations, which on the one hand balances the amount of data crawled by each intelligent crawling node and on the other hand makes dynamic capacity adjustment possible. Further, compared with the crawler proxy service of the Scrapy application framework, the embodiment of the invention switches IP addresses by means of dial-up virtual private servers, which costs less and is faster.
Further, the intelligent crawling node accesses the link to be crawled based on the web page engine, and the web page source code associated with the link to be crawled is dynamically rendered in the web page engine. Because the link to be crawled is accessed through the web page engine, the page is actually opened at the intelligent crawling node, so the dynamically rendered source code is obtained and dynamic page-generation strategies such as rendering are handled effectively.
Further, the information crawling system further includes: a URL distribution system adapted to receive the links to be crawled sent by the information parsing and management system and distribute them to the resource scheduling system. Based on the URL distribution system, the embodiment of the invention can adjust the number of links crawled by each intelligent crawling node in a timely manner, which further provides a feasible way to achieve dynamic capacity adjustment.
Further, the links to be crawled are divided into links to be crawled without embedded links and links to be crawled with embedded links, and the resource scheduling system is adapted to preferentially select the links to be crawled without embedded links and distribute them to the intelligent crawling nodes. By preferentially crawling links that do not generate new links, the embodiment of the invention keeps the stock of links within a reasonable range and effectively prevents a backlog of links to be crawled from crashing the system.
Drawings
FIG. 1 is a schematic structural diagram of an information crawling system according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an information crawling method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of another information crawling system according to an embodiment of the present invention.
Detailed Description
As described in the background, those skilled in the art understand that neither open-source nor closed-source crawler tools can deal effectively with anti-crawling measures, so stable data acquisition cannot be achieved.
Specifically, a large number of crawling systems (also called crawling frameworks) exist in the field of information crawling, and they fall mainly into open-source and closed-source crawling frameworks. Open-source crawling frameworks such as Crawler4j and WebMagic focus mainly on information extraction, cannot provide effective support against anti-crawling, and are easily blocked by the crawled website. Closed-source crawling frameworks, such as Octopus, rely on visual page operations, have difficulty handling complex web page environments, and struggle to meet industrial-grade requirements for crawl monitoring, continuous crawling, batch deployment, and customized data acquisition; they are better suited to data acquisition for small research projects.
With traditional countermeasures against anti-crawling, such as proxy IPs, the large number of high-quality IPs required for crawling is difficult to obtain and the IPs are easy to trace, so the real originating IP gets blocked and crawling through proxy IPs is not ideal. Although the mature Scrapy crawling framework can provide high-quality proxy IPs, they are expensive and mostly foreign IP addresses, so access to domestic websites is slow and easily restricted, and dynamic rendering of JS web pages through the IP proxy cannot be achieved.
An embodiment of the present invention provides an information crawling system, including: a plurality of intelligent crawling nodes deployed on a plurality of dial-up virtual private servers, which dynamically switch the IP addresses of the dial-up virtual private servers based on information crawling time and information crawling state; and a resource scheduling system, which distributes links to be crawled to the intelligent crawling nodes based on a scheduling algorithm so that the intelligent crawling nodes perform crawling operations, and receives crawling results from the intelligent crawling nodes.
By having the intelligent crawling nodes dynamically switch the IP addresses of the dial-up virtual private servers, the embodiment of the invention simulates access by real users, copes effectively with the identification strategies used in anti-crawling, and avoids being identified by the crawled party, thereby achieving stable data acquisition in a complex network environment.
Furthermore, the intelligent crawling nodes are flexibly scheduled by the scheduling algorithm to perform crawling operations, which on the one hand balances the amount of data crawled by each intelligent crawling node and on the other hand makes dynamic capacity adjustment possible.
Further, compared with the crawler proxy service of the Scrapy application framework, the embodiment of the invention switches IP addresses by means of dial-up virtual private servers, which costs less and is faster.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
A Uniform Resource Locator (URL) is a compact representation of the location of, and access method for, a resource available on the internet; it is the standard address of a resource on the internet. In this document, a link corresponds to a URL.
A crawler in this context refers to a program or script that automatically crawls web information according to certain rules.
Anti-crawling in this context refers to a prevention policy or prevention program for crawlers of web page source code by a crawled website.
Proxy IP in this context refers to the IP address of the proxy server.
JS dynamic rendering herein refers to page rendering, that is, the process in which a browser's rendering engine displays HyperText Markup Language (HTML) code in a browser window according to rules defined by Cascading Style Sheets (CSS).
URL deduplication herein refers to removing duplicate URL addresses.
An IP pool herein refers to a set of available IPs.
XML Path Language (XPATH) herein refers to a language used to locate parts of an Extensible Markup Language (XML) document.
A Virtual Private Server (VPS) herein refers to one of a plurality of virtual dedicated servers partitioned from a single server.
A dial-up VPS, also referred to herein as an Asymmetric Digital Subscriber Line (ADSL) dial-up server, refers to a VPS based on ADSL dynamic dialing.
Redis herein refers to an open-source, network-capable key-value database written in ANSI C that is memory-based, can also persist data, supports log-type storage, and provides Application Programming Interfaces (APIs) for multiple languages.
FIG. 1 is a schematic structural diagram of an information crawling system according to an embodiment of the present invention. The information crawling system 1 may crawl web page source code from a website.
Specifically, the information crawling system 1 may include a resource scheduling system 101 and a plurality of intelligent crawling nodes 102. The plurality of intelligent crawling nodes 102 may dynamically switch IP addresses; the resource scheduling system 101 may assign links to be crawled to the plurality of intelligent crawling nodes 102 to cause them to perform crawling operations, and receive crawling results from the respective intelligent crawling nodes 102. The crawling result includes web page source file information (e.g., web page source code).
In a specific implementation, the resource scheduling system 101 may be deployed in a remote server or a cloud server. The remote server or cloud server may be a single server or a server cluster consisting of multiple servers.
Those skilled in the art understand that a dial-up Virtual Private Server (VPS) has a public IP address and is connected to an external IP pool. Its IP address changes automatically each time the dial-up VPS is restarted, and the server hosting the dial-up VPS can be deployed in any geographic region, so it has natural regional and IP advantages and can more realistically simulate a user's internet access.
Considering factors such as the authenticity of the region and the authenticity, availability, and low cost of the external IP pool, in a specific implementation the present embodiment deploys the intelligent crawling nodes 102 on dial-up VPSs and establishes a crawling engine for each intelligent crawling node 102, so that each intelligent crawling node 102 performs crawling operations based on its crawling engine. Because a dial-up VPS switches IP addresses automatically, an intelligent crawling node 102 deployed on it can cope effectively with the anti-crawling strategy of the crawled party and thereby obtain a crawling result, such as web page source file information.
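The decision to switch IP addresses based on crawling time and crawling state can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class name `IpSwitchPolicy`, the time budget, the failure limit, and the choice of HTTP status codes 403 and 429 as signs of blocking are all assumptions, and the actual redial of the dial-up VPS is left to the caller.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IpSwitchPolicy:
    """Decides when a crawling node should obtain a new IP address (hypothetical helper)."""
    max_seconds_per_ip: float = 300.0   # assumed crawling-time budget per IP address
    max_failures_per_ip: int = 3        # assumed tolerance for blocked responses
    clock: Callable[[], float] = time.monotonic
    _started_at: float = field(init=False, default=0.0)
    _failures: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        self._started_at = self.clock()

    def record(self, status_code: int) -> None:
        """Track the crawling state: count responses that look like blocking."""
        if status_code in (403, 429):  # assumption: these codes indicate anti-crawling
            self._failures += 1

    def should_switch(self) -> bool:
        """Switch after too much crawling time or too many blocked responses."""
        too_long = self.clock() - self._started_at >= self.max_seconds_per_ip
        return too_long or self._failures >= self.max_failures_per_ip

    def reset(self) -> None:
        """Call once the dial-up VPS holds a new IP address."""
        self._started_at = self.clock()
        self._failures = 0
```

A node would call `record` after each response, check `should_switch` before the next request, and call `reset` after the IP address has changed.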
In a specific implementation, to balance the crawling volume and memory consumption of each intelligent crawling node 102 and keep both within a controllable range, the resource scheduling system 101 may, before assigning links to be crawled, jointly consider conditions such as the number of intelligent crawling nodes 102 (for example, idle intelligent crawling nodes 102), the historical crawling records for the same domain name, the geographic distribution of the intelligent crawling nodes 102, and the control status of the intelligent crawling nodes 102. This enables dynamic scheduling of the intelligent crawling nodes 102 and remote transmission of crawling results, and ensures stable operation of the intelligent crawling nodes 102.
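One way such a scheduling decision might look is sketched below. The source does not specify the scheduling algorithm, so this example assumes a simple rule: among idle nodes, pick the one that has crawled the link's domain least often, breaking ties by total load. The `Scheduler` class and its method names are hypothetical, and geographic distribution is omitted for brevity.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


class Scheduler:
    """Assigns links to idle nodes using per-domain crawl history (assumed rule)."""

    def __init__(self, node_ids):
        self.idle = set(node_ids)
        self.domain_history = defaultdict(Counter)  # node id -> domain -> crawl count
        self.total = Counter()                      # node id -> total links assigned

    def assign(self, url):
        """Return the id of the idle node chosen for url, or None if no node is idle."""
        if not self.idle:
            return None
        domain = urlsplit(url).hostname or ""
        # Prefer the node with the fewest past crawls of this domain, then the least loaded.
        node = min(sorted(self.idle),
                   key=lambda n: (self.domain_history[n][domain], self.total[n]))
        self.idle.discard(node)
        self.domain_history[node][domain] += 1
        self.total[node] += 1
        return node

    def release(self, node):
        """Mark a node as idle again once it has reported its crawling result."""
        self.idle.add(node)
```

Spreading one domain across nodes in this way means the crawled site sees requests from several IP addresses rather than a burst from one.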
In one embodiment, the remote server or the cloud server may, through the resource scheduling system 101, select at least some of the idle intelligent crawling nodes 102 and assign the links to be crawled to them. The selected intelligent crawling nodes can then crawl the web page source code of the pages to which the links to be crawled point.
In a specific implementation, the intelligent crawling node 102 may dynamically open or close a web engine based on the memory capacity of the deployed dial-up VPS and the amount of information crawled.
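As a rough sketch of how a node might decide when to open or close its web page engine, the controller below closes the engine when free memory is low, when no links are pending, or after a fixed number of rendered pages, and reopens it when work and memory are available. The thresholds, the class name `WebEngineController`, and the per-session page limit are assumptions; starting and stopping a real browser engine is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class WebEngineController:
    """Tracks whether the web page engine should be running (hypothetical helper)."""
    min_free_mb: float = 512.0        # assumed minimum free memory to keep the engine open
    max_pages_per_session: int = 200  # assumed page count after which the engine is recycled
    running: bool = False
    pages_rendered: int = 0

    def page_done(self) -> None:
        """Record that one more page was rendered in the current engine session."""
        self.pages_rendered += 1

    def update(self, free_memory_mb: float, pending_links: int) -> bool:
        """Return True if the engine should be running after this check."""
        if self.running:
            exhausted = self.pages_rendered >= self.max_pages_per_session
            if free_memory_mb < self.min_free_mb or exhausted or pending_links == 0:
                self.running = False      # close the engine to reclaim memory
                self.pages_rendered = 0
        elif pending_links > 0 and free_memory_mb >= self.min_free_mb:
            self.running = True           # open the engine for the pending work
        return self.running
```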
In a specific implementation, the intelligent crawling node 102 may access a link to be crawled based on the web page (web) engine, so as to implement an anthropomorphic internet access operation. The web page engine may be configured to dynamically render the web page source code associated with the link to be crawled. Further, the intelligent crawling node 102 may obtain dynamically rendered associated web page source code.
After the intelligent crawling node 102 obtains the web page source file information (e.g., web page source code) from the link to be crawled, it may send that information to the cloud server or remote server, where the target information and any links contained in the web page source file information are extracted (e.g., when the link to be crawled has embedded links). The cloud server or remote server may then run the resource scheduling system 101 to deduplicate and schedule the extracted links so as to crawl more data.
Further, the information crawling system 1 may further include an information parsing and management system (not shown) adapted to extract target information from the crawling result and parse the plurality of links to be crawled from the crawling result.
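Parsing links to be crawled out of a crawling result can be illustrated with Python's standard `html.parser` module. This is only a sketch: the names `LinkExtractor` and `extract_links` are hypothetical, and the choices to keep only `http`/`https` links and to drop URL fragments are assumptions not stated in the source.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from anchor tags, in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def extract_links(page_source, base_url):
    """Return the distinct links embedded in page_source."""
    parser = LinkExtractor(base_url)
    parser.feed(page_source)
    return parser.links
```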
In a specific implementation, the information parsing and management system may extract the target information from the crawling result based on preconfigured extraction rules for the target information. When faced with a wide variety of web pages, the information parsing and management system preserves the user's freedom to customize while encapsulating the underlying modules as far as possible, so that users need only define the rules for specific web pages and need not be concerned with how the background is implemented.
In another embodiment, the intelligent crawling node 102 may extract target information from the crawling result based on preconfigured extraction rules for the target information. In a specific implementation, the intelligent crawling node 102 may extract the target information from the crawling result based on an XPATH language.
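Rule-based extraction with XPATH can be sketched with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; real web pages would normally need an HTML-aware parser. The rule names, the paths in `RULES`, and the function `extract_target` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Preconfigured extraction rules: field name -> XPath expression (illustrative values).
RULES = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}


def extract_target(page_source, rules):
    """Apply each rule to the page and return the text of the first match, or None."""
    root = ET.fromstring(page_source)  # assumes well-formed XHTML
    result = {}
    for field_name, path in rules.items():
        element = root.find(path)
        has_text = element is not None and element.text
        result[field_name] = element.text.strip() if has_text else None
    return result
```

Under this arrangement a user customizes only the rule table, which matches the idea that the extraction machinery itself stays hidden behind the configuration.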
Further, the information crawling system 1 may further include a URL distribution system (not shown). The URL distribution system is adapted to receive the links to be crawled sent by the information parsing and management system and distribute them to the resource scheduling system 101.
In a specific implementation, the links to be crawled may include links to be crawled without embedded links and links to be crawled with embedded links, and the resource scheduling system 101 is adapted to preferentially select the links to be crawled without embedded links and allocate them to the intelligent crawling nodes 102.
Specifically, the resource scheduling system 101 may add the links to be crawled into a candidate link set, perform deduplication processing on each link in the candidate link set, and then schedule each deduplicated link by the URL distribution system.
In one embodiment, the resource scheduling system 101 may deduplicate the links in the candidate link set based on Berkeley DB (BerkeleyDB) persistent storage technology, determining whether each link has already been crawled. If a link has been crawled, it may be deleted to avoid repeated crawling. Links that remain after deduplication are kept in the candidate link set for subsequent crawling. In one embodiment, the candidate link set may be stored in Redis, or stored locally on the remote server or cloud server.
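Persistent URL deduplication can be illustrated as follows. Note that this sketch deliberately substitutes Python's standard `dbm` module for BerkeleyDB, since both are on-disk key-value stores and `dbm` needs no extra installation; the class `SeenUrlStore`, the function `deduplicate`, and the use of SHA-1 digests as keys are assumptions.

```python
import dbm
import hashlib


class SeenUrlStore:
    """On-disk record of URLs already seen (dbm used as a stand-in for BerkeleyDB)."""

    def __init__(self, path):
        self._db = dbm.open(path, "c")  # "c": create the database if it does not exist

    def add_if_new(self, url):
        """Record url and return True if it was not seen before, else return False."""
        key = hashlib.sha1(url.encode("utf-8")).hexdigest().encode("ascii")
        if key in self._db:
            return False
        self._db[key] = b"1"
        return True

    def close(self):
        self._db.close()


def deduplicate(urls, store):
    """Return urls with already-seen entries removed, preserving order."""
    return [url for url in urls if store.add_if_new(url)]
```

Because the store lives on disk, links crawled in an earlier run are still recognized as duplicates after a restart.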
Those skilled in the art understand that, in practical applications, the resource scheduling system 101 may perform deduplication on each crawled link, and then add each deduplicated link into the candidate link set. It should be noted that, on the basis of the deduplication performed by the resource scheduling system 101, an incremental link judgment mechanism may also be established to implement incremental link crawling and crawling stop.
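The source does not describe how the incremental link judgment mechanism works. One plausible reading, sketched below under that assumption, is to stop crawling a source once several consecutive pages yield no previously unseen links. The class `IncrementalStop`, the `patience` parameter, and the in-memory `seen` set (a real system would use the persistent deduplication store) are all hypothetical.

```python
class IncrementalStop:
    """Stops incremental crawling after `patience` consecutive pages with no new links."""

    def __init__(self, patience=2):
        self.seen = set()
        self.patience = patience
        self.stale_pages = 0

    def observe(self, links):
        """Return the links not seen before and update the stale-page counter."""
        new_links = [url for url in dict.fromkeys(links) if url not in self.seen]
        self.seen.update(new_links)
        self.stale_pages = 0 if new_links else self.stale_pages + 1
        return new_links

    @property
    def should_stop(self):
        return self.stale_pages >= self.patience
```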
Thereafter, the resource scheduling system 101 may use at least a part of the links in the candidate link set as links to be crawled next time.
Those skilled in the art will appreciate that in most cases, the web page source file information crawled by the intelligent crawling node 102 will contain a large number of embedded links. Since the crawling speed for each link is limited, the resource scheduling system 101 must decide, when scheduling links, the order in which the stock links in the candidate link set are assigned.
Specifically, the cloud server or the remote server may determine whether the links added to the candidate link set themselves contain embedded links. For example, the candidate link set may include a plurality of first links to be crawled, which have embedded links, and a plurality of second links to be crawled, which do not. In this case, the URL scheduling module may preferentially select the second links to be crawled as the links to be crawled next, and allocate them to one or more intelligent crawling nodes 102. The stock of links can thus be kept within a suitable threshold, preventing a large backlog of links to be crawled and avoiding a system crash.
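The preference for links without embedded links can be sketched as a priority queue. The example assumes the scheduler already knows, for each link, whether it has embedded links (for instance from a URL pattern that distinguishes list pages from detail pages); how that flag is obtained is not specified in the source, and the class `CandidateLinkSet` is hypothetical.

```python
import heapq
from itertools import count


class CandidateLinkSet:
    """Candidate links ordered so that links without embedded links come out first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker that preserves insertion order

    def add(self, url, has_embedded_links):
        # False sorts before True, so links without embedded links have priority.
        heapq.heappush(self._heap, (has_embedded_links, next(self._order), url))

    def next_link(self):
        """Return the next link to crawl, or None when the set is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None

    def __len__(self):
        return len(self._heap)
```

Draining the links that cannot add new work before the ones that can is what keeps the stock of links from growing without bound.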
In a specific implementation, when the resource scheduling system 101 assigns the links to be crawled, a directed graph of the links may be built up as the links are crawled, with weights set according to link level. When the number of stock links is less than a first preset threshold, the stock links may be scheduled based on a backward weighting algorithm; when the number of stock links is greater than or equal to the first preset threshold, the stock links may be scheduled based on a forward weighting algorithm.
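The source does not define the two weighting algorithms, so the following is only one possible interpretation. It assumes that a link's weight is its level in the directed graph (distance from a seed link), that a small stock is scheduled shallowest-first so that new links keep being discovered, and that a large stock is scheduled deepest-first so that the backlog drains. Which ordering corresponds to the "backward" or "forward" weighting of the text is not known; `LinkGraph` and `schedule` are hypothetical names.

```python
from collections import defaultdict


class LinkGraph:
    """Directed graph of links; each link's weight is its level below a seed link."""

    def __init__(self, seeds):
        self.children = defaultdict(list)
        self.level = {url: 0 for url in seeds}

    def add_edge(self, parent, child):
        """Record that child was found on parent (parent must already be in the graph)."""
        self.children[parent].append(child)
        candidate = self.level[parent] + 1
        # Keep the shallowest level at which the child has been discovered.
        self.level[child] = min(self.level.get(child, candidate), candidate)


def schedule(stock, graph, threshold):
    """Order stock links by level: shallowest first below the threshold, else deepest first."""
    deepest_first = len(stock) >= threshold
    return sorted(stock, key=lambda url: graph.level[url], reverse=deepest_first)
```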
In one embodiment, for all the intelligent crawling nodes 102, if the resource scheduling system 101 suspends assigning the link to be crawled for a certain intelligent crawling node 102, the intelligent crawling node 102 may suspend crawling, and may shut down and restart to switch IP addresses.
In another embodiment, for all dial-up VPSs, if the resource scheduling system 101 reassigns the link to be crawled to a certain intelligent crawling node 102, the intelligent crawling node 102 automatically changes an IP address, and crawls the web page source file information associated with the link to be crawled based on the changed IP address.
Fig. 2 is a flowchart illustrating an information crawling method according to an embodiment of the present invention. The information crawling method can be executed by a cloud server or a remote server. Specifically, the information crawling method may include the steps of:
step S201, determining a link to be crawled;
step S202, distributing the links to be crawled to a plurality of intelligent crawling nodes based on a scheduling algorithm so as to enable the intelligent crawling nodes to execute crawling operation;
and step S203, receiving the crawling results from the intelligent crawling nodes.
The intelligent crawling nodes are deployed on a plurality of dial-up virtual private servers and dynamically switch the IP addresses of the dial-up virtual private servers based on information crawling time and information crawling state.
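Steps S201 to S203 can be summarized in a few lines. In this sketch round-robin distribution stands in for the unspecified scheduling algorithm, and `fetch` is a hypothetical callable representing a crawling operation performed by a node; neither is taken from the source.

```python
from itertools import cycle


def crawl_round(links, nodes, fetch):
    """Distribute links to nodes (S202) and collect their crawling results (S203).

    links: the links to be crawled determined in S201.
    nodes: identifiers of the intelligent crawling nodes.
    fetch: hypothetical callable (node, url) -> crawling result.
    """
    results = {}
    # Round-robin is used here only as a placeholder scheduling algorithm.
    for url, node in zip(links, cycle(nodes)):
        results[url] = fetch(node, url)
    return results
```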
More specifically, the cloud server or the remote server may deploy a plurality of intelligent crawling nodes 102 based on a dial-up VPS, and establish a crawling engine at each intelligent crawling node 102.
In step S201, a link to be crawled may be determined. Those skilled in the art will appreciate that, in the initial stage, the links to be crawled may be preconfigured URLs such as www.sina.com.cn, www.baidu.com, and so on.
In step S202, the links to be crawled may be assigned to each intelligent crawling node 102 or part of the intelligent crawling nodes 102 based on a scheduling algorithm so as to perform crawling operation.
And then, each intelligent crawling node 102 receiving the link to be crawled can execute crawling operation and report a crawling result to the cloud server or the remote server.
The cloud server or the remote server may obtain the crawling results from each of the intelligent crawling nodes 102 in step S203. The crawl results may include web page source file information.
In one embodiment, the web page source file information may contain new links to be crawled. For example, the links to be crawled assigned to the intelligent crawling node 102 can be classified into links to be crawled with embedded links and links to be crawled without embedded links. If the link to be crawled is a link to be crawled with embedded links, the link to be crawled can be added into a candidate link set.
Preferably, before adding the to-be-crawled link with the embedded link into the candidate link set, the to-be-crawled link may be subjected to deduplication processing, and then the deduplicated to-be-crawled link is added into the candidate link set.
In one embodiment, the deduplication processing may be performed on each link in the candidate link set based on a BerkeleyDB persistent storage technique.
In another embodiment, if the links to be crawled include links to be crawled without embedded links (second links to be crawled), the second links to be crawled may be preferentially assigned to at least some of the plurality of intelligent crawling nodes 102.
For the intelligent crawling node 102, if the received links to be crawled include a first link to be crawled having embedded links and a second link to be crawled not having embedded links, the intelligent crawling node 102 may preferentially perform the crawling operation on the second link to be crawled, and then perform the crawling operation on the first link to be crawled.
Those skilled in the art will appreciate that target information may also be extracted from the web page source file information. Specifically, the target information may be extracted from the crawling result based on a preconfigured extraction rule of the target information. In one embodiment, the target information may be extracted based on an XPATH language.
FIG. 3 is a schematic structural diagram of another information crawling system according to an embodiment of the present invention. The information crawling system 2 may include a resource scheduling system 201 and a plurality of intelligent crawling nodes 202 (fig. 3 shows only one intelligent crawling node 202).
The resource scheduling system 201 may include a URL scheduling module 2011. In an initial stage, the resource scheduling system 201 may schedule preset links to be crawled based on the URL scheduling module 2011, and allocate each preset link to be crawled to each intelligent crawling node 202.
The intelligent crawling node 202 may acquire the allocated links to be crawled, perform a crawling operation on them based on the crawling engine 2021 and the web page engine 2022, and upload the crawling result (for example, the web page source code 2031) to the information parsing and management system 203. The information parsing and management system 203 may then obtain the target information, for example by processing the web page source code 2031 using the XPath language and extracting the required target information from it.
Those skilled in the art will appreciate that in particular implementations, the intelligent crawling node 202 may reschedule the assigned links to be crawled based on the crawling engine 2021, preferentially crawling links that do not have embedded links.
Further, the information parsing and management system 203 may upload the obtained URLs to the URL distribution system 204. The URL distribution system 204 is adapted to distribute the extracted links to be crawled to the resource scheduling system 201.
Further, the resource scheduling system 201 may obtain a plurality of links from the URL distribution system 204.
Further, the resource scheduling system 201 can perform the deduplication processing through the URL deduplication module 2013. Preferably, efficient deduplication of massive numbers of URLs can be achieved based on the Berkeley DB persistent storage technique. In addition, an incremental judgment mechanism can be established on the basis of URL deduplication, so that incremental crawling and the stopping of crawling are both performed automatically.
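One possible form of such an incremental judgment is sketched below, assuming Python. The stopping criterion is an assumption made purely for illustration, since the text does not specify one: pages are visited newest first, and crawling stops at the first page that yields no link absent from the deduplication record. The function name `incremental_crawl` and its arguments are hypothetical.

```python
def incremental_crawl(pages, seen):
    """Collect only new links and stop once a page contributes nothing new.

    pages: iterable of link lists, one list per fetched page, newest page first.
    seen:  set of links already recorded by the deduplication step; it is
           updated in place with every newly collected link.
    """
    collected = []
    for page_links in pages:
        # dict.fromkeys removes duplicates within the page while keeping order.
        fresh = [url for url in dict.fromkeys(page_links) if url not in seen]
        if not fresh:
            # Assumed criterion: nothing new here, so older pages are taken
            # to have been covered by a previous crawling run.
            break
        seen.update(fresh)
        collected.extend(fresh)
    return collected


# Usage: "u1" and "u2" were crawled in an earlier run.
already_seen = {"u1", "u2"}
new_links = incremental_crawl(
    [["u5", "u4"], ["u3", "u2"], ["u2", "u1"], ["u0"]], already_seen
)
```

In this usage example the third page contains only known links, so the loop ends there and the fourth page is never examined.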
Further, the resource scheduling system 201 may add the deduplicated URL (i.e., the deduplicated link) to the candidate link set 2012.
Further, the resource scheduling system 201 may obtain the links for the next round of crawling from the candidate link set 2012 based on the URL scheduling module 2011, and send these links to the intelligent crawling nodes 202, so that the intelligent crawling nodes 202 perform the next crawling operation. Preferably, the URL scheduling module 2011 may preferentially schedule links without embedded links and assign them to one or more intelligent crawling nodes 202.
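The preferential scheduling described above can be sketched as follows, assuming Python. Round-robin assignment is a simplification chosen for illustration and is not stated in the text; the function `schedule_links`, the node identifiers, and the tuple layout are hypothetical. A fuller scheduler would also weigh factors such as the number of nodes, per-domain crawling history, regional distribution, and concurrency.

```python
from itertools import cycle


def schedule_links(candidates, node_ids):
    """Assign candidate links to nodes, links without embedded links first.

    candidates: list of (url, has_embedded_links) tuples.
    node_ids:   list of intelligent crawling node identifiers.
    Returns a dict mapping each node id to its ordered list of URLs.
    """
    # sorted() is stable and False sorts before True, so links without
    # embedded links come first while the original order is kept otherwise.
    ordered = sorted(candidates, key=lambda item: item[1])
    assignment = {node: [] for node in node_ids}
    # Round-robin distribution (an assumed policy, for illustration only).
    for (url, _has_embedded), node in zip(ordered, cycle(node_ids)):
        assignment[node].append(url)
    return assignment


plan = schedule_links(
    [("list1", True), ("detail1", False), ("list2", True), ("detail2", False)],
    ["node-a", "node-b"],
)
```

Each node therefore receives its links without embedded links at the head of its list, ahead of any link that has embedded links.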
In this way, on the basis of studying anti-crawling mechanisms, the embodiment of the invention establishes dynamic intelligent crawling nodes in combination with dial-up VPSs, and realizes cross-region, human-like access based on human-like redial and restart strategies, so that a crawled website cannot identify the crawling operation and the anti-crawling barriers of most websites are effectively broken through. The resource scheduling system can balance the crawling demand against the available intelligent crawling nodes, and the resource scheduling system and the intelligent crawling nodes together form a dynamic crawling engine, which ensures that the engine runs efficiently and continuously.
Furthermore, the information crawling system provided by the embodiment of the invention can effectively break through the anti-crawler barriers of most website systems, can achieve continuous and stable crawling in a complex network environment at a lower cost than traditional open-source and closed-source crawler frameworks, and is suitable for industrial-grade crawling of network information.
Further, the embodiment of the present invention also discloses a storage medium on which computer instructions are stored; when the computer instructions are executed, the technical solution of the method in the embodiment shown in FIG. 2 is performed. Preferably, the storage medium may include a computer-readable storage medium such as a non-volatile memory or a non-transitory memory. The computer-readable storage medium may include ROM, RAM, magnetic or optical disks, and the like.
Further, an embodiment of the present invention also discloses a server, which includes a memory and a processor, where the memory stores computer instructions capable of being executed on the processor, and the processor performs the technical solution of the method in the embodiment shown in FIG. 2 when executing the computer instructions. Specifically, the server may be a cloud server on which at least the resource scheduling system is deployed, so as to schedule each intelligent crawling node to perform crawling operations.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. An information crawling system, comprising:
a plurality of intelligent crawling nodes, deployed on a plurality of dial-up virtual private servers, the intelligent crawling nodes dynamically switching the IP addresses of the dial-up virtual private servers based on information crawling time and information crawling state;
a resource scheduling system, which distributes links to be crawled to the intelligent crawling nodes based on a scheduling algorithm so that the intelligent crawling nodes execute crawling operations, and which receives crawling results from the intelligent crawling nodes;
wherein, before allocating the links to be crawled, the resource scheduling system comprehensively considers the number of intelligent crawling nodes, historical crawling records of the same domain name, the regional distribution of the intelligent crawling nodes, and concurrency control of the intelligent crawling nodes, so as to realize dynamic scheduling of the intelligent crawling nodes and remote transmission of crawling results; and
wherein the intelligent crawling node dynamically opens or closes a web page engine based on the memory capacity of the dial-up virtual private server on which it is deployed and on the amount of information crawled, accesses the links to be crawled based on the web page engine, and dynamically renders, in the web page engine, the web page source code associated with the links to be crawled.
2. The information crawling system of claim 1, wherein the crawling results contain a plurality of links to be crawled, the information crawling system further comprising:
an information parsing and management system, adapted to extract target information from the crawling results and to parse the plurality of links to be crawled from the crawling results.
3. The information crawling system of claim 2, further comprising:
a URL distribution system, adapted to receive the links to be crawled sent by the information parsing and management system and to distribute the links to the resource scheduling system.
4. The information crawling system according to claim 2, wherein the information parsing and management system extracts the target information from the crawling result based on a preconfigured extraction rule of the target information.
5. The information crawling system according to claim 2, wherein the links to be crawled comprise links to be crawled without embedded links and links to be crawled with embedded links, and the resource scheduling system is adapted to preferentially select the links to be crawled without embedded links and allocate them to the intelligent crawling nodes.
6. The information crawling system of claim 2, wherein the resource scheduling system is adapted to add the links to be crawled to a set of candidate links and perform de-duplication processing on each link in the set of candidate links.
7. The information crawling system of claim 6, wherein the resource scheduling system performs deduplication processing on each link in the candidate link set based on the Berkeley DB persistent storage technique.
CN201910510474.3A 2019-06-13 2019-06-13 Information crawling system Active CN111339388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910510474.3A CN111339388B (en) 2019-06-13 2019-06-13 Information crawling system

Publications (2)

Publication Number Publication Date
CN111339388A CN111339388A (en) 2020-06-26
CN111339388B true CN111339388B (en) 2021-07-27

Family

ID=71185076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910510474.3A Active CN111339388B (en) 2019-06-13 2019-06-13 Information crawling system

Country Status (1)

Country Link
CN (1) CN111339388B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112637049A (en) * 2020-12-16 2021-04-09 广州索答信息科技有限公司 Data capture system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102264077A (en) * 2011-07-22 2011-11-30 华为技术有限公司 Node deployment method and node of sensor network
CN108804505A (en) * 2018-04-12 2018-11-13 阿里巴巴集团控股有限公司 Data processing method, terminal device and server
CN109815384A (en) * 2019-01-29 2019-05-28 携程旅游信息技术(上海)有限公司 Method, system, equipment and the storage medium that crawler is realized

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101969445B (en) * 2010-11-03 2014-12-17 中国电信股份有限公司 Method and device for defensing DDoS (Distributed Denial of Service) and CC (Connections Flood) attacks
CN102355488B (en) * 2011-08-15 2014-01-22 北京星网锐捷网络技术有限公司 Crawler seed obtaining method and equipment and crawler crawling method and equipment
US9304885B2 (en) * 2013-06-18 2016-04-05 International Business Machines Corporation Passive monitoring of virtual systems using agent-less, near-real-time indexing
US20160162596A1 (en) * 2014-09-05 2016-06-09 Hamlet Francisco Batista Reyes System and Method for Real-time Search Engine Optimization Issue Detection and Correction
CN106126688B (en) * 2016-06-29 2020-03-24 厦门趣处网络科技有限公司 Intelligent network information acquisition system and method based on WEB content and structure mining
CN106803167A (en) * 2017-02-28 2017-06-06 深圳海带宝网络科技股份有限公司 A kind of cross-border electric business whole world goods clear customs system
CN107025296B (en) * 2017-04-17 2018-11-06 山东辰华科技信息有限公司 Based on science service information intelligent grasping system method of data capture
CN107729508A (en) * 2017-10-23 2018-02-23 北京京东金融科技控股有限公司 Information crawler method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
提供动态IP服务的行为定性;门美子;《中国检察官》;20180320(第288期);08-12 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant