CN113239253B - Method, system, computing device and storage medium for realizing web crawler - Google Patents

Method, system, computing device and storage medium for realizing web crawler

Info

Publication number
CN113239253B
Authority
CN
China
Prior art keywords
message queue
web
urls
webpages
web crawler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110383241.9A
Other languages
Chinese (zh)
Other versions
CN113239253A (en
Inventor
王新章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Pierbulaini Software Co ltd
Original Assignee
Beijing Pierbulaini Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Pierbulaini Software Co ltd filed Critical Beijing Pierbulaini Software Co ltd
Priority to CN202110383241.9A priority Critical patent/CN113239253B/en
Publication of CN113239253A publication Critical patent/CN113239253A/en
Application granted granted Critical
Publication of CN113239253B publication Critical patent/CN113239253B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web crawler implementation method suitable for execution in a web crawler implementation system. The method comprises the following steps: the message queue server receives a web crawler task from a user; the message queue server establishes a plurality of message queues according to a plurality of query keywords of the web crawler task; the web page collector crawls, according to a query keyword, a plurality of web pages related to that query keyword from the Internet and sends the URLs of the web pages to the message queue server; the message queue server receives the URLs of the web pages and stores them in the message queue corresponding to the target query keyword; the content grabber acquires the URLs of a plurality of web pages related to the query keyword from a message queue, downloads the web pages according to the URLs, grabs information from the web pages, and generates a web crawler result. The invention also discloses a web crawler implementation system, a computing device and a computer-readable storage medium.

Description

Method, system, computing device and storage medium for realizing web crawler
Technical Field
The present invention relates to the field of internet, and in particular, to a web crawler implementation system, a web crawler implementation method, a computing device, and a storage medium.
Background
With the development of internet technology, retrieval through the internet has become an important means for obtaining information. When retrieving information from the internet, crawling information from web pages in the internet is a key step. Thus, web crawler technology has evolved.
In existing web crawler technology, web pages are often crawled with designs based on the Crawler4j framework or the Nutch framework, which are convenient to use. However, a design based on the Crawler4j framework cannot deploy web crawlers in a distributed manner, so crawling efficiency and crawling speed drop when many web pages need to be crawled. A design based on the Nutch framework cannot accurately extract the target content of a web page after crawling it, so effective data cannot be extracted efficiently.
For this reason, a new web crawler implementation method and system are needed.
Disclosure of Invention
To this end, the present invention provides a web crawler implementation method and system in an effort to solve or at least alleviate the above-identified problems.
According to one aspect of the present invention, there is provided a web crawler implementation method adapted to be executed in a web crawler implementation system including a message queue server, and a plurality of web page collectors and a plurality of content grabbers communicatively connected to the message queue server, the method comprising the steps of: the message queue server receives a web crawler task from a user, wherein the web crawler task comprises a plurality of query keywords; the message queue server establishes a plurality of message queues according to the plurality of query keywords of the web crawler task, wherein each query keyword corresponds to one message queue; the web page collector crawls, according to one query keyword in the web crawler task, a plurality of web pages related to that query keyword from the Internet, and sends the URLs of the web pages to the message queue server; the message queue server receives the URLs of the web pages and stores them in the message queue corresponding to the target query keyword; the content grabber acquires the URLs of a plurality of web pages related to the query keyword from one message queue in the message queue server, downloads the web pages according to the URLs, grabs information from the web pages, and generates a web crawler result.
Optionally, in the method according to the present invention, the step in which the web page collector crawls a plurality of web pages related to the query keyword from the Internet includes: the web page collector constructs a regular expression of the query keyword; and web pages with the same topic as the query keyword are matched according to the regular expression.
Optionally, in the method according to the present invention, the step in which the content grabber acquires the URLs of a plurality of web pages related to the query keyword from one message queue in the message queue server includes: the content grabber establishes a message channel with one message queue in the message queue server, and obtains the URLs of a plurality of web pages related to the target query keyword through the message channel.
Optionally, in the method according to the present invention, the step in which the content grabber grabs information from a plurality of web pages and generates the web crawler result includes: the content grabber generates a configuration file according to one query keyword in the web crawler task; target information is grabbed from each web page according to the configuration file to obtain a target information set for the query keyword; and the target information set is deduplicated, and the deduplicated target information set forms the web crawler result for the query keyword.
Optionally, in the method according to the present invention, the system further comprises a URL analyzer, the plurality of web page collectors in the system being communicatively connected to the message queue server through the URL analyzer, the method further comprising the steps of: the web page collector sends the URLs of a plurality of web pages to the URL analyzer; the URL analyzer filters and deduplicates the URLs of the web pages sent by the web page collector according to preset rules; the URL analyzer sends the filtered and deduplicated URLs to the message queue server.
Optionally, in the method according to the present invention, the system further comprises a storage server communicatively connected to the plurality of content grabbers, the method further comprising: the content grabber stores the web crawler results in the storage server so that the user can query and retrieve them.
According to another aspect of the present invention, there is provided a web crawler implementation system, the system comprising a message queue server, and a plurality of web page collectors and a plurality of content grabbers communicatively connected to the message queue server, wherein the message queue server is adapted to receive a web crawler task from a user, the web crawler task comprising a plurality of query keywords, and to establish a plurality of message queues according to the plurality of query keywords of the web crawler task, each query keyword corresponding to one message queue; the web page collector is adapted to crawl, according to one query keyword in the web crawler task, a plurality of web pages related to that query keyword from the Internet, and to send the URLs of the web pages to the message queue server; the message queue server further comprises an exchanger, the exchanger being adapted to receive the URLs of the web pages and store them in the message queue corresponding to the target query keyword; the content grabber is adapted to acquire the URLs of a plurality of web pages related to the query keyword from one message queue in the message queue server, download the web pages according to the URLs, grab information from the web pages, and generate a web crawler result.
Optionally, in the system according to the invention, the web page collector is further adapted to: construct a regular expression of the query keyword; and match web pages with the same topic as the query keyword according to the regular expression.
Optionally, in the system according to the invention, the content grabber is further adapted to: establish a message channel with one message queue in the message queue server, and acquire the URLs of a plurality of web pages related to the target query keyword through the message channel.
Optionally, in the system according to the invention, the content grabber is further adapted to: generating a configuration file according to one query keyword in the web crawler task; capturing target information for each webpage according to the configuration file to obtain a target information set of the query keywords; and de-duplicating the target information set, and forming the de-duplicated target information set into a web crawler result of the query keyword.
Optionally, the system according to the present invention further comprises a URL analyzer, the plurality of web page collectors in the system being communicatively connected to the message queue server through the URL analyzer, the web page collectors being further adapted to send the URLs of the plurality of web pages to the URL analyzer; the URL analyzer is adapted to filter and deduplicate the URLs of the web pages sent by the web page collectors according to preset rules, and to send the filtered and deduplicated URLs to the message queue server.
Optionally, the system according to the present invention further comprises a storage server communicatively connected to the plurality of content grabbers, the storage server being adapted to receive and store the web crawler results sent by the content grabbers so that the user can query and retrieve them.
According to yet another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of implementing a web crawler according to the present invention.
According to yet another aspect of the present invention, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of web crawler implementation according to the present invention.
The web crawler implementation system comprises a plurality of web page collectors and a plurality of content grabbers that can be deployed in a distributed manner, all of which are communicatively connected to a message queue server. Web crawler tasks received by the message queue server can be divided among the web page collectors and content grabbers and executed across multiple servers, so that target web pages are crawled quickly; the approach is suited to completing large volumes of heavy web crawler tasks. Moreover, the system can be expanded quickly by deploying additional web page collectors, which improves the web page crawling capability of the web crawler implementation system.
The web crawler implementation system also uses the message queue server as middleware between the web page collectors and the content grabbers, so that web page crawling and web page content analysis are decoupled and the web page collectors and content grabbers cooperate more efficiently.
Further, the web crawler implementation system also comprises a URL analyzer, through which the web page collectors are communicatively connected to the message queue server. The URL analyzer filters and deduplicates the web pages crawled by the web page collectors and sends the URLs of the filtered and deduplicated web pages to the message queue server, ensuring that the web pages collected by the web page collectors are all web pages needed for information collection.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which set forth the various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to fall within the scope of the claimed subject matter. The above, as well as additional objects, features, and advantages of the present disclosure will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Like reference numerals generally refer to like parts or elements throughout the present disclosure.
FIG. 1 illustrates a schematic diagram of a web crawler implementation system 100 in accordance with one exemplary embodiment of the present invention;
FIG. 2 illustrates a block diagram of a computing device 200 according to an exemplary embodiment of the invention; and
FIG. 3 illustrates a flow diagram providing a web crawler implementation method 300 according to an exemplary embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. Like reference numerals generally refer to like parts or elements.
FIG. 1 illustrates a schematic diagram of a web crawler implementation system 100 according to one exemplary embodiment of the present invention. As shown in FIG. 1, system 100 includes web page collectors 111-113, URL analyzer 120, message queue server 130, content grabbers 141-143, and storage server 150. The web page collectors 111 to 113 are each communicatively connected to the URL analyzer 120, the message queue server 130 is communicatively connected to the URL analyzer 120 and to the content grabbers 141 to 143, and the content grabbers 141 to 143 are each communicatively connected to the storage server 150. The manner in which the web crawler implementation system 100 shown in FIG. 1 is connected is merely exemplary, and the present invention does not limit the number of web page collectors and content grabbers included in the web crawler implementation system 100.
According to one embodiment of the present invention, the message queue server 130 may be implemented as a RabbitMQ server, although the present invention does not limit the type of message queue server 130. Message queue server 130 receives web crawler tasks from users, each task including a plurality of query keywords. The message queue server 130 provides a communication interface for communicating with users; when a user deploys a web crawler task according to service needs, the user can send a plurality of packaged query keywords to the message queue server 130 through the communication interface. Each query keyword in the web crawler task corresponds to a topic of the web pages to be crawled by a web page collector. Message queue server 130 includes message queue module 132. The message queue server 130 establishes a plurality of message queues in the message queue module 132 according to the plurality of query keywords of the web crawler task, each query keyword corresponding to one message queue, and each message queue receiving the URLs of pages crawled by a web page collector.
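The one-queue-per-keyword arrangement can be illustrated with a minimal Python sketch. It is an in-memory stand-in, not a RabbitMQ client: the class name `MessageQueueModule`, the method `create_queues`, the task dictionary layout, and the keyword strings are all assumptions made for illustration.

```python
from collections import deque


class MessageQueueModule:
    """In-memory stand-in for message queue module 132 (hypothetical class)."""

    def __init__(self):
        # Maps each query keyword to its own FIFO queue of URLs.
        self.queues = {}

    def create_queues(self, keywords):
        """Create one empty queue per query keyword and return the queue names."""
        for keyword in keywords:
            self.queues.setdefault(keyword, deque())
        return list(self.queues)


# A web crawler task carrying three query keywords (illustrative labels).
task = {"keywords": ["brand 1", "car series A", "car series B"]}
module = MessageQueueModule()
queue_names = module.create_queues(task["keywords"])
print(queue_names)
```

In a real deployment the queues would live in the message queue server rather than in process memory; the sketch only shows the keyword-to-queue mapping.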
The web page collectors 111-113 may be implemented as a cluster of web page collectors deployed in a distributed manner. The web page collectors 111-113 crawl a plurality of web pages related to the query keywords from the internet according to one query keyword in the web crawler task. According to one embodiment of the present invention, when the web crawler task received by the message queue server 130 includes three query keywords, the web page collectors 111 to 113 each select one of the three query keywords, crawl web pages related to the query keywords from the internet according to the selected query keyword, and the query keywords selected by each web page collector are different from each other. After crawling a plurality of web pages related to the query keyword, the web page collectors 111 to 113 send URLs of the plurality of web pages to the message queue server 130.
Message queue server 130 also includes exchanger 131. The exchanger 131 receives the URLs of a plurality of web pages and stores them in the message queue corresponding to the target query keyword. A web page collector and a message queue in the message queue module 132 jointly correspond to one target keyword. The exchanger 131 distributes and stores the received URLs according to the target keyword shared by the web page collector and the message queue.
According to one embodiment of the invention, message queue server 130 accepts a web crawler task sent by a user. The web crawler task comprises three query keywords: "brand 1", "car series A", and "car series B". Message queue server 130 establishes message queues 1-3 in message queue module 132, corresponding to "brand 1", "car series A", and "car series B", respectively. The web page collector 111 selects "car series A" as its crawling topic; the web page collector 112 selects "car series B" as its crawling topic; the web page collector 113 selects "brand 1" as its crawling topic. After crawling the web pages for its query keyword, each of the web page collectors 111 to 113 sends the URLs of the web pages to the exchanger 131 of the message queue server 130. The query keyword corresponding to message queue 2 is "car series A", and the query keyword selected by the web page collector 111 is also "car series A". Therefore, when the exchanger 131 receives the URL of a web page collected by the web page collector 111, it stores the URL in message queue 2. Similarly, exchanger 131 distributes and stores the URLs of web pages crawled by web page collectors 112 and 113.
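The routing performed by the exchanger in this example can be sketched as follows. This is a simplified in-memory model, loosely comparable to routing by key in a message broker; the class `Exchanger`, its `publish` method, the keyword strings, and the `example.com` URLs are hypothetical and do not come from the patent.

```python
from collections import deque


class Exchanger:
    """Hypothetical model of exchanger 131: routes URLs to the queue of their keyword."""

    def __init__(self, keywords):
        # One queue per query keyword, created in the order the keywords arrive.
        self.queues = {keyword: deque() for keyword in keywords}

    def publish(self, keyword, urls):
        """Store URLs in the queue whose keyword matches the collector's keyword."""
        if keyword not in self.queues:
            raise KeyError(f"no message queue for keyword {keyword!r}")
        self.queues[keyword].extend(urls)


exchanger = Exchanger(["brand 1", "car series A", "car series B"])
# The collector that chose "car series A" publishes under that keyword,
# so its URLs land in the second queue.
exchanger.publish("car series A", ["https://example.com/a1", "https://example.com/a2"])
exchanger.publish("brand 1", ["https://example.com/b1"])
print({keyword: list(q) for keyword, q in exchanger.queues.items()})
```

Keying both the publisher and the queue on the same keyword is what lets collectors and consumers be added independently of each other.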
The content grabbers 141 to 143 obtain the URLs of a plurality of web pages related to the query keyword from one message queue in the message queue server 130, download the web pages according to the URLs, grab information from the web pages, and generate a web crawler result. The content grabbers 141-143 may be implemented as a content grabber cluster deployed in a distributed manner. Each of the content grabbers 141 to 143 arbitrarily selects one message queue from the message queue module 132 and acquires from it the URLs of a plurality of web pages related to the query keyword. The message queues selected by the content grabbers are different from each other. Each content grabber downloads the web pages according to their URLs and grabs information from the downloaded web pages according to the query keyword corresponding to the selected message queue.
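A content grabber's loop (take URLs from one queue, download each page, grab keyword-related information, deduplicate) might be sketched as below. Everything here is an assumption for illustration: the functions `build_config` and `grab`, the regex-based notion of "target information", and the injected `fetch` callable, which reads from a small in-memory dictionary so that the sketch runs without network access.

```python
import re
from collections import deque


def build_config(keyword):
    """Hypothetical per-keyword configuration: capture text fragments mentioning the keyword."""
    pattern = re.compile(r"[^.<>]*" + re.escape(keyword) + r"[^.<>]*", re.IGNORECASE)
    return {"keyword": keyword, "pattern": pattern}


def grab(queue, keyword, fetch):
    """Drain the queue, download each page via fetch(url), and return deduplicated fragments."""
    config = build_config(keyword)
    seen = set()
    results = []
    while queue:
        url = queue.popleft()
        page = fetch(url)
        for match in config["pattern"].findall(page):
            item = match.strip()
            if item and item not in seen:  # drop fragments already captured
                seen.add(item)
                results.append(item)
    return results


# Fake pages stand in for real downloads (hypothetical URLs and content).
pages = {
    "https://example.com/a1": "<p>Car series A has a new engine. Unrelated text.</p>",
    "https://example.com/a2": "<p>Car series A has a new engine.</p><p>Price of car series A dropped.</p>",
}
url_queue = deque(pages)
crawler_result = grab(url_queue, "car series A", pages.__getitem__)
print(crawler_result)
```

Passing `fetch` in as a parameter keeps the download step replaceable, for example with an HTTP client in a real deployment.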
Web crawler implementation system 100 also includes URL analyzer 120. The web page collectors 111-113 in the system are all communicatively connected to the message queue server 130 through the URL analyzer 120. The URL analyzer 120 filters and deduplicates the URLs of the web pages sent by the web page collectors 111 to 113 according to preset rules, and sends the filtered and deduplicated URLs to the message queue server 130.
Web crawler implementation system 100 also includes a storage server 150. Storage server 150 may be implemented as a storage server deployed in a distributed manner; the present invention does not limit the specific type or deployment form of storage server 150. The storage server 150 receives and stores the web crawler results sent by the content grabbers so that the user can query and retrieve them.
Web page collectors 111-113, URL analyzer 120, message queue server 130, content grabbers 141-143, and storage server 150 in web crawler implementation system 100 may each be implemented as a computing device. FIG. 2 illustrates a block diagram of a computing device 200 according to an exemplary embodiment of the invention. As shown in FIG. 2, in a basic configuration 202, computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor including, but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. Processor 204 may include one or more levels of cache, such as a first level cache 210 and a second level cache 212, a processor core 214, and registers 216. The example processor core 214 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 218 may be used with the processor 204, or in some implementations, the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory including, but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 206 may include an operating system 220, one or more programs 222, and program data 224. In some implementations, the program 222 may be arranged to execute instructions 223 of the method 300 according to the present invention on an operating system by the one or more processors 204 using the program data 224.
Computing device 200 may also include a storage interface bus 234. Storage interface bus 234 enables communication from storage devices 232 (e.g., removable storage 236 and non-removable storage 238) to base configuration 202 via bus/interface controller 230. At least a portion of the operating system 220, applications 222, and data 224 may be stored on removable storage 236 and/or non-removable storage 238 and loaded into the system memory 206 via the storage interface bus 234 and executed by the one or more processors 204 when the computing device 200 is powered up or the application 222 is to be executed.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to basic configuration 202 via bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 252. The example peripheral interface 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communication via one or more I/O ports 258 with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.). The example communication device 246 may include a network controller 260 that may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
The network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or special purpose network, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In computing device 200 according to the present invention, application 222 includes a plurality of program instructions for the web crawler implementation method 300. These instructions may direct processor 204 to perform some of the steps of method 300, so that the components of computing device 200 crawl web pages by executing the web crawler implementation method 300 of the present invention.
Computing device 200 may be implemented as a server, such as a file server 240, a database server 250, or an application server, or as a device such as a personal digital assistant (PDA), a wireless web-browsing device, an application-specific device, or a hybrid device that includes any of the above functions. It may also be implemented as a personal computer, including desktop and notebook computer configurations. In some embodiments, computing device 200 is configured to perform the web crawler implementation method 300.
FIG. 3 illustrates a flow diagram of a web crawler implementation method 300 according to an exemplary embodiment of the invention. The method 300 is performed in the web crawler implementation system 100. As shown in FIG. 3, the web crawler implementation method 300 begins at step S310, in which the message queue server 130 receives a web crawler task from a user, the web crawler task including a plurality of query keywords. Each query keyword corresponds to a service identifier in an actual service and may be implemented, for example, as a user type or a car series type. According to one embodiment of the invention, the web crawler task that message queue server 130 receives from a user includes three query keywords: "brand 1", "car series A", and "car series B".
Subsequently, step S320 is performed, in which the message queue server 130 establishes a plurality of message queues according to the plurality of query keywords of the web crawler task, each query keyword corresponding to one message queue. According to one embodiment of the invention, message queue server 130 establishes message queues 1-3 in message queue module 132, corresponding to "brand 1", "car series A", and "car series B", respectively.
Subsequently, step S330 is performed, in which the web page collector crawls, according to one query keyword in the web crawler task, a plurality of web pages related to that query keyword from the Internet, and sends the URLs of the web pages to the message queue server 130. According to one embodiment of the invention, the web page collector 111 selects "car series A" as its crawling topic; the web page collector 112 selects "car series B" as its crawling topic; the web page collector 113 selects "brand 1" as its crawling topic.
When the web page collector crawls a plurality of web pages related to the query keyword from the Internet, it constructs a regular expression of the query keyword and matches web pages with the same topic as the query keyword according to the regular expression. Regular expressions are typically used to retrieve text that meets a certain rule. The web page collector uses the regular expression of the query keyword to check whether a web page contains the query keyword; if it does, the web page is treated as a web page with the same topic as the query keyword.
When the web page collector 111 selects "car series A" as its crawling topic, it constructs a regular expression for "car series A" so as to retrieve web pages containing "car series A". Web pages containing "car series A" are crawled in the manner of a focused web crawler. Subsequently, the web page collector 111 sends the URLs of the web pages containing "car series A", as web pages having the same topic as the query keyword, to the message queue server 130.
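A minimal sketch of keyword matching with a regular expression is given below, using Python's standard `re` module. The helper names `build_keyword_regex` and `same_topic`, the case-insensitive matching, and the sample pages are assumptions; the patent does not specify how the regular expression is built.

```python
import re


def build_keyword_regex(keyword):
    """Build a regex that matches the literal keyword (case-insensitive by assumption)."""
    return re.compile(re.escape(keyword), re.IGNORECASE)


def same_topic(page_text, keyword_regex):
    """Treat a page as on-topic if the keyword occurs anywhere in its text."""
    return keyword_regex.search(page_text) is not None


# Hypothetical crawled pages keyed by URL.
crawled = {
    "https://example.com/1": "Review of car series A and its rivals",
    "https://example.com/2": "Weather forecast for the weekend",
    "https://example.com/3": "CAR SERIES A recall notice",
}
regex = build_keyword_regex("car series A")
matching_urls = [url for url, text in crawled.items() if same_topic(text, regex)]
print(matching_urls)
```

`re.escape` keeps characters in the keyword that are special in regular expressions from being interpreted as pattern syntax.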
According to one embodiment of the present invention, when the web page collectors 111 to 113 send the URLs of the crawled web pages to the message queue server 130, the URLs are first sent to the URL analyzer 120. The URL analyzer 120 filters and deduplicates the URLs sent by the web page collectors 111 to 113 according to preset rules. Under the preset rules, the URL analyzer 120 first deduplicates the received URLs to remove duplicate web pages. It then judges whether each remaining web page contains the query keyword a certain number of times, for example whether the web page contains "car series A" five or more times, so as to confirm again that the topic of the web page is "car series A" and to filter the web pages sent by the web page collector 111. The URL analyzer 120 may also determine whether the topic of a web page is "car series A" by means of a semantic analysis algorithm. The invention does not limit the preset rules by which the URL analyzer 120 deduplicates and filters web pages; a developer may define these rules according to actual service requirements. Finally, the URL analyzer 120 sends the filtered and deduplicated URLs to the message queue server 130.
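As a non-limiting sketch of one possible preset rule, the deduplication and keyword-count filter described above might look like the following. The function name `dedupe_and_filter`, the `(url, text)` input shape, the case-insensitive count, and the sample URLs are assumptions introduced for illustration; the description leaves the actual rules to the developer.

```python
def dedupe_and_filter(pages, keyword, min_hits=5):
    """Return URLs in first-seen order, without duplicates, whose page text
    contains `keyword` at least `min_hits` times.

    `pages` is an iterable of (url, page_text) pairs (an assumed shape).
    """
    seen = set()
    kept = []
    needle = keyword.lower()
    for url, text in pages:
        if url in seen:
            # Duplicate URL: already examined, so drop it.
            continue
        seen.add(url)
        # str.count counts non-overlapping occurrences of the keyword.
        if text.lower().count(needle) >= min_hits:
            kept.append(url)
    return kept


# Hypothetical input used only to exercise the sketch.
pages = [
    ("http://example.com/a", "car series A " * 5),
    ("http://example.com/a", "car series A " * 5),  # duplicate URL
    ("http://example.com/b", "car series A " * 2),  # too few occurrences
]
filtered_urls = dedupe_and_filter(pages, "car series A")
print(filtered_urls)
```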
Subsequently, step S340 is performed, in which the message queue server 130 receives the URLs of the plurality of web pages and stores them in the message queue corresponding to the target query keyword. A web page collector and a message queue in the message queue module 132 jointly correspond to one target keyword. The exchanger 131 receives the URLs and distributes and stores them according to the target keyword shared by the sending web page collector and the message queue. According to one embodiment of the present invention, the query keyword corresponding to message queue 2 is "car series A", and the query keyword selected by the web page collector 111 is also "car series A". Therefore, when the exchanger 131 receives the URLs of web pages collected by the web page collector 111, it stores them in message queue 2. Similarly, the exchanger 131 distributes and stores the URLs of web pages crawled by the web page collectors 112 and 113.
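The one-queue-per-keyword arrangement and the routing performed by the exchanger can be illustrated with a minimal in-memory sketch. The class name `Exchanger`, its `publish` method, and the use of `collections.deque` are assumptions for this example and do not describe any particular message queue product.

```python
from collections import deque


class Exchanger:
    """Keeps one FIFO queue per query keyword and routes URLs by keyword."""

    def __init__(self, query_keywords):
        # One message queue is created for each distinct query keyword.
        self.queues = {keyword: deque() for keyword in query_keywords}

    def publish(self, keyword, urls):
        """Store URLs in the queue that corresponds to the collector's keyword."""
        if keyword not in self.queues:
            raise KeyError(f"no message queue for keyword {keyword!r}")
        self.queues[keyword].extend(urls)


# Hypothetical task and URLs used only to exercise the sketch.
exchanger = Exchanger(["brand 1", "car series A", "car series B"])
exchanger.publish("car series A", ["http://example.com/a1", "http://example.com/a2"])
exchanger.publish("brand 1", ["http://example.com/b1"])
print({keyword: list(q) for keyword, q in exchanger.queues.items()})
```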
Finally, step S350 is performed, in which the content grabbers 141 to 143 each obtain the URLs of a plurality of web pages related to a query keyword from one message queue in the message queue server 130, download the web pages according to those URLs, grab information from the web pages, and generate a web crawler result. Each of the content grabbers 141 to 143 selects one message queue from the message queue module 132 in the message queue server 130, establishes a message channel, and obtains the URLs of the web pages related to the target query keyword through that channel. Each content grabber establishes one message channel with one message queue, and each message channel is an independent bidirectional data flow channel; all message channels established between the content grabbers 141 to 143 and the message queue server 130 are multiplexed over one TCP connection.
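The relationship between the single shared connection and the per-queue message channels can be modeled, purely for illustration, with in-memory objects. The classes `Connection` and `Channel`, the method names, and the use of `queue.Queue` are assumptions; no real network connection is opened, and the sketch only shows that several independent channels can be handed out by one connection object.

```python
import queue


class Channel:
    """An independent channel bound to exactly one message queue."""

    def __init__(self, channel_id, message_queue):
        self.channel_id = channel_id
        self._queue = message_queue

    def get_urls(self, limit):
        """Fetch up to `limit` URLs without blocking."""
        urls = []
        while len(urls) < limit:
            try:
                urls.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return urls


class Connection:
    """Stand-in for the one shared connection; it hands out numbered channels."""

    def __init__(self, queues_by_keyword):
        self._queues = queues_by_keyword
        self._next_id = 1
        self.channels = {}

    def open_channel(self, keyword):
        channel = Channel(self._next_id, self._queues[keyword])
        self.channels[channel.channel_id] = channel
        self._next_id += 1
        return channel


# Hypothetical queues and URLs used only to exercise the sketch.
queues_by_keyword = {"car series A": queue.Queue(), "car series B": queue.Queue()}
queues_by_keyword["car series A"].put("http://example.com/a1")
queues_by_keyword["car series A"].put("http://example.com/a2")

connection = Connection(queues_by_keyword)
channel_a = connection.open_channel("car series A")  # used by one content grabber
channel_b = connection.open_channel("car series B")  # used by another
urls_a = channel_a.get_urls(limit=10)
urls_b = channel_b.get_urls(limit=10)
print(urls_a, urls_b)
```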
The web page collectors 111 to 113 and the content grabbers 141 to 143 are deployed in a distributed manner across multiple servers, so that the web crawler tasks received by the message queue server 130 can be split and executed; target web pages can thus be crawled quickly and a large number of heavy web crawler tasks can be completed. The web crawler implementation system 100 uses the message queue server 130 as middleware between the web page collectors 111 to 113 and the content grabbers 141 to 143, which decouples web page crawling from web page content analysis. The content grabbers and the web page collectors therefore do not affect one another at run time, which improves the overall working efficiency of the system.
When the content grabbers 141 to 143 grab information from the plurality of web pages and generate a web crawler result, they first generate a configuration file according to one query keyword of the web crawler task. The configuration file lists the configuration items to be queried under that query keyword; one query keyword may have multiple configuration items. According to one embodiment of the present invention, after receiving the URLs of the web pages on the topic "car series A" stored in message queue 2, the content grabber 141 generates a configuration file for the query keyword "car series A". The configuration file includes configuration items such as the brand name, manufacturer price, and transmitter parameters of "car series A".
Then, the content grabbers 141 to 143 grab target information from each web page according to the configuration file to obtain a target information set for the query keyword. According to one embodiment of the present invention, the content grabbers 141 to 143 may use the XPath language to extract information from the pages when grabbing target information according to the configuration file. For example, the content grabber 141 parses a web page and uses XPath to search for the brand name, manufacturer price, transmitter parameters, and so on of "car series A".
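A minimal sketch of configuration-driven extraction is given below, by way of example only. It assumes that the configuration maps each configuration item to an XPath expression and that the page is well-formed markup; the item names, the XPath expressions, the `class` attribute values, and the sample page are all hypothetical. The sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree` module; real HTML is often not well-formed XML, so a practical system would likely need an HTML-aware parser with fuller XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: configuration item -> XPath expression.
CONFIG = {
    "brand_name": ".//span[@class='brand']",
    "manufacturer_price": ".//span[@class='price']",
}


def grab_target_information(page_markup, config):
    """Return the configuration items that could be found on one page."""
    root = ET.fromstring(page_markup)
    record = {}
    for item, xpath in config.items():
        node = root.find(xpath)  # first matching element, or None
        if node is not None and node.text:
            record[item] = node.text.strip()
    return record


# Hypothetical, well-formed page used only to exercise the sketch.
PAGE = (
    "<html><body>"
    "<span class='brand'>brand 1</span>"
    "<span class='price'>100000</span>"
    "</body></html>"
)
record = grab_target_information(PAGE, CONFIG)
print(record)
```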
Finally, the target information set is deduplicated, and the deduplicated target information set forms the web crawler result for the query keyword. The content grabbers 141 to 143 grab target information from all the web pages crawled by the web page collectors 111 to 113 and then deduplicate the grabbed information. According to one embodiment of the present invention, if the content grabber 141 finds the brand name of "car series A" on both page a and page b, only one entry is retained as the brand name of "car series A"; the target information set comprising the brand name, manufacturer price, and transmitter parameters of "car series A" is then assembled into the web crawler result for "car series A".
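The deduplication step can be sketched as a merge of per-page records in which only one entry per configuration item is retained. Keeping the first value encountered is an assumption made for this example; the description states only that one entry is retained. The function name `merge_records` and the sample records are likewise hypothetical.

```python
def merge_records(records):
    """Merge per-page records, keeping one value per configuration item."""
    result = {}
    for record in records:
        for item, value in record.items():
            # setdefault keeps the first value seen and ignores later repeats.
            result.setdefault(item, value)
    return result


# Hypothetical records from page a and page b.
page_a = {"brand_name": "brand 1"}
page_b = {"brand_name": "brand 1", "manufacturer_price": "100000"}
crawler_result = merge_records([page_a, page_b])
print(crawler_result)
```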
According to one embodiment of the invention, the system further includes a storage server 150 communicatively coupled to the plurality of content grabbers; the content grabbers store the web crawler results in the storage server 150 so that the user can conveniently query and retrieve them. The storage server 150 stores all web crawler results of the web crawler task submitted by the user, and the user can consult them in the storage server 150 or call them from other business modules.
According to one embodiment of the present invention, the web crawler implementation system 100 also supports resuming a web crawler task from a breakpoint. When the system 100, while processing a web crawler task, receives a pause instruction actively initiated by the user or is abnormally interrupted by external causes, processing of the task may be paused. The URL analyzer 120 may store the URLs it has already verified, and the storage server 150 may store the web crawler results that the content grabbers have already generated. When the web crawler implementation system 100 is started again, the URL analyzer 120 determines whether each URL sent by the web page collectors 111 to 113 has already been verified, and if so, skips it and processes the next URL. In this way, the invention preserves the web crawler results already produced, avoids repeated verification of URLs, improves the processing efficiency of web crawler tasks, and realizes resumption from a breakpoint.
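One way to picture resumption from a breakpoint is a store of verified URLs that survives a restart. The sketch below persists the set to a JSON file; the class `VerifiedUrlStore`, the function `verify_new_urls`, the file name, and the no-op `verify` callback are assumptions for illustration, since the description does not say how verified URLs are stored.

```python
import json
import os
import tempfile


class VerifiedUrlStore:
    """Remembers verified URLs across restarts by saving them to a file."""

    def __init__(self, path):
        self.path = path
        self.verified = set()
        if os.path.exists(path):
            # Reload the URLs verified before the interruption.
            with open(path, encoding="utf-8") as handle:
                self.verified = set(json.load(handle))

    def is_verified(self, url):
        return url in self.verified

    def mark_verified(self, url):
        self.verified.add(url)
        # Rewriting the whole file is simple but not efficient for large sets.
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(sorted(self.verified), handle)


def verify_new_urls(urls, store, verify):
    """Verify only URLs not seen before; return the newly verified ones."""
    newly_verified = []
    for url in urls:
        if store.is_verified(url):
            continue  # already handled before the pause, move to the next URL
        verify(url)
        store.mark_verified(url)
        newly_verified.append(url)
    return newly_verified


with tempfile.TemporaryDirectory() as tmp_dir:
    store_path = os.path.join(tmp_dir, "verified_urls.json")
    first_run = verify_new_urls(
        ["http://example.com/a", "http://example.com/b"],
        VerifiedUrlStore(store_path),
        verify=lambda url: None,
    )
    # A new store object simulates the system being started again.
    second_run = verify_new_urls(
        ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
        VerifiedUrlStore(store_path),
        verify=lambda url: None,
    )
print(first_run, second_run)
```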
The web crawler implementation system comprises a plurality of web page collectors and a plurality of content grabbers that can be deployed in a distributed manner, all of which are communicatively connected to a message queue server. Because the web page collectors and content grabbers are distributed across multiple servers, the web crawler tasks received by the message queue server can be split and executed, so that target web pages are crawled quickly; the system is therefore suited to completing a large number of heavy web crawler tasks. In addition, the system can be expanded quickly by deploying additional web page collectors, which improves the web page crawling capability of the web crawler implementation system.
The web crawler implementation system also uses the message queue server as middleware between the web page collectors and the content grabbers, which decouples web page crawling from web page content analysis and improves the efficiency with which the web page collectors and content grabbers work together.
Further, the web crawler implementation system comprises a URL analyzer through which the web page collectors are all communicatively connected to the message queue server. The URL analyzer filters and deduplicates the web pages crawled by the web page collectors and sends the URLs of the filtered and deduplicated web pages to the message queue server, so as to ensure that the web pages collected by the web page collectors are all web pages required for information collection.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
A5. The method of A4, wherein the system further comprises a URL analyzer through which a plurality of web page collectors in the system are communicatively connected to the message queue server, the method further comprising the steps of:
the web page collector sends URLs of a plurality of web pages to the URL analyzer;
the URL analyzer filters and deduplicates the URLs of the web pages sent by the web page collector according to preset rules;
the URL analyzer sends the filtered and deduplicated URLs of the plurality of web pages to the message queue server.
A6. The method of A5, wherein the system further comprises a storage server communicatively coupled to the plurality of content grabbers, the method further comprising:
the content grabber stores the web crawler results in the storage server so that the user can conveniently query and retrieve them.
B11. The system of B10, further comprising a URL analyzer, a plurality of web page collectors in the system being communicatively connected to the message queue server through the URL analyzer, the web page collectors being further adapted to send the URLs of a plurality of web pages to the URL analyzer;
the URL analyzer is adapted to filter and deduplicate the URLs of the plurality of web pages sent by the web page collector according to preset rules, and to send the filtered and deduplicated URLs of the plurality of web pages to the message queue server.
B12. The system of B11, further comprising a storage server communicatively coupled to the plurality of content grabbers, the storage server being adapted to receive and store web crawler results sent by the content grabbers so that the user can conveniently query and retrieve them.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or groups of devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into a plurality of sub-modules.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or groups of embodiments may be combined into one module or unit or group, and furthermore they may be divided into a plurality of sub-modules or sub-units or groups. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as methods or combinations of method elements that may be implemented by a processor of a computer system or by other means of performing the functions. Thus, a processor with the necessary instructions for implementing the described method or method element forms a means for implementing the method or method element. Furthermore, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is for carrying out the functions performed by the elements for carrying out the objects of the invention.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions of the methods and apparatus of the present invention, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code; the processor is configured to execute the web crawler implementation method of the present invention in accordance with instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media comprise computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
As used herein, unless otherwise specified the use of the ordinal terms "first," "second," "third," etc., to describe a general object merely denote different instances of like objects, and are not intended to imply that the objects so described must have a given order, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the above description, will appreciate that other embodiments are contemplated within the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is defined by the appended claims.

Claims (14)

1. A web crawler implementation method adapted to be executed in a web crawler implementation system, the system comprising a message queue server, and a plurality of web page collectors and a plurality of content grabbers communicatively connected to the message queue server, the method comprising the steps of:
the message queue server receives a web crawler task from a user, wherein the web crawler task comprises a plurality of query keywords;
the message queue server establishes a plurality of message queues according to a plurality of query keywords of the web crawler task, wherein each query keyword corresponds to one message queue;
the webpage collector crawls a plurality of webpages related to the query keywords from the Internet according to one query keyword in the web crawler task, and sends the URLs of the webpages to the message queue server, wherein each webpage collector corresponds to one query keyword;
the message queue server receives the URLs of a plurality of webpages and stores the URLs of the webpages into a message queue corresponding to the target query keyword;
the content grabber acquires URLs of a plurality of webpages related to the query keyword from one message queue in the message queue server, downloads the webpages according to the URLs of the webpages to acquire information from the webpages, and generates a web crawler result, wherein each content grabber corresponds to one message queue.
2. The method of claim 1, wherein crawling a plurality of web pages related to the query keyword from the internet by the web page collector comprises the steps of:
the webpage collector builds a regular expression of the query keyword;
and matching the web pages with the same theme as the query keywords according to the regular expression.
3. The method of claim 2, wherein the content grabber obtaining URLs of a plurality of web pages related to query keywords from one of the message queues in the message queue server comprises the steps of:
and the content grabber establishes a message channel with one message queue in the message queue server, and acquires the URLs of a plurality of webpages related to the target query keyword through the message channel.
4. The method of claim 3, wherein the content grabber grabbing information from the plurality of web pages and generating web crawler results comprises the steps of:
the content grabber generates a configuration file according to one query keyword in the web crawler task;
capturing target information for each webpage according to the configuration file to obtain a target information set of query keywords;
and de-duplicating the target information set, and forming the de-duplicated target information set into a web crawler result of the query keyword.
5. The method of claim 4, wherein the system further comprises a URL analyzer through which a plurality of web page collectors in the system are communicatively connected to the message queue server, the method further comprising the steps of:
the web page collector sends URLs of a plurality of web pages to the URL analyzer;
the URL analyzer filters and deduplicates the URLs of the web pages sent by the web page collector according to preset rules;
the URL analyzer sends the filtered and deduplicated URLs of the plurality of web pages to the message queue server.
6. The method of claim 5, wherein the system further comprises a storage server communicatively coupled to the plurality of content grabbers, the method further comprising:
the content grabber stores the web crawler results in the storage server so that the user can conveniently query and retrieve them.
7. A web crawler implementation system, the system comprising a message queue server, and a plurality of web page collectors and a plurality of content grabbers communicatively connected with the message queue server, wherein the message queue server is adapted to receive a web crawler task from a user, the web crawler task comprising a plurality of query keywords, and to establish a plurality of message queues according to the plurality of query keywords of the web crawler task, each query keyword corresponding to one message queue;
the webpage collector is suitable for crawling a plurality of webpages related to the query keywords from the Internet according to one query keyword in the web crawler task, and sending the URLs of the webpages to the message queue server, wherein each webpage collector corresponds to one query keyword;
the message queue server further comprises an exchanger, wherein the exchanger is adapted to receive the URLs of the web pages and store the URLs of the web pages in a message queue corresponding to the target query keyword;
the content grabber is adapted to obtain URLs of a plurality of web pages related to the query keyword from one message queue in the message queue server, download the plurality of web pages according to the URLs, grab information from the web pages, and generate a web crawler result, wherein each content grabber corresponds to one message queue.
8. The system of claim 7, wherein the web page collector is further adapted to:
constructing a regular expression of the query keyword;
and matching the web pages with the same theme as the query keywords according to the regular expression.
9. The system of claim 8, wherein the content grabber is further adapted to:
and establishing a message channel with one message queue in the message queue server, and acquiring the URLs of a plurality of webpages related to the target query keyword through the message channel.
10. The system of claim 9, wherein the content grabber is further adapted to:
generating a configuration file according to one query keyword in the web crawler task;
capturing target information for each webpage according to the configuration file to obtain a target information set of query keywords;
and de-duplicating the target information set, and forming the de-duplicated target information set into a web crawler result of the query keyword.
11. The system of claim 10, further comprising a URL analyzer, a plurality of web page collectors in the system being communicatively coupled to the message queue server through the URL analyzer, the web page collectors being further adapted to send URLs of a plurality of web pages to the URL analyzer;
the URL analyzer is adapted to filter and deduplicate the URLs of the plurality of web pages sent by the web page collector according to preset rules, and to send the filtered and deduplicated URLs of the plurality of web pages to the message queue server.
12. The system of claim 11, further comprising a storage server communicatively coupled to the plurality of content grabbers, the storage server being adapted to receive and store web crawler results sent by the content grabbers so that the user can conveniently query and retrieve them.
13. A computing device, comprising:
one or more processors;
a memory; and
one or more programs comprising instructions for performing any of the methods of claims 1-6.
14. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-6.
CN202110383241.9A 2021-04-09 2021-04-09 Method, system, computing device and storage medium for realizing web crawler Active CN113239253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110383241.9A CN113239253B (en) 2021-04-09 2021-04-09 Method, system, computing device and storage medium for realizing web crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110383241.9A CN113239253B (en) 2021-04-09 2021-04-09 Method, system, computing device and storage medium for realizing web crawler

Publications (2)

Publication Number Publication Date
CN113239253A CN113239253A (en) 2021-08-10
CN113239253B true CN113239253B (en) 2024-02-23

Family

ID=77127932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110383241.9A Active CN113239253B (en) 2021-04-09 2021-04-09 Method, system, computing device and storage medium for realizing web crawler

Country Status (1)

Country Link
CN (1) CN113239253B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886661A (en) * 2021-12-06 2022-01-04 北京并行科技股份有限公司 Information acquisition method and device and computing equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314463A (en) * 2010-07-07 2012-01-11 北京瑞信在线系统技术有限公司 Distributed crawler system and webpage data extraction method for the same
CN106649362A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Webpage crawling method and apparatus
CN106709052A (en) * 2017-01-06 2017-05-24 电子科技大学 Keyword based topic-focused web crawler design method
CN111814024A (en) * 2020-08-14 2020-10-23 北京斗米优聘科技发展有限公司 Distributed data acquisition method, system and storage medium

Also Published As

Publication number Publication date
CN113239253A (en) 2021-08-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant