CN106339385A - System for crawling webpages, method for distributing webpage crawling nodes and method for crawling webpages - Google Patents

System for crawling webpages, method for distributing webpage crawling nodes and method for crawling webpages

Publication number
CN106339385A
Authority
CN
China
Prior art keywords
webpage
node
crawl
webpage capture
capture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510397674.4A
Other languages
Chinese (zh)
Other versions
CN106339385B (en)
Inventor
苗欣
韩陆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Singapore Holdings Pte Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201510397674.4A
Publication of CN106339385A
Application granted
Publication of CN106339385B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a system for crawling webpages, a method and device for allocating webpage crawling nodes, a method and device for crawling webpages, and two types of electronic devices. The system for crawling webpages comprises at least one master control node, at least one webpage crawling node, and a communication network that connects the master control node and the webpage crawling node. The master control node receives requests to obtain a webpage crawling node, selects and allocates, according to a set rule, webpage crawling nodes for the different requests from the list of available webpage crawling nodes that it manages and maintains, and outputs the information of the selected webpage crawling node to the requester. The webpage crawling node receives a request to crawl a webpage, performs the crawling operation, and outputs the crawled webpage. The system, methods, devices, and electronic devices can crawl large numbers of webpages efficiently and in a timely manner.

Description

System for crawling webpages, method for allocating webpage crawling nodes, and method for crawling webpages
Technical field
The application relates to a system for crawling webpages. The application also relates to a method and device for allocating webpage crawling nodes, and to a method and device for crawling webpages. The application further relates to two types of electronic devices.
Background
With the rapid development of the Internet, the Internet has become the carrier of a vast amount of information. To make use of the resources on the Internet, it is sometimes necessary to access and download large numbers of webpages at the same time; accessing and downloading a webpage is also called crawling the webpage. The existing systems for crawling webpage resources on a large scale are web crawler systems. A web crawler system crawls webpages automatically and on a large scale: it starts from the URLs of one or several initial webpages, obtains the URLs found on those pages, puts them into a queue of webpages to be crawled, and then crawls the webpages in the queue one by one. While crawling, it keeps extracting new URLs from the current webpage and adding them to the queue until some stop condition of the system is met. In addition, all webpages crawled by the web crawler system are stored by the system, subjected to some analysis and filtering, and indexed for later query and retrieval. The basic structure of an existing web crawler system is shown in Fig. 1. As can be seen, the existing method of crawling webpages automatically on a large scale focuses on continuously crawling Internet resources, and the crawled webpage content must be analyzed and processed before users can access it through a specific interface, so users cannot obtain the content in real time. When a specified webpage must be crawled in real time, the analysis, filtering, indexing, and other processing that an existing web crawler system applies to crawled webpages can make the crawl take too long or even fail to return the webpage content.
Moreover, a single computer that crawls webpages in real time is limited by its performance, for example the processing capacity of its processor, the capacity of its interfaces, or its storage capacity, and therefore cannot sustain large-scale concurrent webpage crawling operations.
In summary, a mature system and method for crawling webpages concurrently on a large scale is still lacking.
Summary of the invention
The application provides a system for crawling webpages, to solve the problem that the existing method of crawling webpages automatically on a large scale takes too long or even fails to return the webpage content. The application also provides a method and device for allocating webpage crawling nodes, a method and device for crawling webpages, and two related electronic devices.
The application provides a system for crawling webpages, comprising at least one master control node, at least one webpage crawling node, and a communication network, the master control node and the webpage crawling node being connected through the communication network, characterized in that:
the master control node receives requests to obtain a webpage crawling node, selects and allocates, according to a set rule, webpage crawling nodes for the different requests from the list of available webpage crawling nodes that it manages and maintains, and outputs the information of the selected webpage crawling node to the requester;
the webpage crawling node receives a request to crawl a webpage, performs the webpage crawling operation, and outputs the crawled webpage.
Optionally, the system further comprises a cache device for receiving and storing the source code of webpages crawled by the webpage crawling nodes, for access by each webpage crawling node; the cache device also stores the time at which the network address of each crawling node most recently accessed each network host.
Optionally, the webpage source code stored in the cache device is stored in association with its URL.
Optionally, the webpage source code stored in the cache device is deleted once it has been stored for longer than a set duration threshold.
Optionally, the webpage crawling node sends its own identification information to the master control node at a set time interval; after receiving the identification information, the master control node uses it to determine whether the webpage crawling node is already in the list of available webpage crawling nodes, and if not, records the webpage crawling node in that list.
Optionally, the master control node sends probe messages at a set time interval to all webpage crawling nodes in the list of available webpage crawling nodes and receives the responses from those nodes; for a webpage crawling node that does not respond, the master control node deletes the node's record from the list of available webpage crawling nodes that it manages and maintains.
Optionally, the identification information includes the network address where the webpage crawling node is located and its process port number.
The application provides a method for allocating webpage crawling nodes, in which the master control node in a system for crawling webpages that includes at least one master control node and at least one webpage crawling node performs the following steps:
receiving a request to obtain a webpage crawling node;
according to a set rule, selecting and allocating corresponding webpage crawling nodes for the different requests from the list of available webpage crawling nodes that it manages and maintains;
returning the address information of the selected webpage crawling node to the requester.
Optionally, the master control node receives the identification information that a webpage crawling node sends about itself at a set time interval, uses the identification information to determine whether the webpage crawling node is already in the list of available webpage crawling nodes that the master control node manages and maintains, and if not, records the webpage crawling node in that list.
Optionally, the master control node sends probe messages at a set time interval to all webpage crawling nodes in the list of available webpage crawling nodes and receives the responses from those nodes; for a webpage crawling node that does not respond, the master control node deletes the node's record from the list of available webpage crawling nodes that it manages and maintains.
The application provides a method for crawling webpages, applied in a system for crawling webpages that includes at least one webpage crawling node, the method comprising:
the webpage crawling node receives a command to crawl a specified webpage;
according to the URL contained in the command, the webpage crawling node crawls the source code of the specified webpage;
the obtained source code is returned to the requester of the command.
Optionally, the system for crawling webpages further includes a cache device, and the step of crawling the source code of the specified webpage according to the URL contained in the command comprises:
accessing the cache device according to the URL, and determining whether the cache device stores the source code of the webpage identified by the URL;
if so, reading the source code stored in the cache device as the source code of the specified webpage;
if not, accessing the webpage over the network according to the URL and obtaining the source code of the webpage.
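The cache-first steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the cache device is modelled as an in-memory dictionary keyed by URL, and the names `crawl_source`, `fake_fetch`, and `shared_cache` are hypothetical. The network access is injected as a function so the sketch runs without a network.

```python
from typing import Callable, Dict, List


def crawl_source(url: str,
                 cache: Dict[str, str],
                 fetch: Callable[[str], str]) -> str:
    """Return the source code of the page at `url`, preferring the cache."""
    # Access the cache device by URL and check for stored source code.
    if url in cache:
        # Hit: the stored source code is used as the crawled result.
        return cache[url]
    # Miss: access the webpage over the network (delegated to `fetch`).
    source = fetch(url)
    # Store the source code under its URL so other nodes can reuse it.
    cache[url] = source
    return source


# Demo with a stand-in fetcher that records every network access.
calls: List[str] = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>{url}</html>"


shared_cache: Dict[str, str] = {}
first = crawl_source("http://example.com/a", shared_cache, fake_fetch)
second = crawl_source("http://example.com/a", shared_cache, fake_fetch)
```

In the demo the second call is served from the cache, so the stand-in fetcher is invoked only once.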
Optionally, after the step of accessing the webpage over the network according to the URL and obtaining its source code, the following step is performed:
extracting the host identifier in the URL, the network address where the current webpage crawling node is located, and the time at which the webpage source code was accessed, forming them into a crawl feature record, and recording it in the cache device; by storing such crawl feature records, the cache device stores the time at which the network address of each crawling node most recently accessed each network host.
Optionally, before the step of accessing the webpage over the network according to the URL and obtaining its source code, the following steps are performed:
using the crawl feature records in the cache device, querying the time at which the network address where this webpage crawling node is located most recently accessed the host in the requested URL;
determining whether the difference between the current time and that most recent access time is greater than a set access interval threshold;
if so, performing the step of accessing the webpage over the network according to the URL and obtaining its source code;
if not, waiting for a set time and then returning to the step of determining whether the difference between the current time and the most recent access time is greater than the set access interval threshold.
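The check-and-wait loop described in these steps might look like the sketch below. It assumes the crawl feature records are available as a dictionary from a (host, node address) pair to a timestamp; the function name `wait_until_allowed` and its parameters are hypothetical, and the clock and sleep functions are injectable so the demo finishes instantly.

```python
import time
from typing import Callable, Dict, Tuple


def wait_until_allowed(key: Tuple[str, str],
                       last_access: Dict[Tuple[str, str], float],
                       min_interval: float,
                       wait: float,
                       now: Callable[[], float] = time.time,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Block until the host in `key` may be accessed; return how often we waited."""
    waits = 0
    while True:
        last = last_access.get(key)
        # No record means this node address has not accessed the host yet.
        if last is None or now() - last > min_interval:
            return waits
        # Too soon: wait the set time, then repeat the comparison.
        sleep(wait)
        waits += 1


# Demo with a fake clock: the last access was 1 second before "now".
clock = [100.0]
records = {("example.com", "10.0.0.1"): 99.0}
waits = wait_until_allowed(
    ("example.com", "10.0.0.1"), records, min_interval=5.0, wait=2.0,
    now=lambda: clock[0],
    sleep=lambda seconds: clock.__setitem__(0, clock[0] + seconds),
)
```

The comparison is strict ("greater than"), matching the wording of the step, so an elapsed time exactly equal to the threshold still causes one more wait.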
Optionally, the crawl feature record includes a query key and a queried value. The query key is composed of the host identifier in the URL and the network address where the current webpage crawling node is located; the queried value is the time at which the host was accessed. Using the crawl feature records in the cache device to query the time at which the current webpage crawling node most recently accessed the host of the requested URL means composing a query key from the network address where the current webpage crawling node is located and the host identifier in the requested URL, and looking up the queried value in the crawl feature records by that key.
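As an illustration of the key and value just described, the sketch below builds the query key from a URL and a node address and uses it to write and read a record. The tuple layout, the dictionary store, and the names `make_key`, `record_access`, and `last_access_time` are assumptions made for this example; the host identifier is taken with Python's standard `urllib.parse.urlsplit`.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit

# (host identifier, network address of the crawling node)
FeatureKey = Tuple[str, str]


def make_key(url: str, node_address: str) -> FeatureKey:
    """Compose the query key from the URL's host and the node's address."""
    host = urlsplit(url).hostname  # domain name or IP address, lowercased
    if host is None:
        raise ValueError(f"URL has no host part: {url!r}")
    return (host, node_address)


def record_access(records: Dict[FeatureKey, float],
                  url: str, node_address: str, when: float) -> None:
    """Store the access time as the queried value for this key."""
    records[make_key(url, node_address)] = when


def last_access_time(records: Dict[FeatureKey, float],
                     url: str, node_address: str) -> Optional[float]:
    """Look up the most recent access time, or None if there is no record."""
    return records.get(make_key(url, node_address))


# Demo: two URLs on the same host share one record per node address.
feature_records: Dict[FeatureKey, float] = {}
record_access(feature_records, "http://Example.com:8080/index.html", "10.0.0.1", 1000.0)
same_host = last_access_time(feature_records, "http://example.com/other", "10.0.0.1")
other_node = last_access_time(feature_records, "http://example.com/other", "10.0.0.2")
```

Because the key includes the node's address, a different crawling node has its own access history for the same host.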
Optionally, the host identifier includes the Internet Protocol (IP) address or the domain name of the host.
Optionally, the system for crawling webpages further includes a master control node, and the webpage crawling node sends its own identification information to the master control node at a set time interval.
Optionally, the system for crawling webpages further includes a master control node, and the webpage crawling node receives the probe messages sent by the master control node and responds to them.
The application provides a device for allocating webpage crawling nodes, comprising:
a receiving unit, for receiving a request to obtain a webpage crawling node;
an allocation unit, for selecting and allocating, according to a set rule, corresponding webpage crawling nodes for the different requests from the list of available webpage crawling nodes that is managed and maintained;
a return unit, for returning the address information of the selected webpage crawling node to the requester.
The application provides a device for crawling webpages, comprising:
a command receiving unit, for receiving a command to crawl a specified webpage;
a crawling unit, for crawling the source code of the specified webpage according to the URL contained in the command;
a webpage return unit, for returning the obtained source code to the requester of the command.
The application provides a method for crawling webpages, used with a system for crawling webpages that includes a master control node and webpage crawling nodes, in which the master control node manages each webpage crawling node, characterized by comprising the following steps:
sending a request to obtain a webpage crawling node to the master control node;
receiving the address information of the webpage crawling node returned by the master control node;
according to the address information, sending a request to crawl a webpage to the webpage crawling node, the request including at least the URL of the specified webpage;
receiving the source code of the specified webpage crawled by the webpage crawling node.
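Seen from the requester's side, the four steps above reduce to two round trips. The sketch below shows that flow under stated assumptions: the text does not specify a wire protocol, so both round trips are passed in as plain functions, and the names `crawl_via_master`, `request_node`, and `send_crawl_request` as well as the address format are hypothetical.

```python
from typing import Callable, Dict


def crawl_via_master(request_node: Callable[[], str],
                     send_crawl_request: Callable[[str, str], str],
                     url: str) -> str:
    """Ask the master control node for a crawling node, then crawl `url` through it."""
    # Send the request to the master control node and receive the
    # address information of the allocated crawling node.
    node_address = request_node()
    # Send the crawl request, which carries at least the URL, to that
    # node and receive the page source code it returns.
    return send_crawl_request(node_address, url)


# Demo with in-process stand-ins for the master control node and one crawling node.
nodes: Dict[str, Callable[[str], str]] = {
    "10.0.0.2:9000": lambda page_url: f"<html>{page_url}</html>",
}
source = crawl_via_master(
    request_node=lambda: "10.0.0.2:9000",
    send_crawl_request=lambda address, page_url: nodes[address](page_url),
    url="http://example.com/",
)
```

The requester never needs to know how many crawling nodes exist; it only uses the address that the master control node returns.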
The application provides a device for crawling webpages, used with a system for crawling webpages that includes a master control node and webpage crawling nodes, in which the master control node manages each webpage crawling node, characterized by comprising:
a webpage crawling node request unit, for sending a request to obtain a webpage crawling node to the master control node;
a webpage crawling node address information acquiring unit, for receiving the address information of the webpage crawling node returned by the master control node;
a crawl request sending unit, for receiving the address information of the webpage crawling node sent by the address information acquiring unit and, according to that address information, sending a request to crawl a webpage to the webpage crawling node, the request including at least the URL of the specified webpage;
a source code receiving unit, for receiving the source code of the specified webpage that the webpage crawling node crawled using the method of claims 9-14.
The application provides an electronic device, characterized in that the electronic device includes an input device, an output device, a processor, and a memory; the memory stores a software program which, when started, allocates webpage crawling nodes according to the following method:
receiving a request to obtain a webpage crawling node;
according to a set rule, selecting and allocating corresponding webpage crawling nodes for the different requests from the list of available webpage crawling nodes that is managed and maintained;
returning the address information of the selected webpage crawling node to the requester.
The application provides an electronic device, characterized in that the electronic device includes an input device, an output device, a processor, and a memory; the memory stores a software program which, when started, crawls webpages according to the following method:
receiving a command to crawl a specified webpage;
according to the URL contained in the command, crawling the source code of the specified webpage;
returning the obtained source code to the requester of the command.
Compared with the prior art, the application has the following advantages:
The technical solution provided by the application performs the crawling operation in real time and returns the crawled webpage directly, without further processing. This shortens the response time to crawl requests and ensures that the requested webpage is returned successfully, in real time, and quickly.
In addition, because the master control node selects and allocates the webpage crawling nodes, the technical solution of the application, compared with crawling webpages in real time on a single computer, can distribute a large number of crawling operations arriving within a short time across different webpage crawling nodes. This overcomes the problem that a single computer, limited by its performance, cannot sustain large-scale concurrent crawling operations. Webpage crawling nodes can be added or removed flexibly without affecting crawling, which makes it easy to expand or change the number of nodes and helps improve the success rate and ensure the real-time behavior of crawling operations.
In a preferred mode of the application, a cache device stores the crawled webpages, which saves the time and resources of network access. In a further preferred embodiment, the webpage source code stored in the cache device is deleted after a set duration threshold is exceeded, which ensures that the information obtained from the cache is kept up to date.
In another preferred mode, the cache device stores the time at which the network address of each crawling node most recently accessed each network host, which makes it possible to avoid triggering the mechanisms that hosts on the Internet use to prevent frequent access.
Brief description of the drawings
Fig. 1 is a schematic diagram of the basic structure of a prior-art web crawler system;
Fig. 2 is a structural block diagram of the system for crawling webpages provided by the first embodiment of the application;
Fig. 3 is a flow chart of the method for allocating webpage crawling nodes provided by the second embodiment of the application;
Fig. 4 is a structural block diagram of the device for allocating webpage crawling nodes provided by the third embodiment of the application;
Fig. 5 is a flow chart of the method for crawling webpages provided by the fourth embodiment of the application;
Fig. 6 is a structural block diagram of the device for crawling webpages provided by the fifth embodiment of the application;
Fig. 7 is a flow chart of the method for crawling webpages provided by the sixth embodiment of the application;
Fig. 8 is a structural block diagram of the device for crawling webpages provided by the seventh embodiment of the application.
Detailed description
Many specific details are set forth in the following description to provide a thorough understanding of the application. However, the application can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its substance; the application is therefore not limited by the specific implementations disclosed below.
The master control node and the webpage crawling node described herein can each be an independent device such as a computer, or they can be different software programs or software processes on the same device.
The first embodiment of the application provides a system for crawling webpages; its structural block diagram is shown in Fig. 2.
Referring to Fig. 2, the system for crawling webpages of this embodiment includes a master control node n101, a first webpage crawling node n102, a second webpage crawling node n103, and a communication network n104.
The master control node n101, the first webpage crawling node n102, and the second webpage crawling node n103 are connected through the communication network n104.
The master control node n101 manages and maintains a list of available webpage crawling nodes. After receiving a request to obtain a webpage crawling node, the master control node n101 selects and allocates a webpage crawling node from that list according to a set rule.
For example, suppose the list of available webpage crawling nodes currently holds the information of two nodes, with the first webpage crawling node n102 in the first position and the second webpage crawling node n103 in the second position. One possible set rule is to follow the order of the nodes in the list and select the node in the first position each time. Since the first webpage crawling node n102 is in the first position, it is selected, and its address information is output to the requester that asked for a webpage crawling node.
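The first-position rule in this example can be written in a few lines. The sketch below is illustrative only: the `NodeInfo` record, its fields, and the addresses are hypothetical, and other rules could be substituted for `select_first`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeInfo:
    name: str      # label used in the figure, e.g. "n102"
    address: str   # network address where the node is located
    port: int      # process port number of the node


def select_first(available: List[NodeInfo]) -> NodeInfo:
    """Apply the example rule: pick the node in the first list position."""
    if not available:
        raise LookupError("no webpage crawling node is available")
    return available[0]


# Demo: n102 is first in the list, n103 second.
available_nodes = [
    NodeInfo("n102", "10.0.0.2", 9000),
    NodeInfo("n103", "10.0.0.3", 9000),
]
chosen = select_first(available_nodes)
```

What the master control node hands back to the requester is the address information of `chosen`, not the node itself.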
After receiving a request to crawl a webpage, the first webpage crawling node n102 performs the crawling operation and outputs the crawled webpage.
To keep the list of available webpage crawling nodes managed and maintained by the master control node n101 up to date, the master control node n101 and the webpage crawling nodes n102 and n103 preferably verify each other periodically over the communication network using a periodic two-way heartbeat mechanism.
The periodic two-way heartbeat mechanism includes: each webpage crawling node periodically reporting to the master control node that it is working normally; and the master control node periodically and actively sending verification messages to the webpage crawling nodes it has recorded.
In this embodiment, the webpage crawling nodes periodically report that they are working normally as follows. After starting, the first webpage crawling node n102 and the second webpage crawling node n103 each periodically send their own identification information, for example the Internet address where the node is located and its process port number, to the master control node n101 at a set time interval. After receiving the identification information, the master control node n101 uses it to determine whether n102 and n103 are already in the list of available webpage crawling nodes. If the first webpage crawling node n102 is already in the list and the second webpage crawling node n103 is not, then n103 is recorded in the list. This mechanism lets the master control node learn of every usable webpage crawling node and avoids omissions.
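On the master control node's side, handling one such report amounts to an "add if absent" operation on the list. A minimal sketch, assuming the identification information is an (address, port) pair and the list is a plain Python list; `on_heartbeat` and the sample addresses are hypothetical names for illustration.

```python
from typing import List, Tuple

# Identification information: (network address, process port number)
NodeId = Tuple[str, int]


def on_heartbeat(available: List[NodeId], node_id: NodeId) -> bool:
    """Record the reporting node if it is not yet listed; return True if added."""
    if node_id in available:
        return False          # already in the list of available nodes
    available.append(node_id)  # new node: record it in the list
    return True


# Demo: n102 is already listed, n103 reports for the first time.
n102: NodeId = ("10.0.0.2", 9000)
n103: NodeId = ("10.0.0.3", 9000)
available_list: List[NodeId] = [n102]
added_n102 = on_heartbeat(available_list, n102)
added_n103 = on_heartbeat(available_list, n103)
```

Repeated reports from the same node leave the list unchanged, so nodes can send them at every interval without creating duplicates.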
In this embodiment, the master control node periodically and actively sends verification messages to the webpage crawling nodes it has recorded as follows. At a set time interval, the master control node n101 periodically sends probe messages to all webpage crawling nodes in the list of available webpage crawling nodes that it manages and maintains. For example, if the list holds the first webpage crawling node n102 and the second webpage crawling node n103, the master control node n101 periodically sends a probe message to each of the two nodes. If n102 responds to the probe message and n103 does not, the record of the unresponsive node n103 is deleted from the list of available webpage crawling nodes. This mechanism prevents the master control node from handing out webpage crawling nodes that are unavailable.
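One round of this probing could be sketched as below. The probe itself is abstracted into a `responds` callback because the text does not specify the message format or timeout; `probe_nodes` and the sample node identifiers are hypothetical.

```python
from typing import Callable, List, Tuple

NodeId = Tuple[str, int]  # (network address, process port number)


def probe_nodes(available: List[NodeId],
                responds: Callable[[NodeId], bool]) -> List[NodeId]:
    """Probe every listed node once and delete the ones that do not respond."""
    # Each node is probed exactly once per round.
    removed = [node for node in available if not responds(node)]
    # Update the list in place so other holders of the list see the change.
    available[:] = [node for node in available if node not in removed]
    return removed


# Demo: n102 answers the probe, n103 does not.
node_n102: NodeId = ("10.0.0.2", 9000)
node_n103: NodeId = ("10.0.0.3", 9000)
listed: List[NodeId] = [node_n102, node_n103]
alive = {node_n102}
removed_nodes = probe_nodes(listed, lambda node: node in alive)
```

A removed node is not lost for good: once it sends its identification information again, the registration half of the heartbeat can put it back in the list.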
In another preferred mode, the system for crawling webpages can also include the cache device n105 shown with dashed lines in the figure. The cache device is connected to the master control node n101, the first webpage crawling node n102, and the second webpage crawling node n103 through the communication network n104. The cache device n105 receives and stores the webpage source code crawled by the webpage crawling nodes, for access by each webpage crawling node. The stored webpage source code is identified by its URL, and once stored in the cache device n105, it is deleted after a set duration threshold is exceeded. The cache device also stores the time at which the network address of each crawling node most recently accessed each network host.
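A cache with these two properties, URL-keyed storage and deletion after a duration threshold, can be approximated as follows. This is a sketch under assumptions: the text does not say when expired entries are removed, so this version deletes them lazily on lookup, and the class name `SourceCache` and its methods are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SourceCache:
    """In-memory stand-in for the cache device: URL -> page source with expiry."""

    def __init__(self, max_age: float, now: Callable[[], float] = time.time):
        self._max_age = max_age  # the set duration threshold, in seconds
        self._now = now
        self._entries: Dict[str, Tuple[float, str]] = {}

    def put(self, url: str, source: str) -> None:
        # Store the source code under its URL together with the storage time.
        self._entries[url] = (self._now(), source)

    def get(self, url: str) -> Optional[str]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, source = entry
        if self._now() - stored_at > self._max_age:
            # Older than the threshold: delete the entry and report a miss.
            del self._entries[url]
            return None
        return source


# Demo with a fake clock and a 60-second threshold.
fake_time = [0.0]
page_cache = SourceCache(max_age=60.0, now=lambda: fake_time[0])
page_cache.put("http://example.com/a", "<html>a</html>")
fake_time[0] = 30.0
fresh = page_cache.get("http://example.com/a")
fake_time[0] = 61.0
stale = page_cache.get("http://example.com/a")
```

A shorter threshold keeps returned pages closer to the live site at the cost of more network accesses.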
The system of the crawl webpage in the present embodiment only contains a main controlled node and two webpage capture nodes, First webpage capture node n102 and the second webpage capture node n103, can be according to grabbing in the middle of practical application Take the quantity flexible configuration main controlled node of webpage and the quantity of webpage capture node, and different main controlled nodes, Different webpage capture nodes or even main controlled node and webpage capture node can be arranged on same equipment as calculated In machine.Transmit letter due to employing periodically two-way heartbeat mechanism between main controlled node and webpage capture node Breath, the system of this crawl webpage can neatly increase or decrease master control in the case of not affecting normal work The quantity of node and webpage capture node is to adapt to the needs of practical application.Under preferred mode, due to adopting Temporarily store, with buffer memory device, the web page source code grabbing and store the network ground that each crawl node is located Location accesses the temporal information of the last time of heterogeneous networks main frame, can avoid the mistake to webpage place main frame The real-time of the webpage that guarantee grabs while frequent visit.
The second embodiment of the application provides a method for distributing webpage capture nodes, whose flowchart is shown in Fig. 3. The method can be implemented by the master control node in a system for crawling web pages that comprises at least one master control node and at least one webpage capture node. The method is described below with reference to Fig. 3.
Step s201: receive a request to obtain a webpage capture node.
The master control node receives a request to obtain a webpage capture node.
Step s202: according to a set rule, select and allocate a webpage capture node for each request from the list of available webpage capture nodes that the master control node manages and maintains.
The master control node applies the set rule to the list of available webpage capture nodes that it manages and maintains. For example, following the order of the nodes in the list, it selects the webpage capture node at the first position of the list and allocates it as the node that will perform the web page capture.
Step s203: return the address information of the selected webpage capture node to the requester.
The master control node returns the address information of the selected webpage capture node to the requester.
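Steps s201 to s203 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `Allocator`, the method `handle_request` and the returned dictionary layout are hypothetical. The text only says that the node at the first position of the list is selected; moving that node to the end of the list afterwards, so that the next request receives a different node, is an assumption of this sketch.

```python
from collections import deque


class Allocator:
    """Hypothetical master-side allocator over the available-node list."""

    def __init__(self, nodes):
        # Each node is a (network address, port) pair, kept in list order.
        self._available = deque(nodes)

    def handle_request(self):
        # s201: a request to obtain a webpage capture node has arrived.
        if not self._available:
            return None  # no node can be allocated
        # s202: set rule from the text, take the node at the first position.
        host, port = self._available[0]
        # Assumption (not in the text): rotate the list so that different
        # requests are served by different nodes.
        self._available.rotate(-1)
        # s203: return the address information of the selected node.
        return {"host": host, "port": port}
```

With two nodes in the list, successive requests in this sketch alternate between them.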
The above embodiment describes a method for distributing webpage capture nodes. Correspondingly, the third embodiment of the application provides a device for distributing webpage capture nodes, whose structural block diagram is shown in Fig. 4. The device of this embodiment includes a receiving unit u301, an allocation unit u302 and a returning unit u303.
The receiving unit u301 is configured to receive a request to obtain a webpage capture node.
After receiving the request to obtain a webpage capture node, this unit sends a start signal to the allocation unit u302, instructing it to begin operation.
The allocation unit u302 is configured to select and allocate, according to a set rule, a webpage capture node for each request from the managed and maintained list of available webpage capture nodes.
After receiving the start signal sent by the receiving unit u301, this unit selects and allocates a webpage capture node for each request, according to the set rule, from the managed and maintained list of available webpage capture nodes. It then sends a start signal to the returning unit u303, instructing it to begin operation.
The returning unit u303 is configured to return the address information of the selected webpage capture node to the requester.
After receiving the start signal sent by the allocation unit u302, this unit returns the address information of the selected webpage capture node to the requester.
The fourth embodiment of the application provides a method for crawling web pages, whose flowchart is shown in Fig. 5. The method can be implemented by a webpage capture node in a system for crawling web pages that comprises at least one webpage capture node and a cache device.
Step s401: receive a command to capture a specified web page.
The webpage capture node receives the command to capture the specified web page.
Step s402: capture the source code of the specified web page according to the uniform resource locator (URL) contained in the command.
After receiving the command, the webpage capture node can capture the source code of the specified web page in various ways. Preferably, the following method is used:
Using the URL in the command, the webpage capture node queries the cache device to determine whether a valid copy of the web page identified by this URL is stored there. If so, the node reads the source code of the web page directly from the cache device.
If the cache device does not hold the web page identified by the URL, the node combines the host identifier in the URL, such as the host's Internet Protocol (IP) address or domain name, with the network address of the current webpage capture node to form a query key. With this key it looks up, in the crawl feature records of the cache device, the time at which the current webpage capture node most recently accessed the host named in the URL.
If the difference between the current time and the most recent access time is greater than a preset access interval threshold, or if no corresponding time record is found in the crawl feature records of the cache device, the current webpage capture node accesses, over the network and using the protocol specified in the URL, the network host and path specified in the URL, and captures the specified web page.
If the difference between the current time and the most recent access time is less than or equal to the preset threshold, the current webpage capture node waits for a set time interval and then checks again whether the difference exceeds the preset access interval threshold. It repeats this until the difference exceeds the threshold, and only then accesses, over the network and using the protocol specified in the URL, the network host and path specified in the URL, and captures the specified web page.
After the current webpage capture node has accessed the network host and path specified in the URL and captured the specified web page, it combines the host identifier in the URL of the specified web page (such as the host's IP address or domain name), the network address of the current webpage capture node, and the time of the access to the host into one crawl feature record, and saves this record in the cache device. In this way, the crawl feature records in the cache device hold, for the network address of each capture node, the time of its most recent access to each network host.
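The preferred procedure of step s402 can be sketched in Python as below. This is a minimal, single-process illustration and not the implementation of the application: the class `CaptureNode`, the injected `fetch` callable, the parameter names and the two in-memory dictionaries that stand in for the cache device are all assumptions. Expiry of cached pages is omitted for brevity.

```python
import time
from urllib.parse import urlsplit


class CaptureNode:
    """Hypothetical capture node with cache lookup and per-host throttling."""

    def __init__(self, node_addr, fetch, min_interval_s=1.0,
                 retry_delay_s=0.1, clock=time.time, sleep=time.sleep):
        self.node_addr = node_addr            # network address of this node
        self.fetch = fetch                    # hypothetical: url -> source text
        self.min_interval_s = min_interval_s  # access interval threshold
        self.retry_delay_s = retry_delay_s    # wait before checking again
        self.clock = clock
        self.sleep = sleep
        # Stand-ins for the cache device:
        self.page_cache = {}     # url -> web page source code
        self.crawl_records = {}  # (host id, node address) -> last access time

    def capture(self, url):
        # 1. If the cache already holds the page, read it from the cache.
        if url in self.page_cache:
            return self.page_cache[url]
        # 2. Query key: host identifier from the URL plus this node's address.
        key = (urlsplit(url).hostname, self.node_addr)
        # 3. Wait until the last access is older than the threshold;
        #    a missing record means the host may be accessed at once.
        while True:
            last = self.crawl_records.get(key)
            if last is None or self.clock() - last > self.min_interval_s:
                break
            self.sleep(self.retry_delay_s)
        # 4. Access the host, then save the crawl feature record and the page.
        source = self.fetch(url)
        self.crawl_records[key] = self.clock()
        self.page_cache[url] = source
        return source
```

Because the record is keyed by both the host identifier and the node's network address, capture nodes at different network addresses are throttled independently for the same host in this sketch.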
Step s403: return the obtained source code to the requester that issued the command to capture the specified web page.
The webpage capture node returns the source code of the captured web page to the requester that issued the command.
The above embodiment describes a method for crawling web pages. In addition to first querying the cache device to see whether the required web page is already stored, the method also checks how recently the host of the web page was accessed. This not only prevents overly frequent access to that host but also improves the success rate and efficiency of web page capture. Correspondingly, the fifth embodiment of the application provides a device for crawling web pages, whose structural block diagram is shown in Fig. 6.
The device of this embodiment includes a command receiving unit u501, a capture unit u502 and a web page returning unit u503.
The command receiving unit u501 is configured to receive a command to capture a specified web page.
This unit receives the command to capture the specified web page and sends a start signal to the capture unit u502, instructing it to begin operation.
The capture unit u502 is configured to capture the source code of the specified web page according to the URL contained in the command.
This unit receives the start signal sent by the command receiving unit u501, captures the source code of the specified web page according to the URL contained in the command, and sends a start signal to the web page returning unit u503, instructing it to begin operation.
The web page returning unit u503 is configured to return the obtained source code to the requester that issued the command to capture the specified web page.
This unit receives the start signal sent by the capture unit u502 and returns the obtained source code to the requester that issued the command to capture the specified web page.
The sixth embodiment of the application provides a method for crawling web pages. The method is used with a system for crawling web pages that comprises a master control node and webpage capture nodes, where the master control node manages the webpage capture nodes. Its flowchart is shown in Fig. 7.
Step s601: send a request to obtain a webpage capture node to the master control node.
A request to obtain a webpage capture node is sent to the master control node in the system for crawling web pages.
Step s602: receive the address information of the webpage capture node returned by the master control node.
The address information of the webpage capture node returned by the master control node is received.
Step s603: according to the address information of the webpage capture node, send a web page capture request to that node; the request includes at least the URL of the specified web page.
According to the address information returned by the master control node, a web page capture request is sent to the corresponding webpage capture node. The request includes the URL of the specified web page.
Step s604: receive the source code of the specified web page captured by the webpage capture node.
The source code of the specified web page captured by the webpage capture node is received.
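Seen from the requester, steps s601 to s604 amount to two round trips: one to the master control node and one to the allocated webpage capture node. The Python sketch below illustrates this order with in-process stand-ins instead of real network calls. The names `StubMaster`, `StubCaptureNode`, `crawl` and `connect`, and the dictionary used as the capture request, are assumptions made for the example.

```python
class StubMaster:
    """Hypothetical stand-in for the master control node."""

    def __init__(self, addresses):
        self.addresses = list(addresses)

    def get_capture_node(self):
        # Return the address of an allocated node, or None if none exists.
        return self.addresses[0] if self.addresses else None


class StubCaptureNode:
    """Hypothetical stand-in for a webpage capture node."""

    def __init__(self, pages):
        self.pages = pages  # url -> source code

    def handle(self, request):
        # The capture request carries at least the URL of the web page.
        return self.pages[request["url"]]


def crawl(master, connect, url):
    # s601: ask the master control node for a webpage capture node.
    # s602: receive the address information it returns.
    address = master.get_capture_node()
    if address is None:
        raise RuntimeError("no webpage capture node available")
    # s603: send the capture request, including the URL, to that node.
    # `connect(address)` is a hypothetical way to reach the node.
    node = connect(address)
    # s604: receive the source code of the specified web page.
    return node.handle({"url": url})
```

In a real deployment the two calls would be network requests; the sketch only fixes the order of the steps and the minimum content of the capture request.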
The above embodiment describes a method for crawling web pages by means of the system for crawling web pages. Correspondingly, the seventh embodiment of the application provides a device for crawling web pages. The device is used with a system for crawling web pages that comprises a master control node and webpage capture nodes, where the master control node manages the webpage capture nodes. Its structural block diagram is shown in Fig. 8.
The device of this embodiment includes a webpage capture node obtaining request unit u701, a webpage capture node address information acquiring unit u702, a capture request sending unit u703 and a source code receiving unit u704.
The webpage capture node obtaining request unit u701 is configured to send a request to obtain a webpage capture node to the master control node.
After sending the request to obtain a webpage capture node to the master control node, this unit sends a start signal to the webpage capture node address information acquiring unit u702, instructing it to begin operation.
The webpage capture node address information acquiring unit u702 is configured to receive the address information of the webpage capture node returned by the master control node.
After receiving the start signal sent by the webpage capture node obtaining request unit u701, this unit receives the address information of the webpage capture node returned by the master control node. It then sends a start signal to the capture request sending unit u703, instructing it to begin operation.
The capture request sending unit u703 is configured to receive the address information of the webpage capture node sent by the webpage capture node address information acquiring unit and, according to this address information, to send a web page capture request to the webpage capture node; the request includes at least the URL of the specified web page.
This unit receives the start signal sent by the webpage capture node address information acquiring unit u702, receives the address information of the webpage capture node sent by that unit and, according to this address information, sends a web page capture request to the webpage capture node; the request includes at least the URL of the specified web page. It then sends a start signal to the source code receiving unit u704, instructing it to begin operation.
The source code receiving unit u704 is configured to receive the source code of the specified web page captured by the webpage capture node.
After receiving the start signal sent by the capture request sending unit u703, this unit receives the source code of the specified web page captured by the webpage capture node.
The eighth embodiment of the application provides an electronic device, which includes an input device, an output device, a processor and a memory. The memory stores a software program which, when started, can allocate webpage capture nodes according to the following method:
receiving a request to obtain a webpage capture node;
according to a set rule, selecting and allocating a webpage capture node for each request from the managed and maintained list of available webpage capture nodes;
returning the address information of the selected webpage capture node to the requester.
The ninth embodiment of the application provides an electronic device, which includes an input device, an output device, a processor and a memory. The memory stores a software program which, when started, can capture web pages according to the following method:
receiving a command to capture a specified web page;
capturing the source code of the specified web page according to the URL contained in the command;
returning the obtained source code to the requester that issued the command to capture the specified web page.
Although the application is disclosed above by way of preferred embodiments, these are not intended to limit the application. Any person skilled in the art may make possible changes and modifications without departing from the spirit and scope of the application. The protection scope of the application is therefore defined by the claims of the application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
Memory may include non-persistent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
2. Those skilled in the art will understand that embodiments of the application may be provided as a method, a system or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.

Claims (24)

1. A system for crawling web pages, comprising at least one master control node, at least one webpage capture node and a communication network, the master control node and the webpage capture node being connected through the communication network, characterized in that:
the master control node receives a request to obtain a webpage capture node, selects and allocates, according to a set rule, a webpage capture node for each such request from the list of available webpage capture nodes that it manages and maintains, and outputs the information of the selected webpage capture node to the requester that asked for a webpage capture node;
the webpage capture node receives a request to capture a web page, performs the web page capture operation, and outputs the captured web page.
2. The system for crawling web pages according to claim 1, characterized by further comprising a cache device configured to receive and store the web page source code captured by the webpage capture nodes, for access by each webpage capture node; the cache device also stores, for the network address of each capture node, the time of its most recent access to each network host.
3. The system for crawling web pages according to claim 2, characterized in that the web page source code stored in the cache device is stored in correspondence with its URL.
4. The system for crawling web pages according to claim 3, characterized in that web page source code stored in the cache device is deleted once it has been stored for longer than a set duration threshold.
5. The system for crawling web pages according to claim 1, characterized in that the webpage capture node sends its own identity information to the master control node at set time intervals; after receiving the identity information sent by the webpage capture node, the master control node determines from it whether the webpage capture node is already present in the list of available webpage capture nodes and, if not, records the webpage capture node in the list of available webpage capture nodes.
6. The system for crawling web pages according to claim 5, characterized in that the master control node sends probe messages at set time intervals to all webpage capture nodes in the list of available webpage capture nodes and receives the responses from these webpage capture nodes; for a webpage capture node that does not respond, the master control node deletes the record of that node from the list of available webpage capture nodes that it manages and maintains.
7. The system for crawling web pages according to claim 5, characterized in that the identity information includes the network address of the webpage capture node and its process port number.
8. A method for distributing webpage capture nodes, characterized in that the master control node in a system for crawling web pages that comprises at least one master control node and at least one webpage capture node performs the following steps:
receiving a request to obtain a webpage capture node;
according to a set rule, selecting and allocating a webpage capture node for each request from the managed and maintained list of available webpage capture nodes;
returning the address information of the selected webpage capture node to the requester.
9. The method for distributing webpage capture nodes according to claim 8, characterized in that the master control node receives the identity information that a webpage capture node sends about itself at set time intervals; according to this identity information, the master control node determines whether the webpage capture node is already present in the list of available webpage capture nodes that the master control node manages and maintains and, if not, records the webpage capture node in the list of available webpage capture nodes.
10. The method for distributing webpage capture nodes according to claim 8, characterized in that the master control node sends probe messages at set time intervals to all webpage capture nodes in the list of available webpage capture nodes and receives the responses from these webpage capture nodes; for a webpage capture node that does not respond, the master control node deletes the record of that node from the list of available webpage capture nodes that it manages and maintains.
11. A method for crawling web pages, characterized in that it is applied in a system for crawling web pages, the system comprising at least one webpage capture node, the method comprising:
the webpage capture node receiving a command to capture a specified web page;
capturing the source code of the specified web page according to the URL contained in the command;
returning the obtained source code to the requester that issued the command to capture the specified web page.
12. The method for crawling web pages according to claim 11, characterized in that the system for crawling web pages further comprises a cache device, and the step of capturing the source code of the specified web page according to the URL contained in the command comprises:
accessing the cache device according to the URL and determining whether the cache device stores the source code of the web page identified by the URL;
if so, reading the source code stored in the cache device as the source code of the captured web page;
if not, accessing the web page over the network according to the URL and obtaining the source code of the web page.
13. The method for crawling web pages according to claim 12, characterized in that, after the step of accessing the web page over the network according to the URL and obtaining the source code of the web page, the following step is performed:
extracting the host identifier in the URL, the network address of the current webpage capture node and the time at which the web page source code was accessed, forming them into a crawl feature record, and recording it in the cache device; by storing the crawl feature records, the cache device stores, for the network address of each capture node, the time of its most recent access to each network host.
14. The method for crawling web pages according to claim 13, characterized in that, before the step of accessing the web page over the network according to the URL and obtaining the source code of the web page, the following steps are performed:
using the crawl feature records in the cache device, querying the time at which the network address of the current webpage capture node most recently accessed the host in the requested URL;
determining whether the difference between the current time and the most recent access time is greater than a set access interval threshold;
if so, performing the step of accessing the web page over the network according to the URL and obtaining the source code of the web page;
if not, waiting for a set time and then returning to the step of determining whether the difference between the current time and the most recent access time is greater than the set access interval threshold.
15. The method for crawling web pages according to claim 13 or 14, characterized in that the crawl feature record includes a queryable key and a queried value; the queryable key consists of the host identifier in the URL and the network address of the current webpage capture node; the queried value is the time of access to the host; using the crawl feature records in the cache device to query the time at which the current webpage capture node most recently accessed the host of the requested URL consists of forming a query key from the network address of the current webpage capture node and the host identifier in the requested URL, and using this query key to look up the queried value in the crawl feature records.
16. The method for crawling web pages according to claim 13, characterized in that the host identifier includes the Internet Protocol address or the domain name of the host.
17. The method for crawling web pages according to claim 11, characterized in that the system for crawling web pages further comprises a master control node, and the webpage capture node sends its own identity information to the master control node at set time intervals.
18. The method for crawling web pages according to claim 11, characterized in that the system for crawling web pages further comprises a master control node, and the webpage capture node receives the probe message sent by the master control node and responds to it.
19. A device for distributing webpage capture nodes, comprising:
a receiving unit, configured to receive a request to obtain a webpage capture node;
an allocation unit, configured to select and allocate, according to a set rule, a webpage capture node for each request from the managed and maintained list of available webpage capture nodes;
a returning unit, configured to return the address information of the selected webpage capture node to the requester.
20. A device for crawling web pages, comprising:
a command receiving unit, configured to receive a command to capture a specified web page;
a capture unit, configured to capture the source code of the specified web page according to the URL contained in the command;
a web page returning unit, configured to return the obtained source code to the requester that issued the command to capture the specified web page.
21. A method for crawling web pages, used with a system for crawling web pages that comprises a master control node and webpage capture nodes, where the master control node manages the webpage capture nodes, characterized by comprising the following steps:
sending a request to obtain a webpage capture node to the master control node;
receiving the address information of the webpage capture node returned by the master control node;
according to the address information of the webpage capture node, sending a web page capture request to the webpage capture node, the request including at least the URL of the specified web page;
receiving the source code of the specified web page captured by the webpage capture node.
22. A device for crawling web pages, used with a system for crawling web pages that comprises a master control node and webpage capture nodes, where the master control node manages the webpage capture nodes, characterized by comprising:
a webpage capture node obtaining request unit, configured to send a request to obtain a webpage capture node to the master control node;
a webpage capture node address information acquiring unit, configured to receive the address information of the webpage capture node returned by the master control node;
a capture request sending unit, configured to receive the address information of the webpage capture node sent by the webpage capture node address information acquiring unit and, according to this address information, to send a web page capture request to the webpage capture node, the request including at least the URL of the specified web page;
a source code receiving unit, configured to receive the source code of the specified web page captured by the webpage capture node using the method of claims 9-14.
23. An electronic device, characterized in that the electronic device includes an input device, an output device, a processor and a memory, the memory storing a software program which, when started, can allocate webpage capture nodes according to the following method:
receiving a request to obtain a webpage capture node;
according to a set rule, selecting and allocating a webpage capture node for each request from the managed and maintained list of available webpage capture nodes;
returning the address information of the selected webpage capture node to the requester.
24. An electronic device, characterized in that the electronic device includes an input device, an output device, a processor and a memory, the memory storing a software program which, when started, can capture web pages according to the following method:
receiving a command to capture a specified web page;
capturing the source code of the specified web page according to the URL contained in the command;
returning the obtained source code to the requester that issued the command to capture the specified web page.
CN201510397674.4A 2015-07-08 2015-07-08 System for capturing webpage, method for distributing webpage capturing nodes and method for capturing webpage Active CN106339385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510397674.4A CN106339385B (en) 2015-07-08 2015-07-08 System for capturing webpage, method for distributing webpage capturing nodes and method for capturing webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510397674.4A CN106339385B (en) 2015-07-08 2015-07-08 System for capturing webpage, method for distributing webpage capturing nodes and method for capturing webpage

Publications (2)

Publication Number Publication Date
CN106339385A true CN106339385A (en) 2017-01-18
CN106339385B CN106339385B (en) 2020-06-16

Family

ID=57827049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510397674.4A Active CN106339385B (en) 2015-07-08 2015-07-08 System for capturing webpage, method for distributing webpage capturing nodes and method for capturing webpage

Country Status (1)

Country Link
CN (1) CN106339385B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423382A (en) * 2017-07-13 2017-12-01 中国物品编码中心 network crawling method and device
CN110442770A (en) * 2019-08-08 2019-11-12 深圳市今天国际物流技术股份有限公司 A kind of data grabber and store method, device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902669A (en) * 2011-07-22 2013-01-30 同程网络科技股份有限公司 Distribution information capturing method based on internet system
CN103559219A (en) * 2013-10-18 2014-02-05 北京京东尚科信息技术有限公司 Distributed web crawler capture task dispatching method, dispatching-side device and capture nodes
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology

Also Published As

Publication number Publication date
CN106339385B (en) 2020-06-16

Similar Documents

Publication Publication Date Title
US10785322B2 (en) Server side data cache system
CN103970788A (en) Webpage-crawling-based crawler technology
CN102902805B (en) A kind of page access method and apparatus
CN106294352B (en) A kind of document handling method, device and file system
CN105243159A (en) Visual script editor-based distributed web crawler system
CN101482882A (en) Method and system for cross-domain treatment of COOKIE
CN102752288A (en) Method and device for identifying network access action
CN103559300B (en) The querying method and inquiry unit of data
CN106933871A (en) Short linking processing method, device and short linked server
CN105743988B (en) Network user's tracing implementing method, apparatus and system
CN110471949A (en) Data consanguinity analysis method, apparatus, system, server and storage medium
US20240061893A1 (en) Method, device and computer program for collecting data from multi-domain
CN110266661A (en) A kind of authorization method, device and equipment
CN106897336A (en) Web page files sending method, webpage rendering intent and device, webpage rendering system
CN106294826A (en) A kind of company-data Query method in real time and system
CN106302595A (en) A kind of method and apparatus that server is carried out physical examination
KR20180074774A (en) How to identify malicious websites, devices and computer storage media
CN108429785A (en) A kind of generation method, reptile recognition methods and the device of reptile identification encryption string
CN106817388A (en) The system that virtual machine, host obtain the method, device and access data of data
CN104423982A (en) Request processing method and device
CN107580052A (en) From the network self-adapting reptile method and system of evolution
CN110365810A (en) Domain name caching method, device, equipment and storage medium based on web crawlers
CN105791370B (en) A kind of data processing method and associated server
CN106326280A (en) Data processing method, apparatus and system
CN110532455A (en) A kind of Web page picture acquisition methods and system based on Chrome browser

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240226

Address after: Room 01, 45th Floor, AXA Tower, 8 Shenton Way, Singapore

Patentee after: Alibaba Singapore Holdings Ltd.

Country or region after: Singapore

Address before: Fourth Floor, P.O. Box 847, Capital Building, Grand Cayman, Cayman Islands

Patentee before: ALIBABA GROUP HOLDING Ltd.

Country or region before: Cayman Islands
