CN106339385A - System for crawling webpages, method for distributing webpage crawling nodes and method for crawling webpages - Google Patents
- Publication number: CN106339385A
- Application number: CN201510397674.4A
- Authority: CN (China)
- Prior art keywords: webpage, node, crawl, webpage capture, capture
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/951—Indexing; Web crawling techniques
Abstract
The application discloses a system for crawling webpages, a method and device for allocating webpage crawling nodes, a method and device for crawling webpages, and two types of electronic devices. The system comprises at least one master control node, at least one webpage crawling node, and a communication network; the master control node and the webpage crawling node are connected through the communication network. The master control node receives requests to obtain webpage crawling nodes; for each request it selects and allocates, according to a set rule, a webpage crawling node from the list of available webpage crawling nodes that it manages and maintains, and outputs the information of the allocated node to the requester. The webpage crawling node receives a request to crawl a webpage, executes the crawling operation, and outputs the crawled webpage. The system, methods, devices, and electronic devices can crawl large numbers of webpages efficiently and promptly.
Description
Technical field
The application relates to a system for crawling webpages. The application also relates to a method and device for allocating webpage crawling nodes, and to a method and device for crawling webpages. The application further relates to two types of electronic devices.
Background
With the rapid development of the Internet, it has become the carrier of a vast amount of information. To make use of resources on the Internet, it is sometimes necessary to access and download a large number of webpages at the same time; accessing and downloading a webpage is also called crawling the webpage. The existing system for crawling web resources at scale is the web crawler system, which crawls webpages automatically and on a large scale. It starts from the URLs of one or several initial pages, obtains the URLs found on those pages, puts them into a queue of webpages to be crawled, and then crawls the pages in the queue one by one. While crawling, it keeps extracting new URLs from the current page and adding them to the queue until a stop condition of the system is met. In addition, all pages crawled by the web crawler system are stored, analyzed, filtered, and indexed for later query and retrieval. The basic structure of an existing web crawler system is shown in Fig. 1. As can be seen, existing methods for automatic large-scale crawling focus on sustained crawling of network resources. The page content they crawl must be analyzed and processed before users can reach it through a specific interface, so the content cannot be accessed by users in real time. Where specified pages must be crawled in real time, the analysis, filtering, and indexing that an existing web crawler system applies to crawled pages can make the crawl take too long, or even fail to return the page content at all.
Moreover, a single computer that fetches webpages in real time is limited by its performance, such as processor capacity, the load its interfaces can bear, or storage capacity, and therefore cannot sustain large-scale concurrent webpage crawling operations.
In summary, a mature system and method for large-scale concurrent webpage crawling is still lacking.
Summary of the invention
The application provides a system for crawling webpages, to solve the problem that existing methods for automatic large-scale crawling take too long or even fail to return page content. The application also provides a method and device for allocating webpage crawling nodes, a method and device for crawling webpages, and two related electronic devices.
The application provides a system for crawling webpages, comprising at least one master control node, at least one webpage crawling node, and a communication network, the master control node and the webpage crawling node being connected through the communication network, characterized in that:
the master control node receives requests to obtain a webpage crawling node, selects and allocates a webpage crawling node for each such request, according to a set rule, from the list of available webpage crawling nodes that it manages and maintains, and outputs the information of the selected webpage crawling node to the requester;
the webpage crawling node receives a request to crawl a webpage, executes the crawling operation, and outputs the crawled webpage.
Optionally, the system further comprises a cache device for receiving and storing the webpage source code crawled by the webpage crawling nodes, for access by each webpage crawling node; the cache device also stores the time at which the network address of each crawling node most recently accessed each network host.
Optionally, the webpage source code stored in the cache device is stored in correspondence with its URL.
Optionally, the webpage source code stored in the cache device is deleted once a set duration threshold is exceeded.
Optionally, the webpage crawling node sends its own identity information to the master control node at a set time interval; after receiving the identity information sent by the webpage crawling node, the master control node determines from that identity information whether the node already exists in the list of available webpage crawling nodes, and if not, records the node in the list.
Optionally, the master control node sends a detection message at a set time interval to all webpage crawling nodes in the list of available webpage crawling nodes and receives the responses from those nodes; for a webpage crawling node that does not respond, it deletes that node's record from the list it manages and maintains.
Optionally, the identity information comprises the network address of the webpage crawling node and its process port number.
The application provides a method for allocating webpage crawling nodes, in which the master control node in a system for crawling webpages comprising at least one master control node and at least one webpage crawling node performs the following steps:
receiving a request to obtain a webpage crawling node;
according to a set rule, selecting and allocating a corresponding webpage crawling node for each request from the managed and maintained list of available webpage crawling nodes;
returning the address information of the selected webpage crawling node to the requester.
Optionally, the master control node receives the identity information that a webpage crawling node sends about itself at a set time interval, determines from that identity information whether the node already exists in the list of available webpage crawling nodes managed and maintained by the master control node, and if not, records the node in the list.
Optionally, the master control node sends a detection message at a set time interval to all webpage crawling nodes in the list of available webpage crawling nodes and receives the responses from those nodes; for a webpage crawling node that does not respond, it deletes that node's record from the list it manages and maintains.
The application provides a method for crawling webpages, applied in a system for crawling webpages that comprises at least one webpage crawling node, the method comprising:
the webpage crawling node receiving a command to crawl a specified webpage;
crawling the source code of the specified webpage according to the URL contained in the command;
returning the obtained source code to the requester that issued the command.
Optionally, the system for crawling webpages further comprises a cache device, and the step of crawling the source code of the specified webpage according to the URL contained in the command comprises:
accessing the cache device according to the URL and determining whether the cache device stores the source code of the webpage identified by the URL;
if so, reading the source code stored in the cache device as the source code of the specified webpage;
if not, accessing the webpage over the network according to the URL and obtaining its source code.
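The cache-first lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `crawl_page`, `cache`, and `fetch` are hypothetical, a plain dictionary stands in for the cache device, and the network access is passed in as a function so the sketch runs without a network.

```python
from typing import Callable, Dict, List


def crawl_page(url: str, cache: Dict[str, str], fetch: Callable[[str], str]) -> str:
    """Return the source code for url, reading the cache device first."""
    cached = cache.get(url)
    if cached is not None:
        # The cache device already holds the source identified by this URL.
        return cached
    # Otherwise access the webpage over the network (fetch is a stand-in).
    source = fetch(url)
    # Store the source under its URL so other crawling nodes can reuse it.
    cache[url] = source
    return source


# Usage with a fake fetcher that records every network access.
calls: List[str] = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html>" + url + "</html>"


cache: Dict[str, str] = {}
first = crawl_page("http://example.com/a", cache, fake_fetch)
second = crawl_page("http://example.com/a", cache, fake_fetch)
```

In this sketch the second call is served from the dictionary, so `fake_fetch` is invoked only once.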
Optionally, after the step of accessing the webpage over the network according to the URL and obtaining its source code, the following step is performed:
extracting the host identifier in the URL, the network address of the current webpage crawling node, and the time at which the webpage source code was accessed, forming a crawl feature record from them, and recording it in the cache device; by storing crawl feature records, the cache device stores the time at which the network address of each crawling node most recently accessed each network host.
Optionally, before the step of accessing the webpage over the network according to the URL and obtaining its source code, the following steps are performed:
using the crawl feature records in the cache device, querying the time at which the network address of this webpage crawling node most recently accessed the host in the requested URL;
determining whether the difference between that most recent access time and the current time is greater than a set access interval threshold;
if so, performing the step of accessing the webpage over the network according to the URL and obtaining its source code;
if not, waiting for a set time and then returning to the step of determining whether the difference between the most recent access time and the current time is greater than the set access interval threshold.
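The wait-and-recheck loop in these steps can be sketched as below. The function name `wait_for_access_slot`, the dictionary `records`, and the injectable `now` and `sleep` parameters are assumptions made for illustration. The source does not say what happens when no record exists for a host; this sketch assumes access is allowed immediately in that case.

```python
import time
from typing import Callable, Dict, List, Tuple


def wait_for_access_slot(
    records: Dict[Tuple[str, str], float],
    node_addr: str,
    host: str,
    min_interval: float,
    wait_step: float,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until this node's last access to host is older than min_interval."""
    key = (host, node_addr)
    while True:
        last = records.get(key)
        # Assumption: a host never accessed from this address may be accessed at once.
        if last is None or now() - last > min_interval:
            return
        # Not yet allowed: wait a set time, then repeat the check.
        sleep(wait_step)


# Usage with a fake clock so the example runs instantly.
clock = [100.0]
sleeps: List[float] = []


def fake_now() -> float:
    return clock[0]


def fake_sleep(seconds: float) -> None:
    sleeps.append(seconds)
    clock[0] += seconds


records = {("example.com", "10.0.0.1"): 99.0}
wait_for_access_slot(records, "10.0.0.1", "example.com",
                     min_interval=2.0, wait_step=0.5,
                     now=fake_now, sleep=fake_sleep)
```

Injecting the clock keeps the check itself identical to the described steps while making the loop testable without real delays.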
Optionally, the crawl feature record comprises a query key and a queried value; the query key is composed of the host identifier in the URL and the network address of the current webpage crawling node; the queried value is the time at which the host was accessed. Using the crawl feature records in the cache device to query the time at which the current webpage crawling node most recently accessed the host of the requested URL means composing a query key from the network address of the current webpage crawling node and the host identifier in the requested URL, and using that key to look up the queried value in the crawl feature records.
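One possible shape for the query key and queried value is sketched below. The helper names `feature_key`, `record_access`, and `last_access`, and the `host|address` key layout, are hypothetical; the source only states that the key combines the host identifier and the node's network address. The host is extracted with the standard library's `urllib.parse.urlsplit`.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def feature_key(url: str, node_addr: str) -> str:
    """Compose the query key from the URL's host and the node's network address."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError("URL has no host: " + url)
    # The "|" separator is an arbitrary choice for this sketch.
    return host + "|" + node_addr


def record_access(records: Dict[str, float], url: str, node_addr: str,
                  accessed_at: float) -> None:
    """Store the access time as the queried value for this host and node."""
    records[feature_key(url, node_addr)] = accessed_at


def last_access(records: Dict[str, float], url: str, node_addr: str) -> Optional[float]:
    """Look up the most recent access time, or None if there is no record."""
    return records.get(feature_key(url, node_addr))


# Usage: two URLs on the same host share one record per crawling node.
records: Dict[str, float] = {}
record_access(records, "http://example.com:8080/a?x=1", "10.0.0.1", 1000.0)
same_host = last_access(records, "http://example.com/other", "10.0.0.1")
other_node = last_access(records, "http://example.com/other", "10.0.0.2")
```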
Optionally, the host identifier comprises the Internet Protocol address or the domain name of the host.
Optionally, the system for crawling webpages further comprises a master control node, and the webpage crawling node sends its own identity information to the master control node at a set time interval.
Optionally, the system for crawling webpages further comprises a master control node, and the webpage crawling node receives the detection message sent by the master control node and responds to it.
The application provides a device for allocating webpage crawling nodes, comprising:
a receiving unit, for receiving a request to obtain a webpage crawling node;
an allocation unit, for selecting and allocating, according to a set rule, a corresponding webpage crawling node for each request from the managed and maintained list of available webpage crawling nodes;
a returning unit, for returning the address information of the selected webpage crawling node to the requester.
The application provides a device for crawling webpages, comprising:
a command receiving unit, for receiving a command to crawl a specified webpage;
a crawling unit, for crawling the source code of the specified webpage according to the URL contained in the command;
a webpage returning unit, for returning the obtained source code to the requester that issued the command.
The application provides a method for crawling webpages, used with a system for crawling webpages that comprises a master control node and webpage crawling nodes, wherein the master control node manages each webpage crawling node, characterized in that the method comprises the following steps:
sending a request to obtain a webpage crawling node to the master control node;
receiving the address information of the webpage crawling node returned by the master control node;
according to that address information, sending a request to crawl a webpage to the webpage crawling node, the request containing at least the URL of the specified webpage;
receiving the source code of the specified webpage crawled by the webpage crawling node.
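From the requester's side, the four steps above reduce to two round trips, which can be sketched as follows. The names `crawl_via_master`, `request_node`, and `send_crawl_request` are hypothetical; the two callables stand in for whatever network transport a real deployment would use, which the source does not specify.

```python
from typing import Callable, List, Tuple


def crawl_via_master(
    request_node: Callable[[], str],
    send_crawl_request: Callable[[str, str], str],
    url: str,
) -> str:
    """Ask the master control node for a crawling node, then crawl through it."""
    # Steps 1 and 2: request a crawling node and receive its address information.
    node_address = request_node()
    # Steps 3 and 4: send the crawl request (carrying the URL) and receive the source.
    return send_crawl_request(node_address, url)


# Usage with in-process stand-ins for the master control node and a crawling node.
def fake_request_node() -> str:
    return "10.0.0.2:9000"


sent: List[Tuple[str, str]] = []


def fake_send(address: str, url: str) -> str:
    sent.append((address, url))
    return "<html>source of " + url + "</html>"


source = crawl_via_master(fake_request_node, fake_send, "http://example.com/")
```

Note that the master control node only hands out an address in this flow; the page source travels directly between the requester and the crawling node.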
The application provides a device for crawling webpages, used with a system for crawling webpages that comprises a master control node and webpage crawling nodes, wherein the master control node manages each webpage crawling node, characterized in that the device comprises:
a crawling node request unit, for sending a request to obtain a webpage crawling node to the master control node;
a crawling node address information acquiring unit, for receiving the address information of the webpage crawling node returned by the master control node;
a crawl request sending unit, for receiving the address information of the webpage crawling node sent by the crawling node address information acquiring unit and, according to that address information, sending a request to crawl a webpage to the webpage crawling node, the request containing at least the URL of the specified webpage;
a source code receiving unit, for receiving the source code of the specified webpage crawled by the webpage crawling node using the method of claims 9-14.
The application provides an electronic device, characterized in that the electronic device comprises an input device, an output device, a processor, and a memory, the memory storing a software program which, when started, allocates webpage crawling nodes according to the following method:
receiving a request to obtain a webpage crawling node;
according to a set rule, selecting and allocating a corresponding webpage crawling node for each request from the managed and maintained list of available webpage crawling nodes;
returning the address information of the selected webpage crawling node to the requester.
The application provides an electronic device, characterized in that the electronic device comprises an input device, an output device, a processor, and a memory, the memory storing a software program which, when started, crawls webpages according to the following method:
receiving a command to crawl a specified webpage;
crawling the source code of the specified webpage according to the URL contained in the command;
returning the obtained source code to the requester that issued the command.
Compared with the prior art, the application has the following advantages:
The technical solution provided by the application performs the crawling operation in real time and returns the crawled webpage directly, without further processing. This shortens the response time to a crawl request and ensures that the requested webpage is returned successfully, promptly, and quickly.
At the same time, because the master control node selects and allocates webpage crawling nodes, the technical solution can distribute a large number of crawling operations arriving within a short time across different webpage crawling nodes, unlike real-time crawling on a single computer. This overcomes the problem that a single computer, limited by its performance, cannot sustain large-scale concurrent crawling operations. Webpage crawling nodes can be added or removed flexibly without affecting crawling operations, so the number of nodes is easy to scale or change, which helps improve the success rate of crawling operations and keep them real-time.
In a preferred mode of the application, the cache device stores the crawled webpages, which saves the time and resources spent on network access. In a further preferred embodiment, the webpage source code stored in the cache device is deleted once a set duration threshold is exceeded, which ensures that the information obtained from the cache is kept up to date.
In another preferred mode, the cache device stores the time at which the network address of each crawling node most recently accessed each network host, so that the mechanisms by which Internet hosts block overly frequent access are not triggered.
Brief description of the drawings
Fig. 1 is a schematic diagram of the basic structure of a prior-art web crawler system;
Fig. 2 is a structural block diagram of the system for crawling webpages provided by the first embodiment of the application;
Fig. 3 is a flow chart of the method for allocating webpage crawling nodes provided by the second embodiment of the application;
Fig. 4 is a structural block diagram of the device for allocating webpage crawling nodes provided by the third embodiment of the application;
Fig. 5 is a flow chart of the method for crawling webpages provided by the fourth embodiment of the application;
Fig. 6 is a structural block diagram of the device for crawling webpages provided by the fifth embodiment of the application;
Fig. 7 is a flow chart of the method for crawling webpages provided by the sixth embodiment of the application;
Fig. 8 is a structural block diagram of the device for crawling webpages provided by the seventh embodiment of the application.
Detailed description of the embodiments
Many specific details are set forth in the following description so that the application can be fully understood. The application can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited by the specific implementations disclosed below.
The master control node and the webpage crawling node described here may each be an independent device such as a computer, or they may be different software programs or software processes on the same device such as a computer.
The first embodiment of the application provides a system for crawling webpages; its structural block diagram is shown in Fig. 2.
Referring to Fig. 2, the system of this embodiment comprises a master control node n101, a first webpage crawling node n102, a second webpage crawling node n103, and a communication network n104.
The master control node n101, the first webpage crawling node n102, and the second webpage crawling node n103 are connected through the communication network n104.
The master control node n101 manages and maintains a list of available webpage crawling nodes. On receiving a request to obtain a webpage crawling node, the master control node n101 selects and allocates a node from that list according to a set rule.
For example, suppose the list of available webpage crawling nodes currently holds the first webpage crawling node n102 and the second webpage crawling node n103, with n102 in the first position and n103 in the second. Under a set rule such as always selecting, by position order in the managed and maintained list, the node in the first position, the master control node selects the first webpage crawling node n102 and outputs the address information of n102 to the requester.
After the first webpage crawling node n102 receives the request to crawl a webpage, it executes the crawling operation and outputs the crawled webpage.
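The allocation step in this embodiment can be sketched with a pluggable rule. The class `MasterNode`, the function `first_position`, and the behavior on an empty list are assumptions for illustration; the embodiment itself only gives "select the node in the first position" as one example of a set rule.

```python
from typing import Callable, List, Sequence

# A rule picks one node address from the list of available crawling nodes.
Rule = Callable[[Sequence[str]], str]


def first_position(nodes: Sequence[str]) -> str:
    """Example rule from the embodiment: take the node in the first position."""
    return nodes[0]


class MasterNode:
    """Holds the list of available crawling nodes and allocates one per request."""

    def __init__(self, rule: Rule = first_position) -> None:
        self.available: List[str] = []
        self.rule = rule

    def allocate(self) -> str:
        # Assumption: with no available node the request fails with an error.
        if not self.available:
            raise LookupError("no available webpage crawling node")
        return self.rule(self.available)


# Usage mirroring the example: n102 is first in the list, n103 second.
master = MasterNode()
master.available = ["n102", "n103"]
chosen = master.allocate()
```

Keeping the rule as a separate function leaves room for other set rules, such as picking the last node, without changing the master control node itself.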
To keep the list of available webpage crawling nodes managed and maintained by the master control node n101 up to date, the master control node n101 and the first and second webpage crawling nodes n102 and n103 preferably verify each other periodically over the communication network using a periodic two-way heartbeat mechanism.
The periodic two-way heartbeat mechanism comprises two parts: each webpage crawling node periodically reports to the master control node that it is working normally, and the master control node periodically and actively sends verification messages to the webpage crawling nodes it has recorded.
Taking this embodiment as an example, the webpage crawling nodes report that they are working normally as follows. After starting, the first webpage crawling node n102 and the second webpage crawling node n103 each periodically send their own identity information, for example the Internet address of the node and its process port number, to the master control node n101 at a set time interval. After receiving the identity information sent by n102 and n103, the master control node n101 determines from it whether n102 and n103 already exist in the list of available webpage crawling nodes. If the first webpage crawling node n102 is already in the list and the second webpage crawling node n103 is not, the master control node records n103 in the list. This mechanism lets the master control node learn of every usable webpage crawling node and avoids omissions.
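The node-to-master half of the heartbeat can be sketched as a registry that records a node only the first time its identity arrives. The names `NodeRegistry` and `on_heartbeat` are hypothetical, and the identity is modeled as an (address, port) tuple following the example identity information given above.

```python
from typing import List, Tuple

# Identity information: (network address, process port number).
Identity = Tuple[str, int]


class NodeRegistry:
    """List of available crawling nodes kept by the master control node."""

    def __init__(self) -> None:
        self.available: List[Identity] = []

    def on_heartbeat(self, identity: Identity) -> bool:
        """Handle one heartbeat; return True if the node was newly recorded."""
        if identity in self.available:
            # Already in the list: nothing to do.
            return False
        self.available.append(identity)
        return True


# Usage: n102 is already known, n103 reports for the first time.
registry = NodeRegistry()
registry.on_heartbeat(("10.0.0.2", 9000))   # n102 registers
again = registry.on_heartbeat(("10.0.0.2", 9000))
new = registry.on_heartbeat(("10.0.0.3", 9000))
```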
Taking this embodiment as an example, the master control node periodically and actively verifies the webpage crawling nodes it has recorded as follows. At a set time interval, the master control node n101 periodically sends a detection message to every webpage crawling node in the list of available webpage crawling nodes that it manages and maintains. For instance, if the list holds the first webpage crawling node n102 and the second webpage crawling node n103, the master control node n101 periodically sends a detection message to each of these two nodes at the set interval. If n102 responds to the detection message and n103 does not, the master control node deletes the record of the unresponsive node n103 from the list of available webpage crawling nodes that it manages and maintains. This mechanism prevents the master control node from handing out webpage crawling nodes that are unavailable.
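The master-to-node half of the heartbeat can be sketched as one probing round over the list. The function `prune_unresponsive` and the `probe` callable are hypothetical; `probe` stands in for sending a detection message and waiting for a response, whose timeout handling the source does not describe.

```python
from typing import Callable, List


def prune_unresponsive(available: List[str], probe: Callable[[str], bool]) -> List[str]:
    """Probe every listed node once and delete the records of those that do not respond."""
    # probe(node) returns True when the node answered the detection message.
    alive = [node for node in available if probe(node)]
    removed = [node for node in available if node not in alive]
    # Update the list in place so every holder of the list sees the change.
    available[:] = alive
    return removed


# Usage mirroring the example: n102 responds, n103 does not.
available = ["n102", "n103"]
removed = prune_unresponsive(available, lambda node: node == "n102")
```

In a running system this round would be repeated at the set time interval, for example from a timer thread.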
In another preferred mode, the system for crawling webpages may also include the cache device n105 drawn with dashed lines in the figure. The cache device is connected to the master control node n101, the first webpage crawling node n102, and the second webpage crawling node n103 through the communication network n104. The cache device n105 receives and stores the webpage source code crawled by the webpage crawling nodes, for access by each webpage crawling node. The stored webpage source code is identified by its URL. Once webpage source code has been stored in the cache device n105, it is deleted after a set duration threshold is exceeded. The cache device also stores the time at which the network address of each crawling node most recently accessed each network host.
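A cache keyed by URL with a duration threshold can be sketched as below. The class `PageCache` and its parameters are hypothetical, and deleting an expired entry lazily on read is a design choice of this sketch; the source only states that stored source code is deleted after the set duration threshold is exceeded.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PageCache:
    """Stores webpage source code by URL and drops entries older than ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.now = now
        self._entries: Dict[str, Tuple[float, str]] = {}

    def put(self, url: str, source: str) -> None:
        # Remember when the source was stored so its age can be checked later.
        self._entries[url] = (self.now(), source)

    def get(self, url: str) -> Optional[str]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, source = entry
        if self.now() - stored_at > self.ttl:
            # Duration threshold exceeded: delete the entry (lazily, on read).
            del self._entries[url]
            return None
        return source


# Usage with a fake clock: the entry is served at first, then expires.
clock = [0.0]
cache = PageCache(ttl_seconds=60.0, now=lambda: clock[0])
cache.put("http://example.com/", "<html>hi</html>")
fresh = cache.get("http://example.com/")
clock[0] = 61.0
stale = cache.get("http://example.com/")
```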
The system in this embodiment contains only one master control node and two webpage crawling nodes, the first webpage crawling node n102 and the second webpage crawling node n103. In practice, the numbers of master control nodes and webpage crawling nodes can be configured flexibly according to the number of webpages to crawl, and different master control nodes, different webpage crawling nodes, or even master control nodes and webpage crawling nodes may be placed on the same device, such as a computer. Because the master control node and the webpage crawling nodes exchange information through the periodic two-way heartbeat mechanism, the numbers of master control nodes and webpage crawling nodes can be increased or decreased flexibly, without affecting normal operation, to meet practical needs. In the preferred mode, because the cache device temporarily stores the crawled webpage source code and stores the time at which the network address of each crawling node most recently accessed each network host, overly frequent access to the host of a webpage is avoided while the timeliness of the crawled webpages is ensured.
The second embodiment of the application provides a method for allocating webpage capture nodes;
its flow chart is shown in Fig. 3. The method can be carried out by the main control node in a
system for crawling webpages that contains at least one main control node and at least one
webpage capture node. The method is described below with reference to Fig. 3.
Step s201: receive a request to obtain a webpage capture node.
The main control node receives a request to obtain a webpage capture node.
Step s202: according to a set rule, select and allocate a corresponding webpage capture node for
each request from the list of available webpage capture nodes that is managed and maintained.
The main control node applies a set rule to the list of available webpage capture nodes that it
manages and maintains. For example, following the order of the nodes in the list, it selects the
webpage capture node in the first position of the list and assigns it as the node that will
perform the webpage capture.
Step s203: return the address information of the selected webpage capture node to the requesting party.
The main control node returns the address information of the selected webpage capture node to the
requesting party.
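Steps s201 to s203 can be sketched as a single selection function. The default below follows
the example rule in the text (take the node in the first position of the list); the function
name `allocate_node`, the optional `rule` parameter and the `LookupError` raised for an empty
list are assumptions made for illustration only.

```python
def allocate_node(available_nodes, rule=None):
    """Pick a webpage capture node for one request and return its address information.

    available_nodes : ordered list of node addresses, e.g. (host, port) tuples
    rule            : optional function (list) -> node; a hypothetical hook for
                      other "set rules" than the first-in-list example
    """
    if not available_nodes:
        # The text does not say what happens when no node is available;
        # raising an error is an assumption of this sketch.
        raise LookupError("no webpage capture node is available")
    if rule is None:
        return available_nodes[0]        # example rule: first position in the list
    return rule(available_nodes)
```

For instance, `allocate_node(nodes, rule=random.choice)` would swap in a random selection rule; that variant is not described in the application.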
The embodiment above describes a method for allocating webpage capture nodes. Correspondingly, the
third embodiment of the application provides a device for allocating webpage capture nodes; its
structural block diagram is shown in Fig. 4. The device of this embodiment includes a receiving
unit u301, an allocation unit u302 and a returning unit u303.
The receiving unit u301 is configured to receive a request to obtain a webpage capture node.
After receiving such a request, this unit sends a start signal to the allocation unit u302,
instructing the allocation unit u302 to operate.
The allocation unit u302 is configured to select and allocate, according to a set rule, a
corresponding webpage capture node for each request from the list of available webpage capture
nodes that is managed and maintained.
After receiving the start signal sent by the receiving unit u301, this unit selects and allocates,
according to the set rule, a corresponding webpage capture node for each request from the list of
available webpage capture nodes, and then sends a start signal to the returning unit u303,
instructing the returning unit u303 to operate.
The returning unit u303 is configured to return the address information of the selected webpage
capture node to the requesting party.
After receiving the start signal sent by the allocation unit u302, this unit returns the address
information of the selected webpage capture node to the requesting party.
The fourth embodiment of the application provides a method for crawling webpages; its flow chart
is shown in Fig. 5. The method can be carried out by a webpage capture node in a system for
crawling webpages that includes at least one webpage capture node and a cache device.
Step s401: receive a command to capture a specified webpage.
The webpage capture node obtains a command to capture a specified webpage.
Step s402: capture the source code of the specified webpage according to the URL contained in
the command.
After obtaining the command, the webpage capture node can capture the source code of the
specified webpage in various ways. Preferably, the following method is used:
Using the URL in the command, the webpage capture node queries whether the cache device holds a
valid copy of the webpage identified by that URL. If so, it obtains the source code of the
webpage directly from the cache device.
If the cache device does not hold the webpage identified by the URL, the host identifier in the
URL, such as the host's Internet Protocol address or domain name, is combined with the network
address at which the current webpage capture node is located to form a query key. This key is
used to look up, in the crawl feature records of the cache device, the time at which the current
webpage capture node most recently accessed the host in the URL.
If the difference between the current time and the most recent access time found is greater than
a preset access interval threshold, or if no corresponding time record is found in the crawl
feature records of the cache device, the current webpage capture node accesses, over the network
and using the protocol specified in the URL, the network host and path specified in the URL, and
captures the specified webpage.
If the difference between the current time and the most recent access time found is less than or
equal to the preset threshold, the current webpage capture node waits for a set time interval and
then checks again whether the difference exceeds the preset access interval threshold. It repeats
this until the difference exceeds the threshold, and then accesses, over the network and using
the protocol specified in the URL, the network host and path specified in the URL, and captures
the specified webpage.
After the current webpage capture node has captured the specified webpage in this way, it
combines the host identifier in the URL of the specified webpage, such as the host's Internet
Protocol address or domain name, and the network address at which the current webpage capture
node is located with the time at which the host in the URL was accessed, forming a crawl feature
record that is saved in the cache device. In this way, the time of the most recent access to each
network host from the network address of each capture node is kept in the crawl feature records
of the cache device.
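The preferred method of step s402 can be sketched as one function: check the cache first, then
enforce the access interval per (host, node address) key, then download and save a new crawl
feature record. Plain dictionaries stand in for the cache device, and `download`, `clock`,
`sleep` and `retry_delay` are hypothetical parameters introduced so that the sketch runs
without a network.

```python
import time
from urllib.parse import urlsplit


def fetch_source(url, node_address, page_cache, access_log, min_interval,
                 download, clock=time.time, sleep=time.sleep, retry_delay=1.0):
    """Return the source for `url`, honoring the cache and a per-host access interval.

    page_cache : dict url -> source (stand-in for pages held by the cache device)
    access_log : dict (host, node_address) -> last access time (crawl feature records)
    download   : caller-supplied function url -> source; the real network access
    """
    cached = page_cache.get(url)
    if cached is not None:
        return cached                    # cache hit: no network access needed

    host = urlsplit(url).hostname        # host identifier (IP address or domain name)
    key = (host, node_address)           # query key: host identifier + node address
    while True:
        last = access_log.get(key)
        # Proceed if no record exists or the gap exceeds the interval threshold.
        if last is None or clock() - last > min_interval:
            break
        sleep(retry_delay)               # wait a set interval, then check again

    source = download(url)
    access_log[key] = clock()            # save a new crawl feature record
    page_cache[url] = source             # keep the page for other capture nodes
    return source
```

Keying the record on both the host and the node's own network address means two capture nodes at different addresses are throttled independently for the same host, which matches the key composition described in the text.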
Step s403: return the obtained source code to the party that issued the command to capture the
specified webpage.
The webpage capture node returns the captured source code of the specified webpage to the party
that issued the command.
The embodiment above describes a method for crawling webpages. In addition to first querying
whether the cache device already holds the required webpage, the method also checks how
frequently the host of the webpage is accessed. This not only effectively prevents overly
frequent access to that host but also improves the success rate and efficiency of webpage
capture. Correspondingly, the fifth embodiment of the application provides a device for crawling
webpages; its structural block diagram is shown in Fig. 6.
The device of this embodiment includes a command receiving unit u501, a capture unit u502 and a
webpage returning unit u503.
The command receiving unit u501 is configured to receive a command to capture a specified webpage.
After receiving the command, this unit sends a start signal to the capture unit u502, instructing
the capture unit u502 to operate.
The capture unit u502 is configured to capture the source code of the specified webpage according
to the URL contained in the command.
After receiving the start signal sent by the command receiving unit u501, this unit captures the
source code of the specified webpage according to the URL contained in the command, and then
sends a start signal to the webpage returning unit u503, instructing the webpage returning unit
u503 to operate.
The webpage returning unit u503 is configured to return the obtained source code to the party
that issued the command to capture the specified webpage.
After receiving the start signal sent by the capture unit u502, this unit returns the obtained
source code to the party that issued the command.
The sixth embodiment of the application provides a method for crawling webpages. The method is
used with a system for crawling webpages that contains a main control node and webpage capture
nodes, where the main control node manages the webpage capture nodes. Its flow chart is shown in
Fig. 7.
Step s601: send a request to obtain a webpage capture node to the main control node.
The request to obtain a webpage capture node is sent to the main control node in the system for
crawling webpages.
Step s602: receive the address information of the webpage capture node returned by the main
control node.
The address information of the webpage capture node returned by the main control node is received.
Step s603: according to the address information of the webpage capture node, send a request to
capture a webpage to that webpage capture node; the request includes at least the URL of the
specified webpage.
According to the address information returned by the main control node, a request to capture a
webpage is sent to the corresponding webpage capture node. The request includes the URL of the
specified webpage.
Step s604: receive the source code of the specified webpage captured by the webpage capture node.
The source code of the specified webpage captured by the webpage capture node is received.
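Seen from the requesting party, steps s601 to s604 reduce to two exchanges: one with the main
control node and one with the allocated webpage capture node. The sketch below models both
exchanges as caller-supplied functions; `crawl_via_master`, `request_node`,
`send_crawl_request` and the dictionary-shaped request are hypothetical and stand in for
whatever transport and message format an implementation would use.

```python
def crawl_via_master(url, request_node, send_crawl_request):
    """Requesting-party flow of steps s601 to s604 (illustrative sketch).

    request_node       : function () -> node address information (s601 and s602)
    send_crawl_request : function (node_address, request) -> page source
                         (s603 and s604)
    """
    node_address = request_node()        # ask the main control node for a capture node
    request = {"url": url}               # the request carries at least the page URL
    return send_crawl_request(node_address, request)
```

Keeping the main control node out of the data path means it only hands out addresses; the page source travels directly between the capture node and the requesting party.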
The embodiment above describes a method for crawling webpages used with the system for crawling
webpages. Correspondingly, the seventh embodiment of the application provides a device for
crawling webpages. The device is used with a system for crawling webpages that contains a main
control node and webpage capture nodes, where the main control node manages the webpage capture
nodes. Its structural block diagram is shown in Fig. 8.
The device of this embodiment includes a webpage capture node acquisition request unit u701, a
webpage capture node address information acquiring unit u702, a capture request sending unit
u703 and a source code receiving unit u704.
The webpage capture node acquisition request unit u701 is configured to send a request to obtain
a webpage capture node to the main control node.
After sending the request to the main control node, this unit sends a start signal to the webpage
capture node address information acquiring unit u702, instructing that unit to operate.
The webpage capture node address information acquiring unit u702 is configured to receive the
address information of the webpage capture node returned by the main control node.
After receiving the start signal sent by the webpage capture node acquisition request unit u701,
this unit receives the address information of the webpage capture node returned by the main
control node. It then sends a start signal to the capture request sending unit u703, instructing
the capture request sending unit u703 to operate.
The capture request sending unit u703 is configured to receive the address information of the
webpage capture node sent by the webpage capture node address information acquiring unit and,
according to that address information, to send a request to capture a webpage to the webpage
capture node; the request includes at least the URL of the specified webpage.
This unit receives the start signal and the address information of the webpage capture node sent
by the webpage capture node address information acquiring unit u702 and, according to that
address information, sends a request to capture a webpage to the webpage capture node; the
request includes at least the URL of the specified webpage. It then sends a start signal to the
source code receiving unit u704, instructing the source code receiving unit u704 to operate.
The source code receiving unit u704 is configured to receive the source code of the specified
webpage captured by the webpage capture node.
After receiving the start signal sent by the capture request sending unit u703, this unit
receives the source code of the specified webpage captured by the webpage capture node.
The eighth embodiment of the application provides an electronic device that includes an input
device, an output device, a processor and a memory. The memory stores a software program which,
when started, allocates webpage capture nodes according to the following method:
Receive a request to obtain a webpage capture node;
According to a set rule, select and allocate a corresponding webpage capture node for each
request from the list of available webpage capture nodes that is managed and maintained;
Return the address information of the selected webpage capture node to the requesting party.
The ninth embodiment of the application provides an electronic device that includes an input
device, an output device, a processor and a memory. The memory stores a software program which,
when started, captures webpages according to the following method:
Receive a command to capture a specified webpage;
Capture the source code of the specified webpage according to the URL contained in the command;
Return the obtained source code to the party that issued the command.
Although the application is disclosed above by way of preferred embodiments, these are not
intended to limit it. Any person skilled in the art can make possible changes and modifications
without departing from the spirit and scope of the application, so the scope of protection of
the application shall be that defined by its claims.
In a typical configuration, a computer includes one or more processors (CPUs), input/output
interfaces, network interfaces and memory.
Memory may include volatile memory in a computer-readable medium, in forms such as random access
memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash
RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable
media, which can store information by any method or technology. The information can be
computer-readable instructions, data structures, program modules or other data. Examples of
computer storage media include, but are not limited to, phase-change memory (PRAM), static random
access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory
(RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM),
flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital
versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage
or other magnetic storage devices, or any other non-transmission medium that can store
information accessible by a computing device. As defined herein, computer-readable media do not
include transitory computer-readable media (transitory media), such as modulated data signals
and carrier waves.
2. Those skilled in the art will understand that embodiments of the application can be provided
as a method, a system or a computer program product. The application can therefore take the form
of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining
software and hardware. Moreover, the application can take the form of a computer program product
implemented on one or more computer-usable storage media (including but not limited to disk
storage, CD-ROM and optical storage) that contain computer-usable program code.
Claims (24)
1. A system for crawling webpages, comprising at least one main control node, at least one
webpage capture node and a communication network, the main control node and the webpage capture
node being connected through the communication network, characterized in that:
the main control node receives requests to obtain a webpage capture node, selects and allocates,
according to a set rule, a webpage capture node for each such request from the list of available
webpage capture nodes that it manages and maintains, and outputs information about the selected
webpage capture node to the party requesting a webpage capture node;
the webpage capture node receives a request to capture a webpage, performs the webpage capture
operation, and outputs the captured webpage.
2. The system for crawling webpages according to claim 1, characterized by further comprising a
cache device for receiving and storing the web page source code captured by the webpage capture
node, for access by each webpage capture node; the cache device also stores the time of the most
recent access to each network host from the network address at which each capture node is located.
3. The system for crawling webpages according to claim 2, characterized in that the web page
source code stored in the cache device is stored in correspondence with its URL.
4. The system for crawling webpages according to claim 3, characterized in that the web page
source code stored in the cache device is deleted after a set duration threshold is exceeded.
5. The system for crawling webpages according to claim 1, characterized in that the webpage
capture node sends its own identity information to the main control node at set time intervals;
after receiving the identity information sent by the webpage capture node, the main control node
judges, according to the identity information of the webpage capture node contained therein,
whether the webpage capture node already exists in the list of available webpage capture nodes,
and if not, records the webpage capture node in the list of available webpage capture nodes.
6. The system for crawling webpages according to claim 5, characterized in that the main control
node sends detection information at set time intervals to all webpage capture nodes in the list
of available webpage capture nodes and receives the responses from those webpage capture nodes;
for a webpage capture node that does not respond, the main control node deletes the record of
that webpage capture node from the list of available webpage capture nodes that it manages and
maintains.
7. The system for crawling webpages according to claim 5, characterized in that the identity
information includes the network address at which the webpage capture node is located and its
process port number.
8. A method for allocating webpage capture nodes, characterized in that the main control node in
a system for crawling webpages that includes at least one main control node and at least one
webpage capture node performs the following steps:
receiving a request to obtain a webpage capture node;
according to a set rule, selecting and allocating a corresponding webpage capture node for each
request from the list of available webpage capture nodes that is managed and maintained;
returning the address information of the selected webpage capture node to the requesting party.
9. The method for allocating webpage capture nodes according to claim 8, characterized in that
the main control node receives the identity information that a webpage capture node sends about
itself at set time intervals; according to the identity information, the main control node judges
whether the webpage capture node already exists in the list of available webpage capture nodes
that the main control node manages and maintains, and if not, records the webpage capture node
in the list of available webpage capture nodes.
10. The method for allocating webpage capture nodes according to claim 8, characterized in that
the main control node sends detection information at set time intervals to all webpage capture
nodes in the list of available webpage capture nodes and receives the responses from those
webpage capture nodes; for a webpage capture node that does not respond, the main control node
deletes the record of that webpage capture node from the list of available webpage capture nodes
that it manages and maintains.
11. A method for crawling webpages, characterized in that it is applied in a system for crawling
webpages, the system including at least one webpage capture node, the method comprising:
the webpage capture node receiving a command to capture a specified webpage;
capturing the source code of the specified webpage according to the URL contained in the command;
returning the obtained source code to the party that issued the command to capture the specified
webpage.
12. The method for crawling webpages according to claim 11, characterized in that the system for
crawling webpages further includes a cache device, and the step of capturing the source code of
the specified webpage according to the URL contained in the command comprises:
accessing the cache device according to the URL, and judging whether the cache device stores the
source code of the webpage identified by the URL;
if so, reading the source code stored in the cache device as the captured source code of the
specified webpage;
if not, accessing the webpage over the network according to the URL and obtaining the source
code of the webpage.
13. The method for crawling webpages according to claim 12, characterized in that, after the step
of accessing the webpage over the network according to the URL and obtaining the source code of
the webpage, the following step is performed:
extracting the host identifier in the URL, the network address at which the current webpage
capture node is located, and the time at which the web page source code was accessed, forming
them into a crawl feature record, and recording it in the cache device; by storing the crawl
feature records, the cache device stores the time of the most recent access to each network host
from the network address at which each capture node is located.
14. The method for crawling webpages according to claim 13, characterized in that, before the
step of accessing the webpage over the network according to the URL and obtaining the source code
of the webpage, the following steps are performed:
using the crawl feature records in the cache device, querying the time of the most recent access
to the host in the requested URL from the network address at which this webpage capture node is
located;
judging whether the difference between the current time and the most recent access time is
greater than a set access interval threshold;
if so, performing the step of accessing the webpage over the network according to the URL and
obtaining the source code of the webpage;
if not, waiting for a set time and then returning to the step of judging whether the difference
between the current time and the most recent access time is greater than the set access interval
threshold.
15. The method for crawling webpages according to claim 13 or 14, characterized in that the crawl
feature record includes a queryable key and a queried value; the queryable key consists of the
host identifier in the URL and the network address at which the current webpage capture node is
located; the queried value is the time at which the host was accessed; and querying, using the
crawl feature records in the cache device, the time of the most recent access by the current
webpage capture node to the host of the requested URL means forming a query key from the network
address at which the current webpage capture node is located and the host identifier in the URL
requested to be accessed and, on the basis of this query key, looking up the queried value in
the crawl feature records.
16. The method for crawling webpages according to claim 13, characterized in that the host
identifier includes the Internet Protocol address or the domain name of the host.
17. The method for crawling webpages according to claim 11, characterized in that the system for
crawling webpages further includes a main control node, and the webpage capture node sends its
own identity information to the main control node at set time intervals.
18. The method for crawling webpages according to claim 11, characterized in that the system for
crawling webpages further includes a main control node, and the webpage capture node receives
the detection information sent by the main control node and responds to the detection information.
19. A device for allocating webpage capture nodes, comprising:
a receiving unit, configured to receive a request to obtain a webpage capture node;
an allocation unit, configured to select and allocate, according to a set rule, a corresponding
webpage capture node for each request from the list of available webpage capture nodes that is
managed and maintained;
a returning unit, configured to return the address information of the selected webpage capture
node to the requesting party.
20. A device for crawling webpages, comprising:
a command receiving unit, configured to receive a command to capture a specified webpage;
a capture unit, configured to capture the source code of the specified webpage according to the
URL contained in the command;
a webpage returning unit, configured to return the obtained source code to the party that issued
the command to capture the specified webpage.
21. A method for crawling webpages, used with a system for crawling webpages that contains a main
control node and webpage capture nodes, wherein the main control node manages the webpage capture
nodes, characterized by comprising the following steps:
sending a request to obtain a webpage capture node to the main control node;
receiving the address information of the webpage capture node returned by the main control node;
according to the address information of the webpage capture node, sending a request to capture a
webpage to the webpage capture node, the request including at least the URL of the specified
webpage;
receiving the source code of the specified webpage captured by the webpage capture node.
22. A device for crawling webpages, used with a system for crawling webpages that contains a main
control node and webpage capture nodes, wherein the main control node manages the webpage capture
nodes, characterized by comprising:
a webpage capture node acquisition request unit, configured to send a request to obtain a webpage
capture node to the main control node;
a webpage capture node address information acquiring unit, configured to receive the address
information of the webpage capture node returned by the main control node;
a capture request sending unit, configured to receive the address information of the webpage
capture node sent by the webpage capture node address information acquiring unit and, according
to that address information, to send a request to capture a webpage to the webpage capture node,
the request including at least the URL of the specified webpage;
a source code receiving unit, configured to receive the source code of the specified webpage
captured by the webpage capture node using the method of claims 9-14.
23. An electronic device, characterized in that the electronic device includes an input device,
an output device, a processor and a memory, the memory storing a software program which, when
started, allocates webpage capture nodes according to the following method:
receiving a request to obtain a webpage capture node;
according to a set rule, selecting and allocating a corresponding webpage capture node for each
request from the list of available webpage capture nodes that is managed and maintained;
returning the address information of the selected webpage capture node to the requesting party.
24. An electronic device, characterized in that the electronic device includes an input device,
an output device, a processor and a memory, the memory storing a software program which, when
started, captures webpages according to the following method:
receiving a command to capture a specified webpage;
capturing the source code of the specified webpage according to the URL contained in the command;
returning the obtained source code to the party that issued the command to capture the specified
webpage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510397674.4A CN106339385B (en) | 2015-07-08 | 2015-07-08 | System for capturing webpage, method for distributing webpage capturing nodes and method for capturing webpage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106339385A true CN106339385A (en) | 2017-01-18 |
CN106339385B CN106339385B (en) | 2020-06-16 |
Family
ID=57827049
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510397674.4A Active CN106339385B (en) | 2015-07-08 | 2015-07-08 | System for capturing webpage, method for distributing webpage capturing nodes and method for capturing webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106339385B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107423382A (en) * | 2017-07-13 | 2017-12-01 | 中国物品编码中心 | network crawling method and device |
CN110442770A (en) * | 2019-08-08 | 2019-11-12 | 深圳市今天国际物流技术股份有限公司 | A kind of data grabber and store method, device, computer equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102902669A (en) * | 2011-07-22 | 2013-01-30 | 同程网络科技股份有限公司 | Distribution information capturing method based on internet system |
CN103559219A (en) * | 2013-10-18 | 2014-02-05 | 北京京东尚科信息技术有限公司 | Distributed web crawler capture task dispatching method, dispatching-side device and capture nodes |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
Also Published As
Publication number | Publication date |
---|---|
CN106339385B (en) | 2020-06-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10785322B2 (en) | Server side data cache system | |
CN103970788A (en) | Webpage-crawling-based crawler technology | |
CN102902805B (en) | A kind of page access method and apparatus | |
CN106294352B (en) | A kind of document handling method, device and file system | |
CN105243159A (en) | Visual script editor-based distributed web crawler system | |
CN101482882A (en) | Method and system for cross-domain treatment of COOKIE | |
CN102752288A (en) | Method and device for identifying network access action | |
CN103559300B (en) | The querying method and inquiry unit of data | |
CN106933871A (en) | Short linking processing method, device and short linked server | |
CN105743988B (en) | Network user's tracing implementing method, apparatus and system | |
CN110471949A (en) | Data consanguinity analysis method, apparatus, system, server and storage medium | |
US20240061893A1 (en) | Method, device and computer program for collecting data from multi-domain | |
CN110266661A (en) | A kind of authorization method, device and equipment | |
CN106897336A (en) | Web page files sending method, webpage rendering intent and device, webpage rendering system | |
CN106294826A (en) | A kind of company-data Query method in real time and system | |
CN106302595A (en) | A kind of method and apparatus that server is carried out physical examination | |
KR20180074774A (en) | How to identify malicious websites, devices and computer storage media | |
CN108429785A (en) | A kind of generation method, reptile recognition methods and the device of reptile identification encryption string | |
CN106817388A (en) | The system that virtual machine, host obtain the method, device and access data of data | |
CN104423982A (en) | Request processing method and device | |
CN107580052A (en) | From the network self-adapting reptile method and system of evolution | |
CN110365810A (en) | Domain name caching method, device, equipment and storage medium based on web crawlers | |
CN105791370B (en) | A kind of data processing method and associated server | |
CN106326280A (en) | Data processing method, apparatus and system | |
CN110532455A (en) | A kind of Web page picture acquisition methods and system based on Chrome browser |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20240226; Address after: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore; Patentee after: Alibaba Singapore Holdings Ltd.; Country or region after: Singapore; Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands; Patentee before: ALIBABA GROUP HOLDING Ltd.; Country or region before: Cayman Islands |