CN107423382A - network crawling method and device - Google Patents

Info

Publication number
CN107423382A
Authority
CN
China
Prior art keywords
child node
link
task
station address
subtask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710571635.0A
Other languages
Chinese (zh)
Inventor
罗秋科
林强
张楠
李健华
贾建华
杜景荣
于颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARTICLE NUMBERING CENTER OF CHINA
Original Assignee
ARTICLE NUMBERING CENTER OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ARTICLE NUMBERING CENTER OF CHINA
Priority to CN201710571635.0A
Publication of CN107423382A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a network crawling method and device. The network crawling method includes: a child node receives a subtask sent by a host node, where the subtask includes the task type of a crawling task and the website addresses in the search group corresponding to the child node, the search group includes at least one website address, and the search group is obtained by the host node dividing the at least one website address according to the distributed programming framework map-reduce and the task type of the crawling task; the child node crawls according to the subtask and stores the crawled data in local storage; the child node queries the local storage, obtains a query result, and sends the query result to the host node. The present invention enables the crawling of large amounts of web page data.

Description

Network crawling method and device
Technical field
The present invention relates to communication technology, and in particular to a network crawling method and device.
Background
With the continuous enrichment of Internet resources, more and more platforms need large amounts of supporting data to perform their functions. Common channels for obtaining data resources include: obtaining data after logging in through a hosting platform, connecting directly to the databases of other systems, and exchanging data through data interfaces. However, these channels are to some extent unable to obtain the relevant data or are costly. Therefore, network crawling technology is currently used to crawl the data on web pages, so that a platform can find web pages and related data.
Because the curl (CommandLine Uniform Resource Locator) function supports browser behaviors such as GET and POST, it can simulate browser operation. Existing network crawling methods therefore usually use the curl functions in the RCurl package to crawl web page data and thereby obtain the data on web pages. However, an existing network crawling method that uses only curl functions cannot complete crawling tasks with a large data volume. A network crawling method capable of crawling large amounts of data is therefore urgently needed.
Summary of the invention
The present invention provides a network crawling method and device to solve the problem that existing network crawling methods cannot complete crawling tasks with a large data volume.
In a first aspect, the present invention provides a network crawling method applied to a network crawling system, the network crawling system including one host node and multiple child nodes. For any child node, the method includes:
The child node receives the subtask sent by the host node, where the subtask includes the task type of the crawling task and the website addresses in the search group corresponding to the child node, the search group includes at least one website address, and the search group is obtained by the host node dividing the at least one website address according to the distributed programming framework map-reduce and the task type of the crawling task;
The child node crawls according to the subtask and stores the crawled data in local storage;
The child node queries the local storage, obtains a query result, and sends the query result to the host node.
Optionally, the child node crawling according to the subtask and storing the crawled data in local storage includes:
The child node traverses and connects to the website addresses in the subtask, obtaining a first website address for which the connection succeeded and a second website address for which the connection failed;
The child node obtains the links corresponding to the web data pages to be crawled in the first website address;
The child node traverses and connects to the link corresponding to each web data page to be crawled in the first website address, obtaining a first link for which the connection succeeded and a second link for which the connection failed;
The child node filters each piece of web page data corresponding to the first link according to the task type of the crawling task, obtaining the web page data corresponding to the first link;
The child node parses the web page data corresponding to the first link, obtaining target crawled data;
The child node stores the target crawled data and the corresponding first link in the local storage.
Optionally, the method further includes:
The child node reconnects to the second link and judges whether the connection to the second link succeeds;
If so, the child node filters each piece of web page data corresponding to the second link according to the task type of the crawling task, obtaining the web page data corresponding to the second link, parses that web page data to obtain the target crawled data, and stores the target crawled data and the corresponding second link in the local storage;
If not, the child node repeats the operation of connecting to the second link and judging whether the connection succeeds; if the number of repeated connections exceeds a first preset number, the child node stores the second link in the local storage.
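The bounded reconnection rule above can be illustrated with a minimal sketch. The names `retry_link`, `connect`, `max_retries` (standing in for the first preset number) and `failed_links` (standing in for local storage) are hypothetical and are not taken from the patent, and the exact comparison against the preset number is an assumption.

```python
from typing import Callable, List, Optional


def retry_link(link: str,
               connect: Callable[[str], Optional[str]],
               max_retries: int,
               failed_links: List[str]) -> Optional[str]:
    """Reconnect to a previously failed link a bounded number of times.

    `connect` is a hypothetical callable that returns the page text on
    success and None on failure. If every attempt fails, the link itself
    is recorded in `failed_links`, which stands in for local storage.
    """
    for _ in range(max_retries):
        page = connect(link)
        if page is not None:
            return page  # the caller would go on to filter, parse and store
    failed_links.append(link)  # retries exhausted: keep the link for later
    return None
```

The same shape would apply to the second website address and the fourth link, each with its own preset number.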
Optionally, the method further includes:
The child node reconnects to the second website address and judges whether the connection to the second website address succeeds;
If so, the child node obtains the links corresponding to the web data pages to be crawled in the second website address;
The child node traverses and connects to the link corresponding to each web data page to be crawled in the second website address, obtaining a third link for which the connection succeeded and a fourth link for which the connection failed;
The child node filters each piece of web page data corresponding to the third link according to the task type of the crawling task, obtaining the web page data corresponding to the third link;
The child node parses the web page data corresponding to the third link, obtaining the target crawled data;
The child node stores the target crawled data and the corresponding third link in the local storage;
If not, the child node repeats the operation of connecting to the second website address and judging whether the connection succeeds; if the number of repeated connections exceeds a second preset number, the child node stores the second website address in the local storage.
Optionally, the method further includes:
The child node reconnects to the fourth link and judges whether the connection to the fourth link succeeds;
If so, the child node filters each piece of web page data corresponding to the fourth link according to the task type of the crawling task, obtaining the web page data corresponding to the fourth link, parses that web page data to obtain the target crawled data, and stores the target crawled data and the corresponding fourth link in the local storage;
If not, the child node repeats the operation of connecting to the fourth link and judging whether the connection succeeds; if the number of repeated connections exceeds a third preset number, the child node stores the fourth link in the local storage.
Optionally, the subtask further includes a state indicator bit, which indicates whether the child node is to execute the subtask.
In a second aspect, the present invention provides a network crawling method applied to a network crawling system, the network crawling system including one host node and multiple child nodes. The method includes:
The host node obtains the query request input by a user and obtains the task type of the crawling task according to the query request, where the query request corresponds to at least one website address;
The host node divides the at least one website address according to map-reduce and the task type of the crawling task, obtaining at least one search group, where each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node;
The host node sends to each child node its corresponding subtask, where the subtask includes the task type of the crawling task and the website addresses in the corresponding search group;
The host node receives the query results sent by the child nodes and aggregates the multiple query results to obtain a target query result.
Optionally, the subtask further includes a state indicator bit, which indicates whether the child node is to execute the subtask.
In a third aspect, the present invention provides a network crawling device applied to a network crawling system, the network crawling system including one host node and multiple child nodes. The device includes:
A receiving module, configured to receive the subtask sent by the host node, where the subtask includes the task type of the crawling task and the website addresses in the search group corresponding to the child node, the search group includes at least one website address, and the search group is obtained by the host node dividing the at least one website address according to map-reduce and the task type of the crawling task;
A crawling module, configured to crawl according to the subtask and store the crawled data in local storage;
A query module, configured to query the local storage and obtain a query result;
A sending module, configured to send the query result to the host node.
Optionally, the crawling module is specifically configured to: traverse and connect to the website addresses in the subtask, obtaining a first website address for which the connection succeeded and a second website address for which the connection failed;
Obtain the links corresponding to the web data pages to be crawled in the first website address;
Traverse and connect to the link corresponding to each web data page to be crawled in the first website address, obtaining a first link for which the connection succeeded and a second link for which the connection failed;
Filter each piece of web page data corresponding to the first link according to the task type of the crawling task, obtaining the web page data corresponding to the first link;
Parse the web page data corresponding to the first link, obtaining target crawled data;
Store the target crawled data and the corresponding first link in the local storage.
Optionally, the crawling module is specifically configured to: reconnect to the second link and judge whether the connection between the child node and the second link succeeds;
If so, the child node filters each piece of web page data corresponding to the second link according to the task type of the crawling task, obtaining the web page data corresponding to the second link, parses that web page data to obtain the target crawled data, and stores the target crawled data and the corresponding second link in the local storage;
If not, the child node repeats the operation of connecting to the second link and judging whether the connection succeeds; if the number of repeated connections exceeds the first preset number, the child node stores the second link in the local storage.
Optionally, the crawling module is specifically configured to: reconnect to the second website address and judge whether the connection between the child node and the second website address succeeds;
If so, the child node obtains the links corresponding to the web data pages to be crawled in the second website address;
Traverse and connect to the link corresponding to each web data page to be crawled in the second website address, obtaining a third link for which the connection succeeded and a fourth link for which the connection failed;
Filter each piece of web page data corresponding to the third link according to the task type of the crawling task, obtaining the web page data corresponding to the third link;
Parse the web page data corresponding to the third link, obtaining the target crawled data;
Store the target crawled data and the corresponding third link in the local storage;
If not, the child node repeats the operation of connecting to the second website address and judging whether the connection succeeds; if the number of repeated connections exceeds the second preset number, the child node stores the second website address in the local storage.
Optionally, the crawling module is specifically configured to: reconnect to the fourth link and judge whether the connection between the child node and the fourth link succeeds;
If so, the child node filters each piece of web page data corresponding to the fourth link according to the task type of the crawling task, obtaining the web page data corresponding to the fourth link, parses that web page data to obtain the target crawled data, and stores the target crawled data and the corresponding fourth link in the local storage;
If not, the child node repeats the operation of connecting to the fourth link and judging whether the connection succeeds; if the number of repeated connections exceeds the third preset number, the child node stores the fourth link in the local storage.
In a fourth aspect, the present invention provides a network crawling device applied to a network crawling system, the network crawling system including one host node and multiple child nodes. The device includes:
A receiving module, configured to obtain the query request input by a user and obtain the task type of the crawling task according to the query request, where the query request corresponds to at least one website address;
A division module, configured to divide the at least one website address according to map-reduce and the task type of the crawling task, obtaining at least one search group, where each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node;
A sending module, configured to send to each child node its corresponding subtask, where the subtask includes the task type of the crawling task and the website addresses in the corresponding search group;
The receiving module is further configured to receive the query results sent by the child nodes and aggregate the multiple query results to obtain a target query result.
In the network crawling method and device provided by the present invention, the host node obtains the task type of the crawling task and the website addresses according to the user's query request. The host node then divides the website addresses according to map-reduce and the task type of the crawling task, forms a subtask for each child node, and sends each subtask to its corresponding child node. Each child node crawls according to its subtask, stores the crawled data in local storage, queries the local storage to obtain a query result, and sends the query result to the host node. The host node receives the query results sent by the child nodes and obtains the target query result. By using multiple child nodes, the present invention can crawl large amounts of web page data, which not only lets users obtain the information they need quickly and comprehensively but also meets users' varied crawling needs.
Brief description of the drawings
Fig. 1 is a schematic diagram of a scenario of the network crawling method provided by the present invention;
Fig. 2 is a signaling flow chart of the network crawling method provided by the present invention;
Fig. 3 is flow chart one of the network crawling method provided by the present invention;
Fig. 4 is flow chart two of the network crawling method provided by the present invention;
Fig. 5 is structural schematic diagram one of the network crawling device provided by the present invention;
Fig. 6 is structural schematic diagram two of the network crawling device provided by the present invention.
Embodiment
Fig. 1 is a schematic diagram of a scenario of the network crawling method provided by the present invention. As shown in Fig. 1, the network crawling system of the present invention includes one host node and multiple child nodes. The host node may be one server, and the multiple child nodes may be multiple servers. The system can be applied to scenarios of public data acquisition. For example, the system can make information such as an article's production raw materials, production information, quality tracing, and sales information transparent across the article's production cycle, so that users can understand and accurately grasp it and carry out related work. As another example, the system is applicable to the enrollment information of schools, paper publication information, and so on. The system is applicable to all trades and professions: information can be obtained by large-scale searching rather than face to face, so that users can obtain the public data they need while saving time and cost.
With reference to the system shown in Fig. 1, the technical solution of the network crawling method provided by the present invention is described in detail below. Fig. 2 is a signaling flow chart of the network crawling method provided by the present invention. In this embodiment, the host node can send the subtasks corresponding to a query request to multiple child nodes, so that each child node queries whether target crawled data is stored in its own local storage. Each child node then sends its query result to the host node, and the host node aggregates the multiple query results to obtain the target query result, that is, the public data the user needs. As shown in Fig. 2, this embodiment describes in detail only the crawling process between the host node and any one child node; the process between the host node and the remaining child nodes is the same and is not repeated here. The network crawling method of this embodiment includes:
S101, the host node obtains the query request input by a user and obtains the task type of the crawling task according to the query request, where the query request corresponds to at least one website address.
Specifically, in this embodiment the host node can analyze the query request input by the user to obtain not only the task type of the crawling task but also the websites on which the user wants to query information. For example, the task type of the crawling task may be a query request for article raw materials, a query request for article quality information, a query request for school enrollment information, and so on. This embodiment does not limit the specific task types, provided that the host node can obtain the task type of the crawling task from the query request. Moreover, the website addresses corresponding to the query request may, based on user experience, be the addresses of various search engines, school websites, specialized department websites, and the like. This embodiment does not limit the number or kind of website addresses, provided that the host node can obtain the website addresses from the query request.
S102, the host node divides the at least one website address according to the distributed programming framework map-reduce and the task type of the crawling task, obtaining at least one search group, where each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node.
Specifically, in this embodiment the host node can determine, based on map-reduce and the task type of the crawling task, the number of website addresses in the user's query request, and then divide the website addresses according to the number of website addresses each child node can process, so that crawling is fast and efficient within each child node's capacity. The number of website addresses in the search group corresponding to each child node is set according to that child node's capacity; the groups may hold the same or different numbers of addresses, which this embodiment does not limit.
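The capacity-based division described in S102 can be illustrated with a minimal sketch. It is an assumption-laden illustration rather than the patented procedure: the function name `divide_into_search_groups`, the `capacities` list (the largest number of addresses each child node can handle), and the simple in-order filling strategy are all hypothetical.

```python
from typing import List


def divide_into_search_groups(addresses: List[str],
                              capacities: List[int]) -> List[List[str]]:
    """Split website addresses into at most one search group per child node.

    capacities[i] is the assumed maximum number of addresses that child
    node i can process. Groups are filled in order, so they may differ in
    size, and a node that receives no address gets no group.
    """
    if sum(capacities) < len(addresses):
        raise ValueError("child nodes cannot hold all website addresses")
    groups: List[List[str]] = []
    start = 0
    for cap in capacities:
        group = addresses[start:start + cap]
        if group:  # every search group holds at least one address
            groups.append(group)
        start += cap
    return groups
```

A balanced or task-type-aware split would be an equally valid reading of the text; in-order filling is used here only because it is the shortest to show.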
Further, the host node can also determine, based on map-reduce and the task type of the crawling task, the crawling criteria, crawling time, crawling order, and so on for each child node, so that each child node executes its subtask according to a specific execution strategy. For example, if a user needs to query the quality information of an article, the host node can determine, based on map-reduce and the task type of the crawling task, that each child node executes its subtask according to a strategy of searching by article barcode.
Further, the host node can also control the working state of each child node; this embodiment does not limit how. Optionally, the subtask further includes a state indicator bit, which indicates whether the child node is to execute the subtask. Specifically, through the state indicator bit the host node can observe the current working state of each child node in real time, and can therefore dynamically stop or start each child node's subtask. If the host node needs to assign a subtask to a child node, it can use the state indicator bit to instruct that child node to execute the subtask; if the host node needs a child node to stop executing its subtask, it can use the state indicator bit to instruct that child node to stop.
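One way to picture the state indicator bit is as a flag carried in the subtask that the child node checks before each unit of work. The sketch below assumes a hypothetical `Subtask` record with a boolean `run` field and a hypothetical `crawl` callable; the patent does not specify the encoding of the bit or when it is checked.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Subtask:
    task_type: str
    addresses: List[str] = field(default_factory=list)
    run: bool = True  # state indicator bit: True = execute, False = stop


def run_subtask(subtask: Subtask, crawl: Callable[[str], None]) -> List[str]:
    """Crawl addresses in order, stopping as soon as the bit is cleared."""
    done: List[str] = []
    for address in subtask.addresses:
        if not subtask.run:  # the host node has asked this child node to stop
            break
        crawl(address)
        done.append(address)
    return done
```

In a real deployment the flag would be updated by a message from the host node rather than by shared memory; the in-process field is only a stand-in.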
S103, the host node sends the subtask to the child node, where the subtask includes the task type of the crawling task and the website addresses in the search group.
S104, the child node crawls according to the subtask and stores the crawled data in the child node's local storage.
Specifically, because child nodes and subtasks are in one-to-one correspondence, and subtasks and search groups are in one-to-one correspondence, the host node can send each child node its corresponding subtask, and each child node can receive its corresponding subtask. Because the subtask includes the task type of the crawling task and the website addresses, each child node can crawl the web page data at the website addresses according to a specific execution strategy, thereby obtaining crawled data. Each child node can also store the crawled data it obtains in its own local storage. In this embodiment, before executing its subtask, each child node may clear its local storage. This embodiment does not limit the specific form in which crawled data is stored in local storage, provided that each child node can query its own local storage.
S105, the child node queries the local storage according to the subtask, obtaining a query result.
S106, the child node sends the query result to the host node.
S107, the host node aggregates the query results sent by the child nodes, obtaining the target query result.
Specifically, because the crawled data to be queried is stored in the child node's local storage, the child node can query that local storage according to the subtask to obtain a query result and then send the query result to the host node. The host node aggregates the query results received from the child nodes to obtain the target query result, so as to provide the user with timely and accurate information.
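Steps S105 to S107 can be sketched as a per-node query followed by a merge on the host node. The record shape (`link` and `data` keys), the keyword match, and the removal of exact duplicates during aggregation are assumptions made for illustration; the patent only states that the results are aggregated.

```python
from typing import Dict, List

Record = Dict[str, str]  # hypothetical shape: {"link": ..., "data": ...}


def query_local(store: List[Record], keyword: str) -> List[Record]:
    """Child-node side: return stored records whose data mentions keyword."""
    return [r for r in store if keyword in r["data"]]


def aggregate(per_node_results: List[List[Record]]) -> List[Record]:
    """Host-node side: merge results, dropping exact duplicates in order."""
    merged: List[Record] = []
    seen = set()
    for results in per_node_results:
        for r in results:
            key = (r["link"], r["data"])
            if key not in seen:
                seen.add(key)
                merged.append(r)
    return merged
```

This mirrors the map and reduce roles loosely: each child node maps its local store to matching records, and the host node reduces the lists into one.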
Further, if the user later issues a query request of the same type as the current one, the child node can query its local storage directly, without repeating the cumbersome crawl of the web page data at the website addresses, which saves crawling time and speeds up the query.
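The reuse of local storage for repeated requests of the same type behaves like a cache keyed by task type. The sketch below is one possible reading; the class name `ChildNodeCache`, the `crawl` callable, and the choice of the task type as the cache key are hypothetical.

```python
from typing import Callable, Dict, List


class ChildNodeCache:
    """Keeps crawled data per task type so repeated requests skip the crawl."""

    def __init__(self, crawl: Callable[[str], str]) -> None:
        self._crawl = crawl  # hypothetical function: address -> crawled data
        self._store: Dict[str, List[str]] = {}  # stands in for local storage

    def run(self, task_type: str, addresses: List[str]) -> List[str]:
        if task_type not in self._store:
            # first request of this type: crawl and remember the results
            self._store[task_type] = [self._crawl(a) for a in addresses]
        return self._store[task_type]
```

How long stored data stays valid is not addressed in the text, so no expiry is modelled here.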
In a specific embodiment, a user inputs to the host node a query request for article quality information. The host node receives the query request and then distributes the corresponding subtasks to the child nodes. Because each child node's local storage holds the quality information of all articles, such as quality inspection number, inspection organization, inspection time, inspection personnel, and inspection result, each child node can query its own local storage according to its subtask and obtain a query result. Each child node then sends its query result to the host node, and the host node aggregates these results to obtain the specific quality information of the article. Because the system makes article quality information transparent, the user can grasp it without face-to-face exchanges or extensive query work, which saves the user's cost and improves the user's efficiency.
In the network crawling method provided by this embodiment, the host node obtains the task type of the crawling task and the website addresses according to the user's query request. The host node then divides the website addresses according to map-reduce and the task type of the crawling task, forms a subtask for each child node, and sends each subtask to its corresponding child node. Each child node crawls according to its subtask, stores the crawled data in local storage, queries the local storage to obtain a query result, and sends the query result to the host node. The host node receives the query results sent by the child nodes and obtains the target query result. By using multiple child nodes, this embodiment can crawl large amounts of web page data, which not only lets users obtain the information they need quickly and comprehensively but also meets users' varied crawling needs.
With reference to Fig. 3 and Fig. 4, the specific process by which any child node crawls according to the subtask and stores the crawled data in its local storage is described in detail below. Fig. 3 is flow chart one of the network crawling method provided by the present invention, and Fig. 4 is flow chart two. Because a child node's traversal connection to the website addresses according to the subtask can end in either success or failure, this embodiment describes the successful case with reference to Fig. 3 and the failed case with reference to Fig. 4.
On the one hand, as shown in Fig. 3, the network crawling method of this embodiment further includes:
S201, the child node traverses and connects to the website addresses in the subtask, obtaining the first website address, for which the connection succeeded.
S202, the child node obtains the links corresponding to the web data pages to be crawled in the first website address.
S203, the child node traverses and connects to the link corresponding to each web data page to be crawled in the first website address, obtaining a first link for which the connection succeeded and a second link for which the connection failed.
Specifically, for the first website address, the child node first obtains all the links corresponding to the web page data to be crawled at that address. The child node then connects to the links one by one and, according to the connection results, divides them into first links (connection succeeded) and second links (connection failed).
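The one-by-one connection and split into first and second links can be sketched as below. The function name `partition_links` and the `connect` callable, which returns the page text on success and None on failure, are hypothetical stand-ins for whatever HTTP client the child node uses.

```python
from typing import Callable, Dict, List, Optional, Tuple


def partition_links(links: List[str],
                    connect: Callable[[str], Optional[str]]
                    ) -> Tuple[Dict[str, str], List[str]]:
    """Connect to each link in turn and split the links by outcome.

    Returns (succeeded, failed): `succeeded` maps each first link to its
    page text, and `failed` lists the second links to be retried later.
    """
    succeeded: Dict[str, str] = {}
    failed: List[str] = []
    for link in links:
        page = connect(link)
        if page is None:
            failed.append(link)
        else:
            succeeded[link] = page
    return succeeded, failed
```

The same split applies one level up, where website addresses rather than links are divided into first and second website addresses.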
Further, in this embodiment the child node can handle the first link and the second link with different methods. For the first link, step S204 is performed; for the second link, step S207 is performed. Step S204 may be performed before step S207, step S207 may be performed before step S204, or the two may be performed simultaneously; this embodiment does not limit their execution order.
For a first link to which the connection succeeded, the network crawling method of this embodiment further includes:
S204, the child node filters the web data corresponding to each first link according to the task type of the crawling task, obtaining the web data corresponding to the first link.
S205, the child node parses the web data corresponding to the first link, obtaining target crawl data.
S206, the child node stores the target crawl data and the corresponding first link in the local storage.
Specifically, because the task type of the crawling task determines the specific strategy by which the child node executes the subtask, the child node may filter the web data corresponding to each first link: web data unrelated to the task type of the crawling task is filtered out, and web data that is related to the task type and conforms to the applicable standard or specification is retained as the web data corresponding to the first link. For example, if the child node needs to obtain article barcodes, web data unrelated to article barcodes can be filtered out. Moreover, because an article barcode has 13 digits, web data in which the article barcode has 12 digits is also filtered out.
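The barcode example above can be sketched as a task-type-driven filter. This is a hedged illustration only: the task type name `"barcode"`, the `FILTERS` table and the function `filter_pages` are hypothetical, and a plain 13-digit pattern is assumed to stand in for whatever standard check the actual system applies.

```python
import re
from typing import Callable, Dict

# A run of exactly 13 digits that is not part of a longer digit run.
BARCODE_13 = re.compile(r"(?<!\d)\d{13}(?!\d)")

# Hypothetical mapping from task type to a "keep this page?" predicate.
FILTERS: Dict[str, Callable[[str], bool]] = {
    "barcode": lambda text: BARCODE_13.search(text) is not None,
}


def filter_pages(pages: Dict[str, str], task_type: str) -> Dict[str, str]:
    """Keep only the pages (link -> page text) relevant to the task type."""
    keep = FILTERS[task_type]
    return {link: text for link, text in pages.items() if keep(text)}


pages = {
    "http://example.com/1": "Item 6901234567892 in stock",  # 13 digits: kept
    "http://example.com/2": "Item 690123456789 in stock",   # 12 digits: filtered out
    "http://example.com/3": "About us",                      # unrelated: filtered out
}
kept = filter_pages(pages, "barcode")
```

Keeping one predicate per task type mirrors the statement that the task type determines how the child node executes its subtask.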
Further, the child node in this embodiment parses the web data corresponding to the first link and stores the parsed target crawl data together with the corresponding first link in the local storage. This makes it easy for the child node to determine from which link a given piece of target crawl data was obtained and, once each child node has transmitted its results to the host node, makes it easy for the user to look the data up.
For a second link to which the connection failed, the network crawling method of this embodiment further includes:
S207, the child node reconnects to the second link and judges whether the connection to the second link succeeds. If so, step S208 is performed; if not, step S209 is performed.
S208, the child node filters the web data corresponding to each second link according to the task type of the crawling task to obtain the web data corresponding to the second link, parses that web data to obtain target crawl data, and stores the target crawl data and the corresponding second link in the local storage.
The implementation of S208 is similar to that of S204, S205 and S206 in the Fig. 3 embodiment and is not repeated here.
S209, the child node repeats the operation of connecting to the second link and judging whether the connection to the second link succeeds; if the number of repeated connections exceeds a first preset number of times, the child node stores the second link in the local storage.
Specifically, a connection may fail while the child node is connecting to a link, for reasons such as network or parsing problems. Existing network crawling methods have no fault-tolerance mechanism and therefore cannot reconnect to such a link. In this embodiment, by contrast, the child node continues to connect to a second link to which the connection failed, which provides more data sources for crawling the target web data and allows the user to obtain more comprehensive information.
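The fault-tolerance idea in S207 to S209 amounts to a bounded retry loop. The sketch below makes several assumptions that the text does not fix: the names `reconnect`, `try_connect` and `failed_links` are hypothetical, and the preset number is treated as the maximum number of retry attempts.

```python
from typing import Callable, Dict, List


def reconnect(
    link: str,
    try_connect: Callable[[str], bool],  # hypothetical connector
    max_retries: int,                    # stands in for the first preset number of times
    failed_links: List[str],             # stands in for the local record of failed links
) -> bool:
    """Retry a failed link; record it once the retry budget is exhausted."""
    for _ in range(max_retries):
        if try_connect(link):
            return True           # continue with filtering, parsing and storage (S208)
    failed_links.append(link)     # give up and store the link locally (S209)
    return False


# Stand-in connector that fails twice and then succeeds.
attempts: Dict[str, int] = {"n": 0}


def flaky(url: str) -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3


failed: List[str] = []
recovered = reconnect("http://example.com/x", flaky, 5, failed)
gave_up = reconnect("http://example.com/y", lambda url: False, 2, failed)
```

A real crawler would probably also wait between attempts; the text does not mention a delay, so none is shown.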
It should be noted here that the first preset number of times may be set empirically, and this embodiment does not limit it. In this embodiment, the local storage corresponding to each child node may include a success record table and an error record table. If a website address includes multiple addresses at multiple levels, such as first-level, second-level and third-level addresses, the success record table may store, ordered by level, the links at the different levels of the same website address together with the corresponding target crawl data, and the error record table may likewise list and store, ordered by level, the links at the different levels of the same website address. If there is only one website address, the success record table directly stores the website address and its corresponding target crawl data, and the error record table may store information such as the website address, the error reason and the number of repeated connections to the website address, to facilitate locating errors. The error reason may be a network reason, a parsing reason or the like; this embodiment does not limit it.
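One possible layout for the two record tables is sketched below using an in-memory SQLite database. The table and column names (`success_record`, `error_record`, `site`, `level`, `link`, `data`, `reason`, `retries`) are assumptions made for illustration; the text only states what kind of information each table holds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a child node's local storage
conn.executescript(
    """
    CREATE TABLE success_record (
        site  TEXT NOT NULL,     -- website address
        level INTEGER NOT NULL,  -- 1 = first-level address, 2 = second-level, ...
        link  TEXT NOT NULL,
        data  TEXT NOT NULL      -- target crawl data obtained from the link
    );
    CREATE TABLE error_record (
        site    TEXT NOT NULL,
        level   INTEGER NOT NULL,
        link    TEXT NOT NULL,
        reason  TEXT NOT NULL,   -- e.g. 'network' or 'parsing'
        retries INTEGER NOT NULL -- number of repeated connections
    );
    """
)
conn.execute(
    "INSERT INTO success_record VALUES (?, ?, ?, ?)",
    ("example.com", 2, "http://example.com/c/1", "6901234567892"),
)
conn.execute(
    "INSERT INTO success_record VALUES (?, ?, ?, ?)",
    ("example.com", 1, "http://example.com/", "6900000000001"),
)
conn.execute(
    "INSERT INTO error_record VALUES (?, ?, ?, ?, ?)",
    ("example.com", 2, "http://example.com/c/2", "network", 3),
)

# Read back the successful links of one website address, ordered by level.
rows = conn.execute(
    "SELECT level, link FROM success_record WHERE site = ? ORDER BY level",
    ("example.com",),
).fetchall()
errors = conn.execute("SELECT link, reason, retries FROM error_record").fetchall()
```

Storing the link next to the data is what lets a later query trace each result back to its source link.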
On the other hand, as shown in Fig. 4, the network crawling method of this embodiment further includes:
S301, the child node traverses and connects to the website addresses in the subtask, obtaining a second website address to which the connection failed.
Specifically, a connection may fail while the child node is connecting to a website address, for reasons such as network or parsing problems. Existing network crawling methods have no fault-tolerance mechanism and therefore cannot reconnect to such a website address. In this embodiment, by contrast, the child node continues to connect to a second website address to which the connection failed, which provides more data sources for crawling the target web data and allows the user to obtain more comprehensive information.
S302, the child node reconnects to the second website address and judges whether the connection to the second website address succeeds. If so, step S303 is performed; if not, step S311 is performed.
S303, the child node obtains the links corresponding to the web data pages to be crawled in the second website address.
S304, the child node traverses and connects to the link corresponding to each web data page to be crawled in the second website address, obtaining third links to which the connection succeeded and fourth links to which the connection failed.
Specifically, for a second website address (one to which the initial connection failed), the child node in this embodiment first obtains all links corresponding to the web data to be crawled in the second website address, then connects to these links one by one, and finally divides them, according to the connection results, into third links (connection succeeded) and fourth links (connection failed).
Further, the child node in this embodiment may process the third links and the fourth links using different methods. For a third link, this embodiment may perform step S305; for a fourth link, this embodiment may perform step S308. In this embodiment, step S305 may be performed before step S308, step S308 may be performed before step S305, or steps S305 and S308 may be performed simultaneously; this embodiment does not limit the execution order of steps S305 and S308.
For a third link to which the connection succeeded, the network crawling method of this embodiment further includes:
S305, the child node filters the web data corresponding to each third link according to the task type of the crawling task, obtaining the web data corresponding to the third link.
S306, the child node parses the web data corresponding to the third link, obtaining target crawl data.
S307, the child node stores the target crawl data and the corresponding third link in the local storage.
Specifically, because the task type of the crawling task determines the specific strategy by which the child node executes the subtask, the child node may filter the web data corresponding to each third link: web data unrelated to the task type of the crawling task is filtered out, and web data that is related to the task type and conforms to the applicable standard or specification is retained as the web data corresponding to the third link. For example, if the child node needs to obtain article barcodes, web data unrelated to article barcodes can be filtered out. Moreover, because an article barcode has 13 digits, web data in which the article barcode has 12 digits is also filtered out.
Further, the child node in this embodiment parses the web data corresponding to the third link and stores the parsed target crawl data together with the corresponding third link in the local storage. This makes it easy for the child node to determine from which link a given piece of target crawl data was obtained and, once each child node has transmitted its results to the host node, makes it easy for the user to look the data up.
For a fourth link to which the connection failed, the network crawling method of this embodiment further includes:
S308, the child node reconnects to the fourth link and judges whether the connection to the fourth link succeeds. If so, step S309 is performed; if not, step S310 is performed.
S309, the child node filters the web data corresponding to each fourth link according to the task type of the crawling task to obtain the web data corresponding to the fourth link, parses that web data to obtain target crawl data, and stores the target crawl data and the corresponding fourth link in the local storage.
The implementation of S309 is similar to that of S305, S306 and S307 in the Fig. 4 embodiment and is not repeated here.
S310, the child node repeats the operation of connecting to the fourth link and judging whether the connection to the fourth link succeeds; if the number of repeated connections exceeds a third preset number of times, the child node stores the fourth link in the local storage.
Specifically, a connection may fail while the child node is connecting to a link, for reasons such as network or parsing problems. Existing network crawling methods have no fault-tolerance mechanism and therefore cannot reconnect to such a link. In this embodiment, by contrast, the child node continues to connect to a fourth link to which the connection failed, which provides more data sources for crawling the target web data and allows the user to obtain more comprehensive information.
S311, the child node repeats the operation of connecting to the second website address and judging whether the connection to the second website address succeeds; if the number of repeated connections exceeds a second preset number of times, the child node stores the second website address in the local storage.
Specifically, a connection may still fail while the child node is connecting to a website address, for reasons such as network or parsing problems. Existing network crawling methods have no fault-tolerance mechanism and therefore cannot reconnect to such a website address. In this embodiment, by contrast, the child node continues to connect to a second website address to which the connection failed, which provides more data sources for crawling the target web data and allows the user to obtain more comprehensive information.
It should be noted here that the second preset number of times and the third preset number of times may both be set empirically, and that the first, second and third preset numbers of times may be the same or different; this embodiment does not limit them. In this embodiment, the local storage corresponding to each child node may include a success record table and an error record table. If a website address includes multiple addresses at multiple levels, such as first-level, second-level and third-level addresses, the success record table may list and store, ordered by level, the links at the different levels of the same website address together with the target crawl data, and the error record table may likewise list and store, ordered by level, the links at the different levels of the same website address. If there is only one website address, the success record table directly stores the website address and its corresponding target crawl data, and the error record table may store information such as the website address, the error reason and the number of repeated connections to the website address, to facilitate locating errors. The error reason may be a network reason, a parsing reason or the like; this embodiment does not limit it.
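The Fig. 4 flow combines a website-level retry with a link-level retry. The following sketch shows one way the two levels could nest. It is illustrative only: `crawl_failed_site`, `connect` and `list_links` are hypothetical names, the two retry limits stand in for the second and third preset numbers of times, and the filtering and parsing steps (S305 to S307, S309) are omitted.

```python
from typing import Callable, Dict, List


def crawl_failed_site(
    site: str,
    connect: Callable[[str], bool],          # hypothetical connector for sites and links
    list_links: Callable[[str], List[str]],  # hypothetical link extractor (S303)
    site_retries: int,                       # stands in for the second preset number of times
    link_retries: int,                       # stands in for the third preset number of times
) -> Dict[str, List[str]]:
    outcome: Dict[str, List[str]] = {"crawled": [], "failed_links": [], "failed_sites": []}
    # S302 / S311: retry the website address; any() stops at the first success.
    if not any(connect(site) for _ in range(site_retries)):
        outcome["failed_sites"].append(site)
        return outcome
    for link in list_links(site):
        if connect(link):                                       # S304: third link
            outcome["crawled"].append(link)
        elif any(connect(link) for _ in range(link_retries)):  # S308: fourth link recovered
            outcome["crawled"].append(link)
        else:                                                   # S310: store the fourth link
            outcome["failed_links"].append(link)
    return outcome


# Stand-in connector: each URL fails a fixed number of times before succeeding.
remaining = {
    "http://s.example": 1,
    "http://s.example/a": 0,
    "http://s.example/b": 2,
    "http://s.example/c": 99,
}


def connect(url: str) -> bool:
    if remaining[url] > 0:
        remaining[url] -= 1
        return False
    return True


outcome = crawl_failed_site(
    "http://s.example",
    connect,
    lambda s: [s + "/a", s + "/b", s + "/c"],
    site_retries=3,
    link_retries=3,
)
```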
Fig. 5 is the first schematic structural diagram of the network crawling device provided by the invention. As shown in Fig. 5, the network crawling device 10 of this embodiment is applied to a network crawling system that includes one host node and multiple child nodes. The network crawling device 10 includes:
a receiving module 11, configured to receive the subtask sent by the host node, where the subtask includes the task type of the crawling task and the website addresses in the search group corresponding to the child node, the search group includes at least one website address, and the search group is obtained by the host node dividing the at least one website address according to map-reduce and the task type of the crawling task;
a crawling module 12, configured to crawl according to the subtask and store the crawled data in the local storage;
a query module 13, configured to query the local storage to obtain a query result;
a sending module 14, configured to send the query result to the host node.
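The four modules of device 10 can be pictured as four methods on a single object. The sketch below is a loose illustration under stated assumptions: the class and method names are hypothetical, `crawl_site` is an injected stand-in for the whole crawling procedure, and the query is assumed to be a simple substring match because the text does not specify how the local storage is queried.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Subtask:
    task_type: str             # task type of the crawling task
    site_addresses: List[str]  # website addresses in this child node's search group


@dataclass
class ChildNodeDevice:
    # Hypothetical crawler callback: (site, task_type) -> {link: target crawl data}
    crawl_site: Callable[[str, str], Dict[str, str]]
    local_storage: Dict[str, str] = field(default_factory=dict)
    subtask: Optional[Subtask] = None

    def receive(self, subtask: Subtask) -> None:          # receiving module 11
        self.subtask = subtask

    def crawl(self) -> None:                              # crawling module 12
        if self.subtask is None:
            raise RuntimeError("no subtask received")
        for site in self.subtask.site_addresses:
            self.local_storage.update(self.crawl_site(site, self.subtask.task_type))

    def query(self, keyword: str) -> Dict[str, str]:      # query module 13
        return {link: data for link, data in self.local_storage.items() if keyword in data}

    def send(self, result: Dict[str, str], host_inbox: List[Dict[str, str]]) -> None:
        host_inbox.append(result)                          # sending module 14


def fake_crawl(site: str, task_type: str) -> Dict[str, str]:
    # Canned data so the example runs without network access.
    return {
        f"http://{site}/item/1": "6901234567892",
        f"http://{site}/item/2": "6921234567890",
    }


node = ChildNodeDevice(crawl_site=fake_crawl)
node.receive(Subtask("barcode", ["example.com"]))
node.crawl()
inbox: List[Dict[str, str]] = []
node.send(node.query("6901"), inbox)
```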
Optionally, the crawling module 12 is specifically configured to: traverse and connect to the website addresses in the subtask, obtaining a first website address to which the connection succeeded and a second website address to which the connection failed;
obtain the links corresponding to the web data pages to be crawled in the first website address;
traverse and connect to the link corresponding to each web data page to be crawled in the first website address, obtaining first links to which the connection succeeded and second links to which the connection failed;
filter the web data corresponding to each first link according to the task type of the crawling task, obtaining the web data corresponding to the first link;
parse the web data corresponding to the first link, obtaining target crawl data;
store the target crawl data and the corresponding first link in the local storage.
Optionally, the crawling module 12 is further specifically configured to: reconnect to the second link and judge whether the connection between the child node and the second link succeeds;
if so, filter the web data corresponding to each second link according to the task type of the crawling task to obtain the web data corresponding to the second link, parse that web data to obtain the target crawl data, and store the target crawl data and the corresponding second link in the local storage;
if not, repeat the operation of connecting to the second link and judging whether the connection between the child node and the second link succeeds, and, if the number of repeated connections exceeds the first preset number of times, store the second link in the local storage.
Optionally, the crawling module 12 is further specifically configured to: reconnect to the second website address and judge whether the connection between the child node and the second website address succeeds;
if so, obtain the links corresponding to the web data pages to be crawled in the second website address;
traverse and connect to the link corresponding to each web data page to be crawled in the second website address, obtaining third links to which the connection succeeded and fourth links to which the connection failed;
filter the web data corresponding to each third link according to the task type of the crawling task, obtaining the web data corresponding to the third link;
parse the web data corresponding to the third link, obtaining the target crawl data;
store the target crawl data and the corresponding third link in the local storage;
if not, repeat the operation of connecting to the second website address and judging whether the connection between the child node and the second website address succeeds, and, if the number of repeated connections exceeds the second preset number of times, store the second website address in the local storage.
Optionally, the crawling module 12 is further specifically configured to: reconnect to the fourth link and judge whether the connection between the child node and the fourth link succeeds;
if so, filter the web data corresponding to each fourth link according to the task type of the crawling task to obtain the web data corresponding to the fourth link, parse that web data to obtain the target crawl data, and store the target crawl data and the corresponding fourth link in the local storage;
if not, repeat the operation of connecting to the fourth link and judging whether the connection between the child node and the fourth link succeeds, and, if the number of repeated connections exceeds the third preset number of times, store the fourth link in the local storage.
The network crawling device 10 provided by this embodiment of the invention can perform the above method embodiments. For its implementation principle and technical effect, reference may be made to the above method embodiments; they are not repeated here.
Fig. 6 is the second schematic structural diagram of the network crawling device provided by the invention. As shown in Fig. 6, the network crawling device 20 of this embodiment is applied to a network crawling system that includes one host node and multiple child nodes. The network crawling device 20 includes:
a receiving module 21, configured to obtain the query request input by the user and to obtain the task type of the crawling task according to the query request, where the query request corresponds to at least one website address;
a division module 22, configured to divide the at least one website address according to map-reduce and the task type of the crawling task, obtaining at least one search group, where each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node;
a sending module 23, configured to send to each child node its corresponding subtask, where the subtask includes the task type of the crawling task and the website addresses in the corresponding search group;
the receiving module 21 is further configured to receive the query results sent by the child nodes and to aggregate the multiple query results, obtaining a target query result.
The network crawling device 20 provided by this embodiment of the invention can perform the above method embodiments. For its implementation principle and technical effect, reference may be made to the above method embodiments; they are not repeated here.
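The host-node side (dividing website addresses into search groups and aggregating the returned query results) can be sketched in a map-reduce spirit as follows. The text does not state the actual division rule, so the round-robin split here is purely an assumption, as are the function names `divide` and `aggregate` and the dictionary shape of a subtask.

```python
from typing import Dict, List


def divide(site_addresses: List[str], task_type: str, num_children: int) -> List[dict]:
    """Split the website addresses into at most num_children search groups (round-robin)."""
    groups = [site_addresses[i::num_children] for i in range(num_children)]
    # One subtask per non-empty search group; each subtask goes to one child node.
    return [{"task_type": task_type, "site_addresses": g} for g in groups if g]


def aggregate(results: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge the per-child query results into one target query result."""
    merged: Dict[str, str] = {}
    for result in results:
        merged.update(result)
    return merged


subtasks = divide(["a.example", "b.example", "c.example"], "barcode", 2)
target = aggregate([
    {"http://a.example/1": "6901234567892"},
    {"http://b.example/1": "6921234567890"},
])
```

Here `divide` plays the role of the map-side partitioning and `aggregate` the reduce-side merge; a production system would more likely delegate both to a map-reduce framework.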
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks or optical discs.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some or all of the technical features therein, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the invention.

Claims (10)

1. A network crawling method, applied to a network crawling system, the network crawling system comprising one host node and multiple child nodes, characterized in that, for any child node, the method comprises:
the child node receiving a subtask sent by the host node, the subtask comprising the task type of a crawling task and the website addresses in the search group corresponding to the child node, the search group comprising at least one website address, and the search group being obtained by the host node dividing the at least one website address according to the distributed programming framework map-reduce and the task type of the crawling task;
the child node crawling according to the subtask and storing the crawled data in local storage;
the child node querying the local storage to obtain a query result, and sending the query result to the host node.
2. The method according to claim 1, characterized in that the child node crawling according to the subtask and storing the crawled data in local storage comprises:
the child node traversing and connecting to the website addresses in the subtask, obtaining a first website address to which the connection succeeded and a second website address to which the connection failed;
the child node obtaining the links corresponding to the web data pages to be crawled in the first website address;
the child node traversing and connecting to the link corresponding to each web data page to be crawled in the first website address, obtaining first links to which the connection succeeded and second links to which the connection failed;
the child node filtering the web data corresponding to each first link according to the task type of the crawling task, obtaining the web data corresponding to the first link;
the child node parsing the web data corresponding to the first link, obtaining target crawl data;
the child node storing the target crawl data and the corresponding first link in the local storage.
3. The method according to claim 2, characterized in that the method further comprises:
the child node reconnecting to the second link and judging whether the connection between the child node and the second link succeeds;
if so, the child node filtering the web data corresponding to each second link according to the task type of the crawling task to obtain the web data corresponding to the second link, parsing the web data corresponding to the second link to obtain the target crawl data, and storing the target crawl data and the corresponding second link in the local storage;
if not, repeating the operation of connecting to the second link and judging whether the connection between the child node and the second link succeeds, and, if the number of repeated connections exceeds a first preset number of times, the child node storing the second link in the local storage.
4. The method according to claim 2, characterized in that the method further comprises:
the child node reconnecting to the second website address and judging whether the connection between the child node and the second website address succeeds;
if so, the child node obtaining the links corresponding to the web data pages to be crawled in the second website address;
the child node traversing and connecting to the link corresponding to each web data page to be crawled in the second website address, obtaining third links to which the connection succeeded and fourth links to which the connection failed;
the child node filtering the web data corresponding to each third link according to the task type of the crawling task, obtaining the web data corresponding to the third link;
the child node parsing the web data corresponding to the third link, obtaining the target crawl data;
the child node storing the target crawl data and the corresponding third link in the local storage;
if not, repeating the operation of connecting to the second website address and judging whether the connection between the child node and the second website address succeeds, and, if the number of repeated connections exceeds a second preset number of times, the child node storing the second website address in the local storage.
5. The method according to claim 4, characterized in that the method further comprises:
the child node reconnecting to the fourth link and judging whether the connection between the child node and the fourth link succeeds;
if so, the child node filtering the web data corresponding to each fourth link according to the task type of the crawling task to obtain the web data corresponding to the fourth link, parsing the web data corresponding to the fourth link to obtain the target crawl data, and storing the target crawl data and the corresponding fourth link in the local storage;
if not, repeating the operation of connecting to the fourth link and judging whether the connection between the child node and the fourth link succeeds, and, if the number of repeated connections exceeds a third preset number of times, the child node storing the fourth link in the local storage.
6. The method according to claim 1, characterized in that the subtask further comprises a state indicator bit, the state indicator bit being used to indicate whether the child node executes the subtask.
7. A network crawling method, applied to a network crawling system, the network crawling system comprising one host node and multiple child nodes, characterized in that the method comprises:
the host node obtaining a query request input by a user and obtaining the task type of a crawling task according to the query request, the query request corresponding to at least one website address;
the host node dividing the at least one website address according to map-reduce and the task type of the crawling task, obtaining at least one search group, each search group comprising at least one website address, each search group corresponding to one subtask, and each subtask corresponding to one child node;
the host node sending to each child node its corresponding subtask, the subtask comprising the task type of the crawling task and the website addresses in the corresponding search group;
the host node receiving the query results sent by the child nodes and aggregating the multiple query results, obtaining a target query result.
8. The method according to claim 7, characterized in that the subtask further comprises a state indicator bit, the state indicator bit being used to indicate whether the child node executes the subtask.
9. A network crawling device, applied to a network crawling system, the network crawling system comprising one host node and multiple child nodes, characterized in that the device comprises:
a receiving module, configured to receive a subtask sent by the host node, the subtask comprising the task type of a crawling task and the website addresses in the search group corresponding to the child node, the search group comprising at least one website address, and the search group being obtained by the host node dividing the at least one website address according to map-reduce and the task type of the crawling task;
a crawling module, configured to crawl according to the subtask and store the crawled data in local storage;
a query module, configured to query the local storage to obtain a query result;
a sending module, configured to send the query result to the host node.
10. A network crawling device, applied to a network crawling system, the network crawling system comprising one host node and multiple child nodes, characterized in that the device comprises:
a receiving module, configured to obtain a query request input by a user and to obtain the task type of a crawling task according to the query request, the query request corresponding to at least one website address;
a division module, configured to divide the at least one website address according to map-reduce and the task type of the crawling task, obtaining at least one search group, each search group comprising at least one website address, each search group corresponding to one subtask, and each subtask corresponding to one child node;
a sending module, configured to send to each child node its corresponding subtask, the subtask comprising the task type of the crawling task and the website addresses in the corresponding search group;
the receiving module being further configured to receive the query results sent by the child nodes and to aggregate the multiple query results, obtaining a target query result.
CN201710571635.0A 2017-07-13 2017-07-13 network crawling method and device Pending CN107423382A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710571635.0A CN107423382A (en) 2017-07-13 2017-07-13 network crawling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710571635.0A CN107423382A (en) 2017-07-13 2017-07-13 network crawling method and device

Publications (1)

Publication Number Publication Date
CN107423382A true CN107423382A (en) 2017-12-01

Family

ID=60426478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710571635.0A Pending CN107423382A (en) 2017-07-13 2017-07-13 network crawling method and device

Country Status (1)

Country Link
CN (1) CN107423382A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109088908A (en) * 2018-06-06 2018-12-25 武汉酷犬数据科技有限公司 Network-oriented distributed general collection method and system
CN109033269A (en) * 2018-07-10 2018-12-18 卓源信息科技股份有限公司 Distributed regional talent supply-and-demand subject data crawling method
CN110297962A (en) * 2019-06-28 2019-10-01 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN110297962B (en) * 2019-06-28 2021-08-24 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment

Similar Documents

Publication Publication Date Title
CN104951399B (en) Software testing system and method
CN107885777A (en) Control method and system for crawling web data based on collaborative crawlers
US20170235726A1 (en) Information identification and extraction
CN107423382A (en) network crawling method and device
CN104503891A (en) Method and device for online monitoring JVM (Java Virtual Machine) thread
CN107145556B (en) Universal distributed acquisition system
CN106776693A (en) Website data acquisition method and device
CN105956723A (en) Logistics information management method based on data mining
CN106844730A (en) Display method and device for file content
Christensen Next-generation catalogues: what do users think
CN109637238A (en) Exercise generation method, device, equipment and storage medium
CN110222253A (en) Collection method, equipment and computer-readable storage medium
CN103714093B (en) Method and device for mining key pages of a website
CN104424188A (en) System and method for updating obtained webpage data
Cheah et al. An ontological approach for program management lessons learned: Case study at Motorola Penang Design Centre
Najadat et al. Evaluating Jordanian universities' websites based on data envelopment analysis
Stanford Map your knowledge strategy
CN106294058A (en) Method and device for determining a target strategy for handling problems in operation documents
CN115422427A (en) Employment skill requirement analysis system
Murali et al. Crowdsourcing for disaster relief: A multi-platform model
CN106934683A (en) Automatic price comparison method and robot device thereof
CN114912538A (en) Information push model training method, information push method, device and equipment
CN111078975A (en) Multi-node incremental data acquisition system and acquisition method
CN109948939A (en) Credit evaluation system for industrial solid waste supervision subjects
Lin Information visualization from the perspective of big data analysis and fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171201