CN110968770A - Method and device for terminating crawling of crawler tool - Google Patents

Publication number
CN110968770A
Authority
CN
China
Prior art keywords
crawling
data
crawled
target data
crawler tool
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811145418.6A
Other languages
Chinese (zh)
Other versions
CN110968770B (en
Inventor
张鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201811145418.6A priority Critical patent/CN110968770B/en
Publication of CN110968770A publication Critical patent/CN110968770A/en
Application granted granted Critical
Publication of CN110968770B publication Critical patent/CN110968770B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for terminating crawling of a crawler tool, and aims to solve the problem that data crawled by the crawler tool is inaccurate when the crawler tool crawls according to different crawling tasks. The method comprises the following steps: obtaining a crawling result of a crawler tool; judging whether the crawling result meets a termination condition or not, wherein the termination condition can be configured according to the crawling requirement; and if the crawling result meets the termination condition, controlling the crawler tool to finish crawling.

Description

Method and device for terminating crawling of crawler tool
Technical Field
The invention relates to the technical field of data crawling, in particular to a method and a device for terminating crawling of a crawler tool.
Background
Web crawlers, also known as web spiders and web robots, are programs or scripts that automatically capture web information according to certain rules.
While crawling data, a web crawler ends crawling according to a termination condition. For example, it may decide whether to end crawling based on whether page loading has completed, on the number of page turns, or on the crawling depth.
However, such traditional termination conditions are relatively rigid. When crawling for different crawling tasks, the web crawler may crawl extra data beyond the target data, or miss part of the target data, so that the data it crawls is inaccurate.
Disclosure of Invention
In view of the foregoing problems, an object of the embodiments of the present invention is to provide a method and an apparatus for terminating crawling by a crawler tool, so as to solve the problem that data crawled by the crawler tool is inaccurate when the crawler tool crawls according to different crawling tasks.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
In a first aspect, an embodiment of the present invention provides a method for terminating crawling by a crawler tool, the method including: obtaining a crawling result of a crawler tool; judging whether the crawling result meets a termination condition, wherein the termination condition can be configured according to the crawling requirement; and if the crawling result meets the termination condition, controlling the crawler tool to end crawling.
In other embodiments of the present invention, the determining whether the crawling result meets the termination condition includes: when the crawling result comprises crawled data, judging whether the crawled data comprises target data; if yes, judging that the crawling result meets a termination condition; and/or when the crawling result comprises a crawling parameter, judging whether the value of the crawling parameter reaches a preset value; and if so, judging that the crawling result meets a termination condition.
In other embodiments of the present invention, the determining whether the crawled data includes target data includes: respectively obtaining the characteristics of the crawled data and the characteristics of the target data; determining whether the crawled data comprises the target data or not according to a comparison result of the characteristics of the crawled data and the characteristics of the target data; if the comparison result is that the features of the crawled data are matched with the features of the target data, determining that the crawled data comprise the target data; and if the comparison result is that the features of the crawled data are not matched with the features of the target data, determining that the crawled data do not comprise the target data.
In other embodiments of the present invention, the determining whether the crawled data includes target data includes: determining whether the crawled data comprises the target data according to whether preset content is obtained or not, wherein the preset content is generated in a current page after the crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises the target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
In other embodiments of the present invention, the preset content includes: at least one of a page element and page request data.
In other embodiments of the present invention, the crawling parameters comprise at least one of: the number of single-page operations, the number of page turns, the crawling path depth, the number of pages to be crawled by the crawler tool at the next node, and the ratio of the number of nodes after the node where the crawler tool is currently located to the total number of nodes.
In a second aspect, an embodiment of the present invention provides an apparatus for terminating crawling by a crawler, the apparatus comprising: an acquisition module configured to obtain a crawling result of a crawler tool; the judging module is configured to judge whether the crawling result meets a termination condition, and the termination condition can be configured according to crawling requirements; a control module configured to control the crawler tool to end crawling if the crawling result satisfies the termination condition.
In other embodiments of the present invention, the determining module is configured to determine whether the crawled data includes target data when the crawling result includes the crawled data; if yes, judging that the crawling result meets a termination condition; and/or judging whether the value of the crawling parameter reaches a preset value or not when the crawling result comprises the crawling parameter; and if so, judging that the crawling result meets the termination condition.
In other embodiments of the present invention, the determining module is configured to obtain the feature of the crawled data and the feature of the target data respectively; determining whether the crawled data comprises target data or not according to a comparison result of the characteristics of the crawled data and the characteristics of the target data; if the comparison result is that the features of the crawled data are matched with the features of the target data, determining that the crawled data comprise the target data; and if the comparison result is that the features of the crawled data are not matched with the features of the target data, determining that the crawled data do not comprise the target data.
In other embodiments of the present invention, the determining module is configured to determine whether the crawled data includes target data according to whether preset content is obtained, where the preset content is generated in a current page after a crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method described in one or more of the above technical solutions when executing the program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method described in one or more of the above technical solutions.
In the method and the device for terminating crawling of a crawler tool provided by the embodiments of the invention, a crawling result of the crawler tool is first obtained; then, whether the crawling result meets a termination condition configured according to the crawling requirement is judged; and finally, if the crawling result meets the termination condition, the crawler tool is controlled to end crawling. Because the termination condition is configured according to the crawling requirement, the crawler tool ends crawling when its crawling result satisfies the termination condition, that is, once the crawling requirement has been met. This prevents the crawler tool from crawling extra data that the crawling requirement does not need, or from missing part of the data that the crawling requirement does need, and thus improves the accuracy of the crawled data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to the drawings without creative efforts for those skilled in the art.
FIG. 1 is a schematic flow diagram of a method of terminating crawling of crawler tools in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an apparatus for terminating crawling by a crawler tool in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. Other embodiments may be derived from these embodiments by those of ordinary skill in the art without the exercise of inventive faculty.
The embodiment of the invention provides a method for terminating crawling of a crawler tool. In practice, the method can be applied when a crawler tool crawls with a specific purpose, for example: crawling stock information, crawling weather information, crawling news, and the like. During such purposeful crawling, the termination condition can be configured according to the crawling requirement, so that crawling ends once the crawling result of the crawler tool meets the crawling requirement, which improves the accuracy of the data crawled by the crawler tool.
The method for terminating crawling by the crawler tool provided by the embodiment of the invention is explained in the following with reference to fig. 1.
Fig. 1 is a schematic flow chart of a method for terminating crawling by a crawler tool in an embodiment of the present invention, and referring to fig. 1, the method includes:
S110: obtaining a crawling result of the crawler tool.
The method for terminating crawling of the crawler tool may be executed by the crawler tool itself, where the crawler tool may be any tool with a data crawling function, for example a web crawler. Using the crawler tool itself as the execution subject can improve its own crawling performance, that is, the accuracy of the data it crawls. The method may also be executed by a computer program other than the crawler tool; this likewise improves the accuracy of the crawled data and additionally prevents the crawler tool from occupying too many space resources.
Here, the crawler tool crawls according to a crawling task. The crawling result may be the data crawled by the crawler tool according to the crawling task, or crawling parameters recorded while the crawler tool crawls according to the crawling task, for example one or more of: the number of single-page operations, the number of page turns, the crawling path depth, the number of pages to be crawled by the crawler tool at the next node, and the ratio of the number of nodes after the node where the crawler tool is currently located to the total number of nodes.
S120: judging whether the crawling result meets a termination condition.
The termination condition can be configured each time before the crawler tool crawls according to a crawling task, so that the data crawled for different crawling tasks meets the requirement of each task. The termination condition may be configured according to a termination condition preconfigured in the crawling task currently to be executed; when the crawling task has no preconfigured termination condition, the termination condition may be configured according to the crawling task; the termination condition may also be configured directly according to the crawling requirement. The reference used when configuring the termination condition is not specifically limited herein.
Specifically, a plurality of termination conditions may be preset in the crawler tool, and before crawling, the user selects one or more of these preset termination conditions according to the crawling task. For example: termination conditions A, B, and C are preset in the crawler tool, and the termination condition preconfigured in crawling task D is termination condition A. Before the crawler tool crawls according to crawling task D, the user can select termination condition A as the termination condition for the current crawl, so that the data crawled this time is the data required by crawling task D.
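The selection of preset termination conditions described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the dictionary shape of the crawling result, the concrete predicates behind conditions A, B, and C, and the rule that crawling ends as soon as any selected condition holds are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# A crawling result is modelled as a plain dict; this shape is an assumption.
CrawlResult = Dict[str, object]
Condition = Callable[[CrawlResult], bool]

# Hypothetical preset termination conditions A, B and C.
PRESET_CONDITIONS: Dict[str, Condition] = {
    "A": lambda r: "target" in r.get("data", []),
    "B": lambda r: r.get("page_turns", 0) >= 20,
    "C": lambda r: r.get("depth", 0) >= 3,
}


def select_conditions(names: List[str]) -> List[Condition]:
    """Return the preset conditions chosen for the current crawling task."""
    return [PRESET_CONDITIONS[name] for name in names]


def should_terminate(result: CrawlResult, conditions: List[Condition]) -> bool:
    """Assumed rule: crawling ends as soon as any selected condition holds."""
    return any(cond(result) for cond in conditions)


# Crawling task D is preconfigured with termination condition A.
task_d_conditions = select_conditions(["A"])
```

With this arrangement, a result that only satisfies condition B does not end crawling task D, because only condition A was selected for it.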
S130: if the crawling result meets the termination condition, controlling the crawler tool to end crawling.
If all the data in the crawling result is present in the termination condition, the crawling result is determined to match the termination condition successfully, that is, the crawling result meets the termination condition.
In addition, if the crawling result does not meet the termination condition, the judgment may be repeated, which avoids the situation where an error in the first judgment prevents the crawler tool from being controlled to end crawling. Alternatively, the crawler tool may be controlled to continue crawling, and after an interval the crawling result is obtained again and judged against the termination condition, so that the crawler tool does not crawl indefinitely.
It should be noted that the crawler tool crawls continuously according to the crawling task until crawling is ended. Controlling the crawler tool to continue crawling here simply means not controlling it to end crawling, so that it keeps crawling according to the crawling task.
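The loop formed by S110 to S130, including re-obtaining the crawling result after an interval, might look like the following Python sketch. The three callables, the `interval_s` parameter, and the optional `max_checks` safety bound are hypothetical names introduced here; the patent does not prescribe an interface.

```python
import time
from typing import Callable, Optional


def run_until_terminated(
    get_result: Callable[[], dict],
    is_satisfied: Callable[[dict], bool],
    stop_crawler: Callable[[], None],
    interval_s: float = 1.0,
    max_checks: Optional[int] = None,
) -> bool:
    """Poll the crawling result and stop the crawler once the condition holds.

    Returns True if the termination condition ended the crawl, and False if
    the assumed safety bound max_checks was exhausted first.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        result = get_result()        # S110: obtain the crawling result
        if is_satisfied(result):     # S120: judge the termination condition
            stop_crawler()           # S130: control the crawler to end crawling
            return True
        checks += 1
        time.sleep(interval_s)       # the crawler keeps crawling in the meantime
    return False
```

Not calling `stop_crawler` is what "continue crawling" amounts to in this sketch: the loop only intervenes when the condition is met.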
In practical application, firstly, a user configures termination conditions according to crawling requirements; then, the crawler tool crawls according to the crawling task; then, the crawler tool obtains a crawling result and judges whether the crawling result meets a termination condition; and finally, if the crawling result meets the termination condition, the crawler tool finishes crawling.
It can thus be seen that configuring the termination condition according to the crawling requirement before the crawler tool crawls according to the crawling task allows the termination condition to be configured flexibly. The crawler tool ends crawling when its crawling result satisfies the termination condition, that is, once the crawling requirement has been met. This prevents the crawler tool from crawling extra data that the crawling requirement does not need, or from missing part of the data that the crawling requirement does need, and thus improves the accuracy of the crawled data.
Based on the foregoing embodiment, in order to judge more accurately and conveniently whether the crawling result meets the termination condition, S120 further includes:
S121: judging whether the crawled data includes target data;
S122: judging whether the value of the crawling parameter reaches a preset value.
S121 and S122 have no fixed order. Both S121 and S122 may be executed, or only one of them may be executed.
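One way to read the "and/or" relationship between S121 and S122 is sketched below in Python. The function names, the use of sets for data and dictionaries for parameters, and the choice to treat the termination condition as met when either executed check succeeds are assumptions for illustration only.

```python
from typing import Dict, Optional, Set


def s121_includes_target(crawled: Set[str], target: Set[str]) -> bool:
    """S121: the crawled data includes all of the target data."""
    return target <= crawled


def s122_reaches_preset(params: Dict[str, float], presets: Dict[str, float]) -> bool:
    """S122: some crawling parameter has reached its preset value."""
    return any(params.get(name, 0) >= limit for name, limit in presets.items())


def s120(
    crawled: Optional[Set[str]] = None,
    target: Optional[Set[str]] = None,
    params: Optional[Dict[str, float]] = None,
    presets: Optional[Dict[str, float]] = None,
) -> bool:
    """Run whichever of S121/S122 has inputs; assumed rule: either one suffices."""
    checks = []
    if crawled is not None and target is not None:
        checks.append(s121_includes_target(crawled, target))
    if params is not None and presets is not None:
        checks.append(s122_reaches_preset(params, presets))
    return any(checks)
```

Because the two checks are independent, they can be evaluated in either order, which matches the statement that S121 and S122 are not sequenced.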
Specifically, when the termination condition configured for the crawler tool is a data termination condition, the crawling result may be data that the crawler tool has crawled, page request data, or an element in the page. If the crawled data includes the target data, or request data appears in the page, or a certain element disappears from the page, the crawling result is judged to meet the termination condition. In this way, whether the crawling result meets the termination condition can be judged accurately and conveniently.
Generally speaking, the data required by the crawling task constitutes the target data. By judging whether the crawled data includes the target data, it can be determined whether the data currently crawled by the crawler tool is the data required by the crawling task, and hence whether to control the crawler tool to end crawling.
For example: the data needed by the crawling task are A and B, if the data crawled by the crawler tool at present is A, the data crawled by the crawler tool at present is determined not to be all the data needed by the crawling task, and the crawler tool is not controlled to finish crawling, so that the crawler tool continues to crawl; if the data crawled by the crawler tool at present are A and B, determining that the data crawled by the crawler tool at present are all data required by a crawling task, and controlling the crawler tool to finish crawling; if the data that the crawler tool crawls at present is C, the data that the crawler tool crawls at present is determined not to be the data needed by the crawling task, and then prompt information can be generated to remind a user that the data that the crawler tool crawls are not the data needed by the crawling task.
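The three outcomes in the A/B/C example above can be sketched in Python as follows. The `Outcome` enum and the `check_crawled` function are hypothetical names, and representing data items as strings in sets is an assumption.

```python
from enum import Enum
from typing import Set


class Outcome(Enum):
    CONTINUE = "continue crawling"
    TERMINATE = "end crawling"
    WARN = "notify user: crawled data is not required by the task"


def check_crawled(crawled: Set[str], required: Set[str]) -> Outcome:
    """Compare the data crawled so far with the data the task requires."""
    if required <= crawled:
        return Outcome.TERMINATE   # all required data has been crawled
    if crawled and not (crawled & required):
        return Outcome.WARN        # nothing crawled so far is required
    return Outcome.CONTINUE        # only part of the required data is present
```

For a task that requires A and B, `check_crawled({"A"}, {"A", "B"})` yields `Outcome.CONTINUE`, `check_crawled({"A", "B"}, {"A", "B"})` yields `Outcome.TERMINATE`, and `check_crawled({"C"}, {"A", "B"})` yields `Outcome.WARN`.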
Further, when the termination condition configured for the crawler tool is a behavior termination condition, the crawling result may be a crawling parameter. A crawling parameter is data related to the operations performed by the crawler tool during crawling, for example at least one of: the number of single-page operations, the number of page turns, the crawling path depth, the number of pages to be crawled by the crawler tool at the next node, and the ratio of the number of nodes after the node where the crawler tool is currently located to the total number of nodes. If the crawling parameter reaches a preset value, where the preset value is a value set in advance in the termination condition, the crawling result is judged to meet the termination condition. In this way, whether the crawling result meets the termination condition can be judged accurately and conveniently.
Based on the foregoing embodiments, in order to make the data crawled by the crawler tool more accurate, S121 further includes:
S1211a: obtaining the features of the crawled data and the features of the target data, respectively.
The features of the crawled data are obtained from the crawled data, that is, the data crawled by the crawler tool according to the crawling task; each time the crawler tool crawls a piece of data, it can obtain the features of that piece of data. The features of the target data are obtained from the target data, that is, the data required by the crawling task; the features common to the data required by the crawling task are the features of the target data.
Here, a feature is a quantity that indicates an attribute of the data, for example: the time at which the data was generated, the type of the data, and so on.
S1211b: determining whether the crawled data includes the target data according to the result of comparing the features of the crawled data with the features of the target data.
The comparison of the features of the crawled data with the features of the target data may be performed by the crawler tool, which improves how promptly the crawler tool processes data; or it may be performed by a computer program other than the crawler tool, which prevents the crawler tool from occupying excessive space resources. The execution subject of the comparison is not specifically limited herein.
Specifically, if the features of the crawled data are matched with the features of the target data, determining that the current data crawled by the crawler tool is the target data, and storing the data; and if the characteristics of the crawled data are not matched with the characteristics of the target data, determining that the current data crawled by the crawler tool is not the target data, and deleting the data.
Therefore, after the crawler tool has crawled for a period of time according to the crawling task, if the number of posts stored by the crawler tool equals the number of posts required by the crawling task, it can be determined that the crawled data includes the target data. If, however, the crawler tool has stored no posts, or the number of stored posts is clearly smaller than the number required by the crawling task, it is determined that the crawled data does not include the target data.
Therefore, according to the comparison result of the characteristics of the crawled data and the characteristics of the target data, whether the crawled data comprises the target data or not can be accurately determined, the phenomenon that the crawler tool crawls more data irrelevant to the target data or crawls less target data can be more accurately avoided, and the data crawled by the crawler tool are more accurate.
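The feature comparison of S1211a and S1211b can be sketched in Python as below. This is only an illustration: the `Item` class, the representation of features as string key/value pairs, and the rule that an item matches when every target feature has the same value in the item are assumptions, since the patent does not define how features are compared.

```python
from dataclasses import dataclass
from typing import Dict, List

Features = Dict[str, str]


@dataclass
class Item:
    content: str
    features: Features   # for example generation time or data type


def matches(item_features: Features, target_features: Features) -> bool:
    """Assumed rule: every target feature has the same value in the item."""
    return all(item_features.get(k) == v for k, v in target_features.items())


def filter_items(items: List[Item], target_features: Features) -> List[Item]:
    """Keep items whose features match; non-matching items are discarded."""
    return [it for it in items if matches(it.features, target_features)]
```

The list returned by `filter_items` plays the role of the stored data, and its length can then be compared with the amount of data the crawling task requires.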
S121 is explained below by way of specific examples.
The crawling task requires all posts published in August 2018 in a certain automobile forum. The crawler tool crawls the posts in the forum according to the crawling task and obtains the posting time of each post it crawls. The posting time falls into one of the following two cases:
In the first case, the posting time of the post is August 2018, which is the same as the posting time of the posts required by the crawling task. The post can therefore be determined to be a post required by the crawling task, and it is stored.
In the second case, the posting time of the post is not August 2018, for example July 2018, which differs from the posting time of the posts required by the crawling task. The post can therefore be determined not to be required by the crawling task, and it is deleted.
Then, after the crawler tool has crawled for a period of time according to the crawling task, if the number of posts stored by the crawler tool equals the number of posts required by the crawling task, it can be determined that the crawled data includes the target data, where the number of posts required by the crawling task may be preset or may be contained in the termination condition. If, however, the crawler tool has stored no posts, or the number of stored posts is clearly smaller than the number required by the crawling task, it is determined that the crawled data does not include the target data.
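The forum example can be worked through in a short Python sketch. The `(title, posting date)` tuple shape, the function names, and the `required_count` parameter are hypothetical; the sketch only illustrates storing posts from August 2018 and ending once the stored count equals the required count.

```python
from datetime import date
from typing import Iterable, List, Tuple

Post = Tuple[str, date]  # (title, posting date); a hypothetical shape


def is_august_2018(posted: date) -> bool:
    return posted.year == 2018 and posted.month == 8


def crawl_posts(posts: Iterable[Post], required_count: int) -> Tuple[List[Post], bool]:
    """Store matching posts and report whether the required count was reached."""
    stored: List[Post] = []
    for post in posts:
        if is_august_2018(post[1]):
            stored.append(post)          # first case: store the post
        # second case: posts from other months are not stored
        if len(stored) == required_count:
            return stored, True          # crawled data includes the target data
    return stored, False                 # fewer posts stored than required
```

For instance, given posts dated 1 August, 30 July, 15 August, and 20 August 2018 and a required count of 2, the function stores the first and third posts and stops before reaching the fourth.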
Based on the foregoing embodiments, in order to determine conveniently whether the crawled data includes the target data, S121 further includes:
S1212a: determining whether the crawled data includes the target data according to whether preset content is obtained.
The preset content is generated on the current page after the crawler tool crawls the target data. That is, only after the crawler tool crawls the target data, the page generates the preset content, and the crawler tool can be known to crawl the target data through the preset content.
Here, the preset content may be page Request data, i.e., a Request; the preset content can also be page Response data, namely Response; the preset content may also be a particular element on the page. The implementation form of the preset content is not particularly limited herein.
Specifically, when the crawler tool crawls a page according to the crawling task, if the crawler tool or a computer program other than the crawler tool obtains the preset content, it can be determined that the crawler tool has crawled the target data, that is, the crawled data includes the target data. If the crawler tool or a computer program other than the crawler tool has still not obtained the preset content after crawling for a period of time, it is determined that the crawler tool has not crawled the target data, that is, the crawled data does not include the target data.
Therefore, by confirming whether the page has generated the preset content, it can be determined conveniently whether the crawled data includes the target data, which prevents the crawler tool from crawling extra data unrelated to the target data or from missing part of the target data.
Based on the foregoing embodiment, in order to confirm more conveniently whether the page has generated the preset content, the preset content further includes: at least one of a page element and page request data.
A page element may be an element in the page that is associated with the target data; when the target data has been crawled, the page element may go from absent to present, or from present to absent. Likewise, the page request data may be request data in the page that is associated with the target data, and it may be generated in the current page when the target data has been crawled.
Specifically, the crawling of the crawler tool to the target data can be determined according to the page elements generated on the page, and the crawling of the crawler tool to the target data can also be determined according to the disappearance of the page elements on the page. Here, the page element may be a symbol or a piece of information, and is not limited in detail here. Whether the page generates the preset content or not can be conveniently confirmed by judging whether the page elements are generated on the page or not or whether the page elements on the page disappear or not.
For example: the crawling task requires all posts published in August 2018 in a certain automobile forum. When the crawler tool has crawled all the posts published in August 2018 in the forum, the page generates a prompt message indicating that all posts published in August 2018 have been crawled, or the elements bearing the text for August 2018 disappear from the page. Either change makes it possible to confirm clearly and conveniently that the page has generated the preset content.
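Detecting that a page element has appeared or disappeared can be sketched in Python by comparing two snapshots of the page. Representing a page as a set of element identifiers, and the identifiers used in the usage note, are assumptions for illustration; a real crawler would obtain these from its own page model.

```python
from typing import Optional, Set


def element_appeared(before: Set[str], after: Set[str], element: str) -> bool:
    """The element was absent in the earlier snapshot and is present now."""
    return element not in before and element in after


def element_disappeared(before: Set[str], after: Set[str], element: str) -> bool:
    """The element was present in the earlier snapshot and is absent now."""
    return element in before and element not in after


def preset_content_generated(
    before: Set[str],
    after: Set[str],
    appear: Optional[str] = None,
    disappear: Optional[str] = None,
) -> bool:
    """True if the watched element appeared, or the watched element vanished."""
    if appear is not None and element_appeared(before, after, appear):
        return True
    if disappear is not None and element_disappeared(before, after, disappear):
        return True
    return False
```

With hypothetical identifiers, a page that goes from `{"post-2018-08", "pager"}` to `{"pager", "all-crawled-notice"}` satisfies both `appear="all-crawled-notice"` and `disappear="post-2018-08"`.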
Furthermore, whether the page generates the preset content or not can be determined according to whether the page request data is generated on the page or not. Here, the page request data may also refer to page response data.
Specifically, the page request data is related to the crawling task. When the target data has been crawled, the page generates page request data, and the crawler tool can intercept it to determine that the page has generated the preset content. When the crawler tool has not crawled the target data, the page generates no page request data, so the crawler tool has nothing to intercept and it is determined that the page has not generated the preset content. Therefore, whether the page has generated the preset content can be confirmed clearly and conveniently according to whether page request data is generated on the page.
For example: the crawling task requires all posts published in August 2018 in a certain automobile forum. When the crawler tool has crawled all the posts published in August 2018 in the forum, the page may generate page request data that prompts the user whether to continue crawling posts published outside August 2018. When this page request data is obtained, it can be confirmed clearly and conveniently that the page has generated the preset content.
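Interception of page request data could be modelled as in the following Python sketch. The `RequestWatcher` class, its `on_request` callback, the `marker` substring, and the example URLs are all hypothetical; how requests are actually intercepted depends on the crawler tool and is not specified in the patent.

```python
from typing import List


class RequestWatcher:
    """Records intercepted page requests and flags the preset one."""

    def __init__(self, marker: str) -> None:
        self.marker = marker          # hypothetical substring identifying the request
        self.seen: List[str] = []
        self.triggered = False

    def on_request(self, url: str) -> None:
        """Callback the crawler is assumed to invoke for each page request."""
        self.seen.append(url)
        if self.marker in url:
            self.triggered = True     # preset content has been generated


watcher = RequestWatcher(marker="/continue?after=2018-08")
watcher.on_request("https://forum.example/list?page=2")
first_check = watcher.triggered
watcher.on_request("https://forum.example/continue?after=2018-08")
second_check = watcher.triggered
```

Here `first_check` is `False` because no matching request has been seen, and `second_check` is `True` once the request carrying the marker is intercepted.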
Based on the foregoing embodiments, in order to make the data crawled by the crawler tool more accurate, the crawling parameters further include at least one of: the number of operations on a single page, the number of page turns, the crawling path depth, the number of pages to be crawled by the crawler tool at the next node, and the ratio of the number of nodes after the node where the crawler tool is located to the total number of nodes.
Specifically, when the crawling parameter is the number of operations on a single page, the number of operations may be the number of clicks on page elements by the crawler tool. When the crawler tool crawls according to the crawling task, if the number of clicks of the crawler tool on the page is greater than 50, this indicates that the crawler tool has crawled the page sufficiently, and the value of the crawling parameter is determined to be greater than the preset value; if the number of clicks of the crawler tool on the page is less than or equal to 50, this indicates that the crawler tool has not crawled the page sufficiently, and the value of the crawling parameter is determined to be less than or equal to the preset value.
Here, the number of clicks on the page elements of the current page can easily be obtained from the operation behavior of the crawler tool, so whether the value of the crawling parameter is greater than the preset value can be conveniently judged from it. The click count of 50 is merely an example; the click count may also be 40, 60, or the like, and is not specifically limited.
Alternatively, when the crawling parameter is the number of page turns, the number of page turns refers to the number of pages that have been turned while the crawler tool crawls multiple pages. When the crawler tool crawls according to the crawling task, if the number of page turns of the crawler tool is greater than 20, this indicates that the crawler tool has crawled sufficiently, and the value of the crawling parameter is determined to be greater than the preset value; if the number of page turns of the crawler tool is less than or equal to 20, this indicates that the crawler tool has not crawled sufficiently, and the value of the crawling parameter is determined to be less than or equal to the preset value.
Here, the number of page turns can also easily be obtained from the operation behavior of the crawler tool, so whether the value of the crawling parameter is greater than the preset value can be conveniently judged from it. The page-turn count of 20 is only an example; the number of page turns may also be 10, 30, or the like, and is not specifically limited.
Moreover, when the crawling parameter is the crawling path depth, the crawling path depth refers to the depth that the crawler tool has reached in the crawling path. When the crawler tool crawls according to the crawling task, if the crawling path depth of the crawler tool is greater than 3, this indicates that the crawler tool has crawled sufficiently, and the value of the crawling parameter is determined to be greater than the preset value; if the crawling path depth of the crawler tool is less than or equal to 3, this indicates that the crawler tool has not crawled sufficiently, and the value of the crawling parameter is determined to be less than or equal to the preset value.
Here, the crawling path depth can also easily be obtained from the operation behavior of the crawler tool, so whether the value of the crawling parameter is greater than the preset value can be conveniently judged from it. The crawling path depth of 3 is only an example; the depth may also be 2, 4, or the like, and is not specifically limited.
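The three threshold checks described so far (clicks, page turns, path depth) can be sketched together in Python. The class and function names are hypothetical, the defaults are the example values 50, 20 and 3 from the text, and combining the three checks with a logical OR is an assumption; the text only says that at least one of the parameters is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorLimits:
    """Preset values for the behavior termination condition (example values)."""
    max_clicks: int = 50
    max_page_turns: int = 20
    max_depth: int = 3


def behavior_limit_exceeded(clicks, page_turns, depth, limits=None):
    """Return True if any crawling parameter is greater than its preset value.

    Assumption: exceeding any single limit is enough to terminate.
    """
    if limits is None:
        limits = BehaviorLimits()
    return (clicks > limits.max_clicks
            or page_turns > limits.max_page_turns
            or depth > limits.max_depth)
```

A value equal to the preset value does not trigger termination here, matching the "less than or equal to" branches in the text.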
In addition, when the crawling parameter is the number of pages to be crawled by the crawler tool at the next node, suppose the crawling task is to crawl the information of all users in a certain forum; the crawling chain corresponding to the crawling task is then: forum list page -> content detail page -> user information page. The forum list page is the home page of the forum and contains a list of the titles of various items of information; the content detail page is the page entered from the title of an item in the list; the user information page is the page entered from the content detail page, on which user information is displayed.
Specifically, after the crawler tool enters the content detail page of information A, if the number of user information pages to be crawled in the next step is greater than or equal to 50, it is determined that the value of the crawling parameter has reached the preset value. Although target data may still exist on the pages following the content detail page of information A, the target data obtainable there is less important than the totality of the data to be crawled by the crawling task, so determining at this point that the preset value has been reached improves crawling efficiency. If, after the crawler tool enters the content detail page of information A, the number of user information pages to be crawled in the next step is less than 50, it is determined that the value of the crawling parameter has not reached the preset value and that the crawled data does not yet include the target data; in this way excessive crawling time is not wasted, missed target data is avoided, and the integrity of the crawled data is improved.
It should be noted that the number of pages 50 is only an example, and the number of pages may also be 30, 40, 60, 70, and the like, and is not limited herein.
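A small Python sketch of the next-node check follows. The function name and the `pending_user_pages` list are hypothetical; 50 is the example page count from the text.

```python
def next_node_limit_reached(pages_to_crawl_next, preset_pages=50):
    """Return True if the next node would require at least preset_pages pages."""
    return pages_to_crawl_next >= preset_pages


# Hypothetical state after entering the content detail page of information A:
# 62 user information pages are queued for the next node of the chain
# forum list page -> content detail page -> user information page.
pending_user_pages = [f"/user/{i}" for i in range(62)]
stop_here = next_node_limit_reached(len(pending_user_pages))
```

Unlike the click, page-turn, and depth checks, this comparison is inclusive: exactly 50 pending pages already counts as reaching the preset value.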
Furthermore, when the crawling parameter is the ratio of the number of nodes after the node where the crawler tool is located to the total number of nodes, note that certain nodes on the crawling chain may be expanded automatically during crawling according to the crawling task. When the crawler tool crawls according to the crawling task, if the ratio of the number of nodes after the node where the crawler tool is located to the total number of nodes of the crawling chain is less than or equal to 30%, that is, the crawler tool has completed 70% or more of the nodes on the chain, this indicates that the crawler tool has crawled most of the data required by the crawling task, and the value of the crawling parameter is determined to be greater than the preset value. If the ratio is greater than 30%, that is, the crawler tool has completed fewer than 70% of the nodes on the crawling chain, this indicates that the crawler tool has not yet crawled most of the data required by the crawling task, and the value of the crawling parameter is determined to be less than or equal to the preset value. In this way, both the accuracy of the crawled data and the crawling efficiency of the crawler tool can be improved.
It should be noted that the above ratio of 30% is only an example, and the above ratio may also be 20%, 40%, etc., and is not limited herein.
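The ratio check can be sketched as follows. The zero-based `current_index`, the function names, and the convention that the current node counts as completed are assumptions made for illustration; the comparison is done in integers to avoid floating-point edge cases at exactly 30%.

```python
def nodes_remaining(current_index, total_nodes):
    """Number of nodes after the node at zero-based current_index."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= current_index < total_nodes:
        raise ValueError("current_index out of range")
    return total_nodes - 1 - current_index


def mostly_crawled(current_index, total_nodes, max_remaining_percent=30):
    """Return True if the remaining nodes are at most max_remaining_percent
    of the total, i.e. remaining / total <= max_remaining_percent / 100."""
    remaining = nodes_remaining(current_index, total_nodes)
    return remaining * 100 <= max_remaining_percent * total_nodes
```

On a chain of 10 nodes, standing on the seventh node (index 6) leaves 3 nodes, which is exactly 30%, so the check passes; on the sixth node (index 5) it does not.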
In addition, each time before the crawler tool performs a crawling action, the crawling result is obtained and whether it meets the termination condition is judged. In this way, the crawler tool can be effectively prevented from crawling data irrelevant to the target data, so that it crawls data accurately.
The overall working process of the method for terminating crawling by the crawler tool in the embodiment of the invention is described below by way of a specific example.
While the crawler tool crawls according to the crawling task, the crawling result of the crawler tool is first obtained; the crawling result may be the data crawled by the crawler tool or the crawling parameters of the crawler tool. Next, whether the crawling result meets a termination condition is judged, the termination condition being configured by the user before the crawling task is executed with the crawler tool. The termination condition may be a data termination condition, which is used to judge whether the crawled data includes the target data, or a behavior termination condition, which is used to judge whether the value of a crawling parameter reaches a preset value. When judging whether the crawling result meets the data termination condition, whether the crawled data includes the target data may be determined by comparing the features of the crawled data with the features of the target data, or by judging whether a page element or page request data appears on the current page. Finally, when the crawling result is determined to meet the termination condition, the crawler tool ends crawling.
In this way, the crawler tool ends crawling once the crawling requirement is satisfied, which avoids crawling extra data that the crawling requirement does not need, or crawling less than the data that the crawling requirement needs, and thus improves the accuracy of the data crawled by the crawler tool.
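The overall flow (obtain the result, judge the termination condition before each crawling action, end crawling when it is met) can be sketched as a small loop. This is an illustration only: modelling crawling actions as callables that return lists of records, and the sample data, are assumptions.

```python
def crawl_until_terminated(actions, should_terminate):
    """Run crawl actions in order, judging the termination condition
    before each action is performed.

    actions: iterable of zero-argument callables, each returning a list
             of crawled records (hypothetical representation).
    should_terminate: callable taking the data crawled so far and
                      returning True when crawling should end.
    """
    crawled = []
    for action in actions:
        if should_terminate(crawled):
            break
        crawled.extend(action())
    return crawled


# Hypothetical data termination condition: a post dated July 2018 has been
# seen, so all August 2018 posts are assumed to have been crawled.
actions = [
    lambda: ["2018-08 post a"],
    lambda: ["2018-08 post b", "2018-07 post c"],
    lambda: ["2018-06 post d"],
]
reached_july = lambda data: any(item.startswith("2018-07") for item in data)
result = crawl_until_terminated(actions, reached_july)
```

Because the condition is evaluated before every action, the third action is never executed in this example.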
Based on the same inventive concept, the embodiment of the invention also provides a device for terminating crawling of the crawler tool. Fig. 2 is a schematic structural diagram of an apparatus for terminating crawling by a crawler tool in an embodiment of the present invention, and referring to fig. 2, the apparatus 200 for terminating crawling by a crawler tool includes: an obtaining module 210 configured to obtain a crawling result of a crawler tool; a determining module 220 configured to determine whether the crawling result meets a termination condition, wherein the termination condition can be configured according to a crawling requirement; and the control module 230 is configured to control the crawler tool to finish crawling if the crawling result meets the termination condition.
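The division into an obtaining module, a determining module, and a control module might be modelled roughly as below. All class and method names, and the `StubCrawler` used to exercise them, are hypothetical; the patent describes the modules only functionally.

```python
class ObtainingModule:
    """Obtains the crawling result of the crawler tool."""

    def __init__(self, crawler):
        self.crawler = crawler

    def obtain(self):
        return self.crawler.crawling_result()


class DeterminingModule:
    """Judges whether a crawling result meets the configured termination condition."""

    def __init__(self, termination_condition):
        self.termination_condition = termination_condition

    def is_satisfied(self, result):
        return bool(self.termination_condition(result))


class ControlModule:
    """Ends crawling when the termination condition is met."""

    def __init__(self, crawler):
        self.crawler = crawler

    def apply(self, satisfied):
        if satisfied:
            self.crawler.stop()


class StubCrawler:
    """Stand-in crawler tool used only for this illustration."""

    def __init__(self, page_turns):
        self.page_turns = page_turns
        self.running = True

    def crawling_result(self):
        return {"page_turns": self.page_turns}

    def stop(self):
        self.running = False


def check_once(crawler, termination_condition):
    """Wire the three modules together for one check; return whether
    the crawler is still running afterwards."""
    result = ObtainingModule(crawler).obtain()
    satisfied = DeterminingModule(termination_condition).is_satisfied(result)
    ControlModule(crawler).apply(satisfied)
    return crawler.running
```

Keeping the termination condition as a plain callable mirrors the statement that the condition can be configured according to the crawling requirement.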
Based on the above embodiment, the determining module is configured to determine whether the crawled data includes the target data when the crawling result includes the crawled data; if yes, judging that the crawling result meets a termination condition; and/or judging whether the value of the crawling parameter reaches a preset value or not when the crawling result comprises the crawling parameter; and if so, judging that the crawling result meets the termination condition.
Based on the above embodiment, the determining module is configured to obtain the features of the crawled data and the features of the target data respectively; determining whether the crawled data comprises target data or not according to a comparison result of the characteristics of the crawled data and the characteristics of the target data; if the comparison result is that the features of the crawled data are matched with the features of the target data, determining that the crawled data comprise the target data; and if the comparison result is that the features of the crawled data are not matched with the features of the target data, determining that the crawled data do not comprise the target data.
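Feature comparison can be sketched as follows, assuming that crawled records are dictionaries and that a "feature" is the value of a chosen field. Both assumptions, along with the field and function names, are illustrative; the patent does not define what the features are.

```python
def extract_features(record, keys):
    """Return the features of a record: the values of the chosen fields."""
    return {key: record.get(key) for key in keys}


def includes_target(crawled_records, target_features):
    """Return True if any crawled record's features match the target features."""
    keys = tuple(target_features)
    return any(extract_features(record, keys) == target_features
               for record in crawled_records)


# Hypothetical crawled posts and target features.
crawled = [
    {"forum": "auto", "month": "2018-09", "title": "New model"},
    {"forum": "auto", "month": "2018-08", "title": "Road test"},
]
target = {"forum": "auto", "month": "2018-08"}
matched = includes_target(crawled, target)
```

Only the fields named in the target are compared, so extra fields on a crawled record (such as "title") do not prevent a match.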
Based on the above embodiment, the determining module is configured to determine whether the crawled data includes target data according to whether preset content is obtained, where the preset content is generated in the current page after the crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
Based on the above embodiment, the preset content includes: at least one of a page element and page request data.
Based on the above embodiments, the crawling parameters include: the number of single page operations, the number of page turning times, the crawling path depth, the number of pages to be crawled by the crawler tool at the next node and at least one of the ratio of the number of nodes behind the node where the crawler tool is located to the total number of nodes.
Here, it should be noted that: the above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus according to the invention, reference is made to the description of the embodiments of the method according to the invention for understanding.
Based on the same inventive concept, the embodiment of the present invention further provides an electronic device, which may include the apparatus for terminating crawling by a crawler tool in the foregoing embodiment; the electronic device may be a server, a personal computer, or the like. Fig. 3 is a schematic structural diagram of an electronic device in an embodiment of the present invention. Referring to fig. 3, the electronic device 300 includes: at least one processor 301; and at least one memory 302 and a bus 303 connected to the processor 301; wherein the processor 301 and the memory 302 communicate with each other through the bus 303. The processor 301 is used for calling program instructions in the memory 302 to execute the method for terminating crawling by the crawler tool in one or more of the embodiments described above. The processor 301 is configured to: obtain a crawling result of the crawler tool; judge whether the crawling result meets a termination condition, wherein the termination condition can be configured according to the crawling requirement; and, if the crawling result meets the termination condition, control the crawler tool to finish crawling.
Based on the above embodiment, the processor is configured to determine whether the crawled data includes target data when the crawling result includes the crawled data; if yes, judging that the crawling result meets a termination condition; and/or judging whether the value of the crawling parameter reaches a preset value or not when the crawling result comprises the crawling parameter; and if so, judging that the crawling result meets the termination condition.
Based on the above embodiment, the processor is configured to obtain the features of the crawled data and the features of the target data respectively; determining whether the crawled data comprises target data or not according to a comparison result of the characteristics of the crawled data and the characteristics of the target data; if the comparison result is that the features of the crawled data are matched with the features of the target data, determining that the crawled data comprise the target data; and if the comparison result is that the features of the crawled data are not matched with the features of the target data, determining that the crawled data do not comprise the target data.
Based on the above embodiment, the processor is configured to determine whether the crawled data includes target data according to whether preset content is obtained, wherein the preset content is generated in the current page after the crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
Based on the above embodiment, the preset content includes: at least one of a page element and page request data.
Based on the above embodiments, the crawling parameters include: the number of single page operations, the number of page turning times, the crawling path depth, the number of pages to be crawled by the crawler tool at the next node and at least one of the ratio of the number of nodes behind the node where the crawler tool is located to the total number of nodes.
Here, it should be noted that: the above description of the embodiments of the electronic device is similar to the description of the embodiments of the method described above, and has similar advantageous effects to the embodiments of the method. For technical details not disclosed in the embodiments of the electronic device according to the embodiments of the present invention, please refer to the description of the method embodiments of the present invention.
Based on the same inventive concept, the present invention also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for terminating crawling by crawler tool in one or more embodiments as described above.
Here, it should be noted that: the above description of the computer-readable storage medium embodiments is similar to the description of the method embodiments described above, with similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the computer-readable storage medium of the embodiments of the present invention, reference is made to the description of the method embodiments of the present invention for understanding.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method of terminating crawling of a crawler, the method comprising:
obtaining a crawling result of a crawler tool;
judging whether the crawling result meets a termination condition or not, wherein the termination condition can be configured according to the crawling requirement;
and if the crawling result meets the termination condition, controlling the crawler tool to finish crawling.
2. The method of claim 1, wherein the determining whether the crawl results satisfy termination conditions comprises:
when the crawling result comprises crawled data, judging whether the crawled data comprises target data; if yes, judging that the crawling result meets a termination condition; and/or,
when the crawling result comprises a crawling parameter, judging whether the value of the crawling parameter reaches a preset value; and if so, judging that the crawling result meets a termination condition.
3. The method of claim 2, wherein the determining whether the crawled data includes target data comprises:
respectively obtaining the characteristics of the crawled data and the characteristics of the target data;
determining whether the crawled data comprises the target data or not according to a comparison result of the characteristics of the crawled data and the characteristics of the target data;
if the comparison result is that the features of the crawled data are matched with the features of the target data, determining that the crawled data comprise the target data; and if the comparison result is that the features of the crawled data are not matched with the features of the target data, determining that the crawled data do not comprise the target data.
4. The method of claim 2, wherein the determining whether the crawled data includes target data comprises:
determining whether the crawled data comprises the target data according to whether preset content is obtained or not, wherein the preset content is generated in a current page after the crawler tool crawls the target data;
if the preset content is obtained, determining that the crawled data comprises the target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
5. The method of claim 4, wherein the preset content comprises: at least one of a page element and page request data.
6. The method of claim 2, wherein the crawling parameters comprise: the number of single page operations, the number of page turning times, the crawling path depth, the number of pages to be crawled by the crawler tool at the next node, and at least one of the ratio of the number of nodes behind the node where the crawler tool is located to the total number of nodes.
7. An apparatus for terminating crawling of a crawler, the apparatus comprising:
an acquisition module configured to obtain a crawling result of a crawler tool;
a judging module configured to judge whether the crawling result meets a termination condition, wherein the termination condition can be configured according to crawling requirements;
a control module configured to control the crawler tool to end crawling if the crawling result satisfies the termination condition.
8. The apparatus of claim 7, wherein:
the judging module is configured to judge whether the crawled data comprises target data or not when the crawling result comprises the crawled data; if yes, judging that the crawling result meets a termination condition; and/or judging whether the value of the crawling parameter reaches a preset value or not when the crawling result comprises the crawling parameter; if so, judging that the crawling result meets a termination condition; and/or,
the judging module is configured to obtain the features of the crawled data and the features of the target data respectively; determining whether the crawled data comprises target data or not according to a comparison result of the characteristics of the crawled data and the characteristics of the target data; if the comparison result is that the features of the crawled data are matched with the features of the target data, determining that the crawled data comprise the target data; if the comparison result is that the features of the crawled data are not matched with the features of the target data, determining that the crawled data do not comprise the target data; and/or,
the judging module is configured to determine whether the crawled data comprises target data according to whether preset content is obtained, wherein the preset content is generated in a current page after the crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
CN201811145418.6A 2018-09-29 2018-09-29 Method and device for stopping crawling of crawler tool Active CN110968770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811145418.6A CN110968770B (en) 2018-09-29 2018-09-29 Method and device for stopping crawling of crawler tool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811145418.6A CN110968770B (en) 2018-09-29 2018-09-29 Method and device for stopping crawling of crawler tool

Publications (2)

Publication Number Publication Date
CN110968770A true CN110968770A (en) 2020-04-07
CN110968770B CN110968770B (en) 2023-09-05

Family

ID=70027161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811145418.6A Active CN110968770B (en) 2018-09-29 2018-09-29 Method and device for stopping crawling of crawler tool

Country Status (1)

Country Link
CN (1) CN110968770B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650570A (en) * 2020-12-29 2021-04-13 百果园技术(新加坡)有限公司 Dynamically expandable distributed crawler system, data processing method and device
CN113419781A (en) * 2021-07-19 2021-09-21 湖南四方天箭信息科技有限公司 Crawler method and device based on Chrome plug-in, computer equipment and storage medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8162410B2 (en) * 2004-12-20 2012-04-24 Tokyo Institute Of Technology Endless elongated member for crawler and crawler unit
US20080284244A1 (en) * 2004-12-20 2008-11-20 Tokyo Institute Of Technology Endless Elongated Member for Crawler and Crawler Unit
JP2009179286A (en) * 2008-02-01 2009-08-13 Mitsubishi Agricult Mach Co Ltd Working vehicle
CN102646129A (en) * 2012-03-09 2012-08-22 武汉大学 Topic-relative distributed web crawler system
CN104281608A (en) * 2013-07-08 2015-01-14 上海锐英软件技术有限公司 Emergency analyzing method based on microblogs
CN104408195A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Crawler working state judging method and device
CN106407218A (en) * 2015-07-31 2017-02-15 北京国双科技有限公司 Navigation webpage detection method and device
CN105279272A (en) * 2015-10-30 2016-01-27 南京未来网络产业创新有限公司 Content aggregation method based on distributed web crawlers
CN105630987A (en) * 2015-12-25 2016-06-01 北京搜狗科技发展有限公司 User agent self-adaption uniform resource locator prefix mining method and device
CN106021257A (en) * 2015-12-31 2016-10-12 广州华多网络科技有限公司 Method, device, and system for crawler to capture data supporting online programming
CN107025230A (en) * 2016-01-29 2017-08-08 北京国双科技有限公司 The processing method and processing device of web crawlers
CN105760508A (en) * 2016-02-23 2016-07-13 北京搜狗科技发展有限公司 Information push method and device and electronic equipment
CN105740460A (en) * 2016-02-24 2016-07-06 中国科学技术信息研究所 Webpage collection recommendation method and device
CN107885820A (en) * 2017-11-07 2018-04-06 北京小度互娱科技有限公司 Breadth traversal orientation grasping means based on crawler system
CN108415941A (en) * 2018-01-29 2018-08-17 湖北省楚天云有限公司 A kind of spiders method, apparatus and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
荆文鹏等: "自适应遗传算法在主题爬虫搜索策略中的应用研究", 《计算机科学》, pages 254 - 257 *

Also Published As

Publication number Publication date
CN110968770B (en) 2023-09-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant