CN112100083A - Crawler template change monitoring method and system, electronic equipment and storage medium - Google Patents

Publication number
CN112100083A
Authority
CN
China
Prior art keywords
crawler
url
detail
data
return value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011265722.1A
Other languages
Chinese (zh)
Other versions
CN112100083B (en)
Inventor
王琛
李青龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Smart Starlight Information Technology Co ltd
Original Assignee
Beijing Smart Starlight Information Technology Co ltd
Application filed by Beijing Smart Starlight Information Technology Co ltd filed Critical Beijing Smart Starlight Information Technology Co ltd
Priority to CN202011265722.1A priority Critical patent/CN112100083B/en
Publication of CN112100083A publication Critical patent/CN112100083A/en
Application granted granted Critical
Publication of CN112100083B publication Critical patent/CN112100083B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3612 Software analysis for verifying properties of programs by runtime analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a system for monitoring crawler template change, electronic equipment and a storage medium, wherein the method comprises the following steps: searching a crawler script without data in a crawler script library; storing the configuration id corresponding to all the searched crawler scripts without data into a crawler script database without data; adding the configuration id in the data-free crawler script database into a detection queue for data-free detection; obtaining code information of a crawler script corresponding to the configuration id according to the configuration id; according to the code information, obtaining a URL set in the crawler script, and traversing and downloading each URL in the URL set to obtain a downloading result value of each URL; and determining whether the crawler template is changed or not according to whether the URL downloading result value is empty or not, whether the URL request response state code is equal to a first preset state code or not, whether the URL detail link quantity is greater than zero or not and whether the callback return value of the callback function in the three-layer template has a value or not. The method realizes automatic crawler template change monitoring by monitoring a plurality of return values.

Description

Crawler template change monitoring method and system, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of data monitoring, in particular to a method and a system for monitoring crawler template change, electronic equipment and a storage medium.
Background
The problem can be found in time by monitoring the quality of the crawler data, and the reliability of the data is ensured.
Typically, monitoring crawler data quality includes the following steps: first, determining whether the spider file runs normally (that is, whether it produces data); if there is data, receiving the data source for detection and comparing it against rules; if there is no data, running detection with preset methods and rules under an operation framework. The specific monitoring flow is shown in FIG. 1. The methods and rules used in these steps have to be worked out manually from experience, so the approach is not intelligent enough.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and a system for monitoring a crawler template change, an electronic device, and a storage medium, so as to solve the problem in the prior art that monitoring of the crawler template change is not intelligent enough.
Therefore, the embodiment of the invention provides the following technical scheme:
According to a first aspect, an embodiment of the present invention provides a method for monitoring a crawler template change, including: searching a crawler script library for no-data crawler scripts, wherein the crawler script library comprises no-data crawler scripts, which cannot crawl data, and data crawler scripts, which can crawl data, and each crawler script uniquely corresponds to a configuration id; storing the configuration ids corresponding to all of the no-data crawler scripts found into a no-data crawler script database; adding the configuration ids in the no-data crawler script database to a detection queue for no-data detection; obtaining, according to each configuration id, code information of the crawler script corresponding to that configuration id; obtaining a URL set in the crawler script according to the code information, and traversing the URL set to download each URL and obtain a download result value for each URL; judging whether the URL download result value is null; if the URL download result value is not null, judging whether the URL request response status code is equal to a first preset status code; if the URL request response status code is equal to the first preset status code, performing crawler parsing on the URL to obtain the number of URL detail links; judging whether the number of URL detail links is greater than zero; if the number of URL detail links is less than or equal to zero, indicating that the list page template has changed; if the number of URL detail links is greater than zero, judging whether the callback return value of the callback function in the three-layer template has a value; if the callback return value has no value, the URL has no secondary link, and two-layer template detection is performed; and if the callback return value has a value, the URL has a secondary link, and three-layer template detection is performed.
Optionally, if the callback return value has no value, the URL has no secondary link, and the step of performing two-layer template detection includes: if the callback return value has no value, downloading the list link to obtain a list link downloading return value; judging whether the list link download return value is null or not; if the list link download return value is not null, judging whether the list link request status code is equal to a second preset status code; and if the list link request state code is equal to the second preset state code, entering a detail page for detection.
Optionally, if the callback return value has a value, the URL has a secondary link, and the step of performing three-tier template detection includes: if the callback return value has a value, downloading a secondary list page link to obtain a secondary list page link downloading return value; judging whether the link downloading return value of the secondary list page is empty or not; if the secondary list page link download return value is not null, judging whether the secondary list page request status code is equal to a third preset status code; if the second-level list page request state code is equal to the third preset state code, entering second-level crawler analysis to obtain the number of detail links of the second-level list page; judging whether the number of detail links of the secondary list page is greater than zero or not; if the number of the detail links of the secondary list page is less than or equal to zero, indicating that the template of the secondary list page is changed; if the number of the secondary list detail links is larger than zero, traversing and downloading the secondary list detail page links to obtain a secondary list detail page downloading return value; judging whether the download return value of the secondary list detail page is empty or not; if the download return value of the secondary list detail page is not null, judging whether the request status code of the secondary list detail page is equal to a fourth preset status code; and if the second-level list detail page request status code is equal to a fourth preset status code, entering a detail page for detection.
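The three-layer walk described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `three_layer_check`, `fetch`, `parse_links`, the outcome strings, and the use of 200 for the third and fourth preset status codes are all hypothetical, since the text does not name them.

```python
def three_layer_check(fetch, parse_links, secondary_url, ok_status=200):
    """Return a list of (url, outcome) pairs for the three-layer walk.

    fetch(url) -> (body, status) and parse_links(body) -> list of links are
    hypothetical callables supplied by the caller. ok_status stands in for
    the third and fourth preset status codes (assumed equal here).
    """
    body, status = fetch(secondary_url)
    if not body:
        return [(secondary_url, "secondary list page download empty")]
    if status != ok_status:
        return [(secondary_url, f"secondary list page status {status}")]

    links = parse_links(body)
    if len(links) <= 0:
        # No detail links could be parsed from the secondary list page.
        return [(secondary_url, "secondary list page template changed")]

    outcomes = []
    for link in links:
        body, status = fetch(link)
        if not body:
            outcomes.append((link, "detail page download empty"))
        elif status != ok_status:
            outcomes.append((link, f"detail page status {status}"))
        else:
            outcomes.append((link, "enter detail page detection"))
    return outcomes
```

Passing `fetch` and `parse_links` in as arguments keeps the sketch independent of any particular crawler framework and makes each branch easy to exercise with stubbed pages.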
Optionally, the step of entering the detail page for detection includes: downloading the first detail function to obtain a first detail function downloading return value; judging whether the first detail function downloading return value has a value or not; if the first detail function download return value has no value, the detail template is changed; if the first detail function download return value has a value, detecting a title field and a content field; when the title field and the content field can be analyzed, the detail template is not changed; when the title field and the content field cannot be resolved, the detail template is changed.
Optionally, the step of entering the detail page for detection further includes: if a parsing exception occurs while the first detail function is being downloaded, indicating that parsing of the crawler script has failed.
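As a rough illustration of the detail page detection just described, the following Python sketch treats the "first detail function" as a callable that parses a detail page into a dictionary. The names `check_detail_page` and `parse_detail`, the dictionary keys, and the outcome strings are assumptions made for the example. The text does not say what happens when only one of the two fields can be parsed; the sketch treats that case as a template change.

```python
def check_detail_page(parse_detail, html):
    """Classify a detail page using a hypothetical detail-parsing callable."""
    try:
        item = parse_detail(html)
    except Exception:
        # A parsing exception means the crawler script itself failed to parse.
        return "crawler script parsing failed"
    if not item:
        # The detail function returned no value.
        return "detail template changed"
    if item.get("title") and item.get("content"):
        return "detail template unchanged"
    # Assumption: a missing title or content field counts as a change.
    return "detail template changed"
```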
Optionally, if the list link download return value is null, and/or the list link request status code is not equal to the second preset status code, and/or the second-level list detail page request status code is not equal to a fourth preset status code, the method further includes: downloading a second detail function to obtain a second detail function downloading return value; judging whether the download return value of the second detail function is null or not; if the download return value of the second detail function is null, performing link filtering; and if the download return value of the second detail function is not null, indicating that the link download of the detail page fails.
Optionally, the method further comprises: searching a crawler script with data in a crawler script library; storing the configuration id corresponding to all the searched crawler scripts with data into a crawler script database with data; obtaining a corresponding data source according to the configuration id in the database with the data crawler script; monitoring each field in the data source according to field verification information, wherein the field verification information comprises at least one of time error monitoring, field character length monitoring, title length monitoring, configuration grouping error monitoring, channel error monitoring, website name error monitoring, website domain name error monitoring, grouping error monitoring, field error monitoring and detail page URL monitoring.
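Field verification of this kind could look like the following Python sketch, which covers three of the listed checks (time error, title length, and detail page URL). The function name `verify_record`, the record keys, the 200-character title limit, and the specific rules are illustrative assumptions; the text does not define the verification rules themselves.

```python
from datetime import datetime
from urllib.parse import urlparse


def verify_record(record, now, max_title_len=200):
    """Return a list of error labels for one crawled record (hypothetical rules)."""
    errors = []

    # Time error: the publish time must be a datetime that is not in the future.
    published = record.get("time")
    if not isinstance(published, datetime) or published > now:
        errors.append("time error")

    # Title length: the title must be non-empty and within the assumed limit.
    title = record.get("title") or ""
    if not 0 < len(title) <= max_title_len:
        errors.append("title length error")

    # Detail page URL: must be an absolute http(s) URL with a host.
    parsed = urlparse(record.get("url") or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("detail page URL error")

    return errors
```

Returning a list of labels rather than a single boolean lets a monitoring report show every failed check for a record at once.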
According to a second aspect, an embodiment of the present invention provides a system for monitoring a crawler template change, including: a first processing module, configured to search a crawler script library for no-data crawler scripts, wherein the crawler script library comprises no-data crawler scripts, which cannot crawl data, and data crawler scripts, which can crawl data, and each crawler script uniquely corresponds to a configuration id; a second processing module, configured to store the configuration ids corresponding to all of the no-data crawler scripts found into a no-data crawler script database; a third processing module, configured to add the configuration ids in the no-data crawler script database to a detection queue for no-data detection; a fourth processing module, configured to obtain, according to each configuration id, code information of the crawler script corresponding to that configuration id; a fifth processing module, configured to obtain a URL set in the crawler script according to the code information, and to traverse the URL set to download each URL and obtain a download result value for each URL; a first judging module, configured to judge whether the URL download result value is null; a second judging module, configured to judge whether the URL request response status code is equal to a first preset status code if the URL download result value is not null; a sixth processing module, configured to perform crawler parsing on the URL to obtain the number of URL detail links if the URL request response status code is equal to the first preset status code; a third judging module, configured to judge whether the number of URL detail links is greater than zero; a seventh processing module, configured to indicate that the list page template has changed if the number of URL detail links is less than or equal to zero; a fourth judging module, configured to judge whether the callback return value of the callback function in the three-layer template has a value if the number of URL detail links is greater than zero; an eighth processing module, configured to perform two-layer template detection if the callback return value has no value, in which case the URL has no secondary link; and a ninth processing module, configured to perform three-layer template detection if the callback return value has a value, in which case the URL has a secondary link.
Optionally, the eighth processing module includes: the first processing unit is used for downloading the list link if the callback return value has no value, so as to obtain a list link downloading return value; a first judging unit, configured to judge whether the list link download return value is empty; a second judging unit, configured to judge whether the list link request status code is equal to a second preset status code if the list link download return value is not null; and the second processing unit is used for entering a detail page for detection if the list link request state code is equal to the second preset state code.
Optionally, the ninth processing module includes: the third processing unit is used for downloading the secondary list page link if the callback return value has a value, so as to obtain a secondary list page link downloading return value; a third judging unit, configured to judge whether the secondary list page link download return value is empty; a fourth judging unit, configured to judge whether the secondary list page request status code is equal to a third preset status code if the secondary list page link download return value is not null; the fourth processing unit is used for entering secondary crawler analysis to obtain the number of detail links of the secondary list page if the secondary list page request status code is equal to the third preset status code; a fifth judging unit, configured to judge whether the number of detail links of the secondary list page is greater than zero; the fifth processing unit is used for indicating that the secondary list page template is changed if the number of the detail links of the secondary list page is less than or equal to zero; the sixth processing unit is used for traversing and downloading the secondary list detail page links to obtain a secondary list detail page downloading return value if the number of the secondary list page detail links is greater than zero; a sixth judging unit, configured to judge whether the download return value of the secondary list detail page is empty; a seventh processing unit, configured to determine whether the second-level list detail page request status code is equal to a fourth preset status code if the second-level list detail page download return value is not null; and the eighth processing unit is used for entering the detail page for detection if the second-level list detail page request status code is equal to a fourth preset status code.
Optionally, the second processing unit or the eighth processing unit includes: the first processing subunit is used for downloading the first detail function to obtain a first detail function downloading return value; the judging subunit is used for judging whether the first detail function downloading return value has a value or not; the second processing subunit is configured to indicate that the detail template is changed if the first detail function download return value has no value; a third processing subunit, configured to detect a title field and a content field if the first detail function download return value has a value; a fourth processing subunit, configured to indicate that the detail template is not changed when the title field and the content field can be resolved; and the fifth processing subunit is used for indicating the detail template change when the title field and the content field cannot be resolved.
Optionally, the second processing unit or the eighth processing unit further includes: a sixth processing subunit, configured to indicate that parsing of the crawler script has failed if a parsing exception occurs while the first detail function is being downloaded.
Optionally, the system further comprises: a tenth processing module, configured to download the second detail function to obtain a second detail function download return value; a fifth judging module, configured to judge whether the second detail function download return value is null; an eleventh processing module, configured to perform link filtering if the second detail function download return value is null; and a twelfth processing module, configured to indicate that the link download of the detail page has failed if the second detail function download return value is not null.
Optionally, the system further comprises: a thirteenth processing module, configured to search the crawler script library for data crawler scripts; a fourteenth processing module, configured to store the configuration ids corresponding to all of the data crawler scripts found into a data crawler script database; a fifteenth processing module, configured to obtain a corresponding data source according to the configuration id in the data crawler script database; and a sixteenth processing module, configured to monitor each field in the data source according to field verification information, where the field verification information includes at least one of time error monitoring, field character length monitoring, title length monitoring, configuration grouping error monitoring, channel error monitoring, website name error monitoring, website domain name error monitoring, grouping error monitoring, field error monitoring, and detail page URL monitoring.
According to a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the method for monitoring a change in a crawler template as described in any one of the above first aspects.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause a computer to execute the monitoring method for crawler template change described in any one of the first aspect.
The technical scheme of the embodiment of the invention has the following advantages:
The embodiment of the invention provides a method and a system for monitoring crawler template change, electronic equipment and a storage medium, wherein the method comprises the following steps: searching a crawler script library for no-data crawler scripts, wherein the crawler script library comprises no-data crawler scripts, which cannot crawl data, and data crawler scripts, which can crawl data, and each crawler script uniquely corresponds to a configuration id; storing the configuration ids corresponding to all of the no-data crawler scripts found into a no-data crawler script database; adding the configuration ids in the no-data crawler script database to a detection queue for no-data detection; obtaining, according to each configuration id, code information of the crawler script corresponding to that configuration id; obtaining a URL set in the crawler script according to the code information, and traversing the URL set to download each URL and obtain a download result value for each URL; judging whether the URL download result value is null; if the URL download result value is not null, judging whether the URL request response status code is equal to a first preset status code; if the URL request response status code is equal to the first preset status code, performing crawler parsing on the URL to obtain the number of URL detail links; judging whether the number of URL detail links is greater than zero; if the number of URL detail links is less than or equal to zero, indicating that the list page template has changed; if the number of URL detail links is greater than zero, judging whether the callback return value of the callback function in the three-layer template has a value; if the callback return value has no value, the URL has no secondary link, and two-layer template detection is performed; and if the callback return value has a value, the URL has a secondary link, and three-layer template detection is performed. By monitoring a plurality of return values, the method automatically detects whether the crawler template has changed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a specific example of a prior art method for monitoring crawler template changes;
FIG. 2 is a flowchart of a specific example of a monitoring method for crawler template changes according to an embodiment of the present invention;
FIG. 3 is a flowchart of another specific example of a monitoring method for crawler template changes according to an embodiment of the present invention;
FIG. 4 is a flowchart of another specific example of a monitoring method for crawler template changes according to an embodiment of the present invention;
FIG. 5 is a flowchart of another specific example of a monitoring method for crawler template changes according to an embodiment of the present invention;
FIG. 6 is a flowchart of another specific example of a monitoring method for crawler template changes according to an embodiment of the present invention;
FIG. 7 is a flowchart of another specific example of a monitoring method for crawler template changes according to an embodiment of the present invention;
FIG. 8 is a flowchart of another specific example of a monitoring method for crawler template changes according to an embodiment of the present invention;
FIG. 9 is a schematic diagram illustrating a template change monitoring result of the monitoring method for crawler template change according to the embodiment of the present invention;
FIG. 10 is a diagram illustrating field verification detection results of a monitoring method for crawler template changes according to an embodiment of the present invention;
FIG. 11 is a block diagram of one particular example of a monitoring system for crawler template changes in accordance with an embodiment of the present invention;
FIG. 12 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for monitoring a change of a crawler template, as shown in fig. 2, the method may include steps S1-S13.
Step S1: searching a crawler script library for no-data crawler scripts, wherein the crawler script library comprises no-data crawler scripts, which cannot crawl data, and data crawler scripts, which can crawl data, and each crawler script uniquely corresponds to a configuration id.
In practice, a suitable crawler script is selected from the crawler script library according to the actual data crawling requirements and is run to obtain the required data. Each crawler script in the crawler script library uniquely corresponds to a configuration id. The configuration id is the identity information of the crawler script; like a person's identity card number, it is unique, so the corresponding crawler script can be found through it. The crawler script library contains a plurality of crawler scripts, some of which are data crawler scripts and some of which are no-data crawler scripts. A data crawler script is a script that crawls the corresponding data when its code is executed, and a no-data crawler script is a script that fails to crawl the corresponding data when its code is executed.
In this embodiment, the data crawled by each crawler script is checked over a preset period to determine whether the crawler script is a no-data crawler script or a data crawler script. Specifically, the preset period may be 3 days, 4 days, or another value, and may be set as needed.
The data detection works as follows: if crawler data is detected within the preset period, the crawler script is a data crawler script; if no crawler data is detected within the preset period, the crawler script is a no-data crawler script. In this embodiment, the preset period is 3 days. If data is detected at any moment within the 3 days, the crawler script can crawl data and is a data crawler script; if no crawler data is detected within the 3 days, the crawler script is a no-data crawler script. Specifically, a threshold of three days is set for a field in a database table, and the judgment is made on the data warehousing day number field corresponding to the crawler script id in the database table.
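The classification can be sketched in Python as below. This is an illustrative sketch only: `ScriptRecord`, `split_by_data`, and the field name `days_since_last_data` are hypothetical, and the sketch assumes the "data warehousing day number" field holds the number of days since data was last stored, with a value equal to the threshold still counting as "within" the period.

```python
from dataclasses import dataclass

PRESET_PERIOD_DAYS = 3  # threshold used in this embodiment; configurable


@dataclass
class ScriptRecord:
    config_id: int             # unique configuration id of the crawler script
    days_since_last_data: int  # assumed meaning of the day-number field


def split_by_data(records, period_days=PRESET_PERIOD_DAYS):
    """Return (data_ids, no_data_ids) for the given script records."""
    data_ids, no_data_ids = [], []
    for rec in records:
        # Assumption: data stored within the period (inclusive) means the
        # script is still crawling data.
        if rec.days_since_last_data <= period_days:
            data_ids.append(rec.config_id)
        else:
            no_data_ids.append(rec.config_id)
    return data_ids, no_data_ids
```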
Step S2: storing the configuration ids corresponding to all of the no-data crawler scripts found into a no-data crawler script database.
As an exemplary embodiment, a crawler script may fail to crawl data for many reasons, for example a website change or a problem in the spider file (crawler script). These no-data crawler scripts need further detection to determine which link in the chain has the problem that prevents crawling, so that it can be fixed later. Therefore, the configuration ids corresponding to all of the no-data crawler scripts found are stored in the no-data crawler script database for subsequent detection.
Step S3: adding the configuration ids in the no-data crawler script database to a detection queue for no-data detection.
As an exemplary embodiment, none of the crawler scripts in the no-data crawler script database can crawl data, and the configuration ids in the no-data crawler script database are added to a detection queue for detection.
Step S4: obtaining, according to the configuration id, code information of the crawler script corresponding to that configuration id.
As an exemplary embodiment, the configuration id is the unique identity information of the crawler script, so the crawler script corresponding to a configuration id can be found from that id, and the code information in the crawler script can then be obtained.
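Steps S2 to S4 can be pictured with the short Python sketch below. It is only a sketch under assumptions: an in-memory list stands in for the no-data crawler script database, `collections.deque` stands in for the detection queue, and `build_detection_queue`, `drain`, and `load_code` are hypothetical names.

```python
from collections import deque


def build_detection_queue(no_data_ids):
    """Store the ids (deduplicated, order preserved) and enqueue them."""
    # Stand-in for the no-data crawler script database (steps S2 and S3).
    no_data_store = list(dict.fromkeys(no_data_ids))
    return no_data_store, deque(no_data_store)


def drain(queue, load_code):
    """Pop each configuration id and look up its script code (step S4).

    load_code is a hypothetical callable mapping a configuration id to the
    code information of the corresponding crawler script.
    """
    results = {}
    while queue:
        config_id = queue.popleft()
        results[config_id] = load_code(config_id)
    return results
```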
Step S5: obtaining a URL set in the crawler script according to the code information, and traversing the URL set to download each URL and obtain a download result value for each URL.
In an exemplary embodiment, the urls in start_url are retrieved according to the code information of the crawler script, and a download request is made for each of them. Here urls is the URL set, which may include several URLs; each URL is requested in turn, and the result of each request, i.e. the URL download result value, may be obtained through resp = self. Specifically, the URL download result value is a string value containing the URL download address, and the download result itself is the HTML source code, for example <html><head><meta http-equiv="Content-Type" and the like, containing a URL (http:). This embodiment is only illustrative and is not limiting.
Step S6: judging whether the URL download result value is null. If the URL download result value is null, the URL download address is empty, that is, it does not exist, and the list page download has therefore failed; if the URL download result value is not null, step S7 is executed.
Step S7: if the URL download result value is not null, judging whether the URL request response status code is equal to the first preset status code. If the URL request response status code is equal to the first preset status code, step S8 is executed; if it is not equal to the first preset status code, the request did not succeed (for example, the server could not find the requested web page), and the list page link download status code is displayed. The list page link download status code reflects the actual response: if it is 404, the server could not find the requested web page; if it is 403, the server refused the request. These codes are given only as examples, and the embodiment is not limited to them.
As an exemplary embodiment, if the URL download result value is not null, it indicates that there is a download address of the URL, so that a download request is sent to the server, and the server returns a URL request response status code after receiving the request. In this embodiment, the first preset status code is 200, which indicates that the server successfully processes the request.
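Steps S5 to S7 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the FakeResponse class, the check_list_page function, the stubbed downloader, and the example.com URLs are all hypothetical names introduced here, and only the None check and the comparison against status code 200 come from the description.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the downloader's response object."""
    status_code: int
    text: str


def check_list_page(url: str,
                    download: Callable[[str], Optional[FakeResponse]],
                    expected_status: int = 200) -> str:
    """Classify one start URL following steps S5-S7."""
    resp = download(url)                      # S5: request the URL
    if resp is None:                          # S6: empty download result
        return "list page download failed"
    if resp.status_code != expected_status:   # S7: compare with preset code
        return f"list page link download status code: {resp.status_code}"
    return "ok"                               # continue with step S8


# Usage with a stubbed downloader (assumed data, no network access).
pages = {
    "http://example.com/list": FakeResponse(200, "<html></html>"),
    "http://example.com/gone": FakeResponse(404, ""),
}
results = {u: check_list_page(u, pages.get)
           for u in ["http://example.com/list",
                     "http://example.com/gone",
                     "http://example.com/missing"]}
```

Passing the downloader in as a parameter keeps the check testable without network access; a real crawler would supply its own download method.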
Step S8: if the URL request response status code is equal to the first preset status code, perform crawler parsing on the URL to obtain the number of URL detail links.
As an exemplary embodiment, if the URL request response status code is equal to the first preset status code, the server has successfully processed the request, and crawler parsing is performed on the URL to obtain the number of URL detail links.
Step S9: judge whether the number of URL detail links is greater than zero. If it is greater than zero, detail page links exist in the URL, and step S11 is executed; if it is less than or equal to zero, the list page template has changed, and the detail links in the URL cannot be parsed because of that change (step S10).
As an exemplary embodiment, the length of the returned url_list can be checked by calling url_list, callback, _ = pointer.
Step S10: if the number of URL detail links is less than or equal to zero, the list page template has changed. In this case no detail page links exist in the URL, the URL does not conform to the two-layer framework, and the case is classified as a list template change.
Step S11: if the number of URL detail links is greater than zero, judge whether the callback return value (callback) of the callback function in the three-layer template has a value. If the callback return value has no value, go to step S12; if it has a value, go to step S13.
As an exemplary embodiment, the parse function returns the url queue together with a callback function name, and the callback return value is that callback function name. If it has a value, the template is a three-layer template and three-layer template detection is performed; if it is empty, the template is a two-layer template and two-layer template detection is performed.
Specifically, the callback return value, i.e. the returned function name self.part_next (also called the value), is obtained from return (url_list, self.
Step S12: and if the callback return value has no value, the URL has no secondary link, and two-layer template detection is carried out.
As an exemplary embodiment, if the callback return value has no value, it indicates that the URL has no secondary link, and the URL is the detail page, which has a two-layer structure (list page and detail page), so that two-layer template detection is performed.
Step S13: if the callback return value has a value, the URL has a secondary link, and three-layer template detection is carried out.
As an exemplary embodiment, if the callback returns a value, it means that the URL has a secondary link, and the secondary link has a detail page, and the structure of the secondary link is three layers (list page, secondary list page, and detail page), so three-layer template detection is performed.
In the above steps, data-free detection is performed on the configuration ids in the data-free crawler script database, and whether the template has changed is judged automatically from the URL download result value, the URL request response status code, the number of URL detail links, and the callback return value. Monitoring of crawler template changes is thus automated, and detection rules no longer need to be worked out manually from experience.
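The branching in steps S9 to S13 can be sketched as below. The function name route_template and the placeholder callback parse_next are hypothetical; the sketch assumes that the list-page parse step yields a list of detail links plus an optional callback, as the description suggests.

```python
from typing import Callable, List, Optional


def route_template(url_list: List[str], callback: Optional[Callable]) -> str:
    """Choose the next detection branch following steps S9-S13."""
    if len(url_list) <= 0:            # S9/S10: no detail links were parsed
        return "list page template changed"
    if callback is None:              # S11/S12: no callback, no secondary link
        return "two-layer template detection"
    return "three-layer template detection"   # S13: callback present


def parse_next(response):
    """Hypothetical second-level parse function used as the callback."""
    return []


# The parse step is assumed to return (url_list, callback, extra).
url_list, callback, _ = (["http://example.com/a"], parse_next, None)
branch = route_template(url_list, callback)
```

Checking the link count before the callback mirrors the order of the steps: an empty link list is reported as a list template change regardless of whether a callback is configured.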
As an exemplary embodiment, if the callback return value has no value in step S12, the URL has no secondary link, and the steps of performing two-tier template detection include steps S1201-S1204.
Step S1201: if the callback return value has no value, download the list link to obtain a list link download return value. When the callback return value has no value, the URL has no secondary link and the layer below the list page is the detail page, so the list link in the list page is downloaded to obtain the list link download return value.
Specifically, the list link download return value is used to determine whether the list link was downloaded successfully. If the return value is None, the download was unsuccessful; if it is not None (specifically, it is html source code), the download was successful.
Step S1202: judge whether the list link download return value is null. If it is null, go to step S1121; if it is not null, go to step S1203.
Step S1203: if the list link download return value is not null, judge whether the list link request status code is equal to a second preset status code. If it is not equal, go to step S1121; if it is equal, go to step S1204.
Step S1204: if the list link request status code is equal to the second preset status code, enter the detail page for detection.
In this embodiment, the second preset status code is 200, which indicates that the server successfully processed the request. When the list link request status code is equal to the second preset status code, the server has processed the list link download request, so the detail page can be entered.
Specifically, the detail page entered for detection may be the detail page corresponding to the first link among the list links. Since all detail pages share the same template, detecting only the first detail page saves detection time. Of course, in other embodiments, multiple detail pages may be detected as needed.
The above steps detect the two-layer framework by judging the list link download return value and comparing the list link request status code with the second preset status code.
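A minimal sketch of the two-layer check in steps S1201 to S1204 follows, including the shortcut of testing only the first link. FakeResponse, two_layer_check, and the stub data are hypothetical names used for illustration; the returned strings simply name the next step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the downloader's response object."""
    status_code: int
    text: str = ""


def two_layer_check(detail_links: List[str],
                    download: Callable[[str], Optional[FakeResponse]],
                    expected_status: int = 200) -> str:
    """Return the next step for a two-layer template (S1201-S1204).

    detail_links is assumed non-empty, since step S9 has already
    confirmed that at least one detail link was parsed.
    """
    first_link = detail_links[0]              # only the first link is checked
    resp = download(first_link)               # S1201: download the list link
    if resp is None:                          # S1202: null return value
        return "go to S1121"
    if resp.status_code != expected_status:   # S1203: status code mismatch
        return "go to S1121"
    return "enter detail page detection"      # S1204


# Usage with assumed stub data.
stub = {"http://example.com/d1": FakeResponse(200, "<html></html>"),
        "http://example.com/d2": FakeResponse(403)}
outcome = two_layer_check(["http://example.com/d1", "http://example.com/d2"],
                          stub.get)
```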
As an exemplary embodiment, if the callback return value has a value in step S13, the URL has a secondary link, and the step of performing three-level template detection includes steps S1301-S1310 as shown in fig. 3.
Step S1301: if the callback return value has a value, download the secondary list page link to obtain a secondary list page link download return value. When the callback return value has a value, a secondary list page lies below the list page and the detail page lies below the secondary list page, which is a three-layer framework; the secondary list page is downloaded to obtain the download return value.
Specifically, the secondary list page link download return value is used to determine whether the secondary list page link was downloaded successfully. If it is None, the download was unsuccessful; if it is not None (specifically, it is html source code), the download was successful.
Step S1302: judge whether the secondary list page link download return value is null. If it is null, the download address of the secondary list page is null, i.e. no download address exists, and the secondary list page link download therefore fails; if it is not null, step S1303 is executed.
Step S1303: if the secondary list page link download return value is not null, judge whether the secondary list page request status code is equal to a third preset status code. A non-null return value indicates that a secondary list page download address exists, so the request status code is checked next. In this embodiment, the third preset status code is 200. If the secondary list page request status code is equal to the third preset status code, go to step S1304; if it is not equal, the request was not processed successfully, and the list page link download status code is displayed. The list page link download status code is taken from the actual response: for example, 404 means that the server cannot find the requested web page, and 403 means that the server refuses the request. These values are only illustrative and are not limiting.
Step S1304: if the secondary list page request status code is equal to the third preset status code, enter secondary crawler parsing to obtain the number of detail links of the secondary list page. A status code equal to the third preset status code indicates that the server successfully processed the request.
Step S1305: judge whether the number of detail links of the secondary list page is greater than zero. If it is not greater than zero, the secondary list page template has changed, and step S1306 is executed; if it is greater than zero, step S1307 is executed.
Step S1306: if the number of detail links of the secondary list page is less than or equal to zero, the secondary list page template has changed. In this case no detail page links exist in the secondary list page, the page does not conform to the three-layer framework, and the case is classified as a list template change.
Step S1307: if the number of detail links of the secondary list page is greater than zero, traverse and download the secondary list detail page links to obtain a secondary list detail page download return value.
Step S1308: judge whether the secondary list detail page download return value is null. If it is null, the secondary list detail page download fails; if it is not null, step S1309 is executed.
Step S1309: if the secondary list detail page download return value is not null, judge whether the secondary list detail page request status code is equal to a fourth preset status code. If it is not equal, go to step S1121; if it is equal, go to step S1310.
Step S1310: if the secondary list detail page request status code is equal to the fourth preset status code, enter the detail page for detection.
The above steps detect the three-layer framework by judging the secondary list page link download return value, the secondary list page request status code, the number of detail links of the secondary list page, and the secondary list detail page download return value.
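The three-layer flow in steps S1301 to S1310 can be sketched as one function. All names here (FakeResponse, three_layer_check, the comma-separated stub parser) are hypothetical, and for brevity the sketch checks only the first secondary-list detail link, whereas the description traverses the links.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the downloader's response object."""
    status_code: int
    text: str = ""


def three_layer_check(secondary_url: str,
                      download: Callable[[str], Optional[FakeResponse]],
                      parse_secondary: Callable[[FakeResponse], List[str]],
                      expected_status: int = 200) -> str:
    """Walk one secondary list page through steps S1301-S1310."""
    resp = download(secondary_url)                    # S1301
    if resp is None:                                  # S1302
        return "secondary list page download failed"
    if resp.status_code != expected_status:           # S1303
        return f"list page link download status code: {resp.status_code}"
    detail_links = parse_secondary(resp)              # S1304
    if len(detail_links) <= 0:                        # S1305/S1306
        return "secondary list page template changed"
    detail_resp = download(detail_links[0])           # S1307 (first link only)
    if detail_resp is None:                           # S1308
        return "secondary list detail page download failed"
    if detail_resp.status_code != expected_status:    # S1309
        return "go to S1121"
    return "enter detail page detection (S1310)"


# Usage with assumed stub data: the stub parser reads comma-separated links.
pages = {"sec": FakeResponse(200, "a,b"), "a": FakeResponse(200, "<html>")}
split_links = lambda r: [s for s in r.text.split(",") if s]
verdict = three_layer_check("sec", pages.get, split_links)
```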
As an exemplary embodiment, the detail page detection entered in step S1204 and step S1310 includes steps S1111-S1116, as shown in fig. 4.
Step S1111: download the first detail function to obtain a first detail function download return value.
In this embodiment, the first detail function parses and acquires the fields in the detail page, including the title, release time, body text, author, source, and the like, and is used to judge whether the title and the body text can be parsed; if they cannot, the detail page has not been parsed.
Specifically, the first detail function may be the detail function def part_detail_page(self, response=None, url=None), which downloads the detail link and parses fields such as the title, release time, body text, and source via xpath.
The first detail function download return value is used to judge whether the specific content in the detail page has been parsed. If the return value has no value (None), the specific content has not been parsed and the detail page template has changed. If the return value has a value (specifically, html source code), the specific content has been parsed by xpath, and it must be further judged whether the parsed content is correct.
Step S1112: judge whether the first detail function download return value has a value. If it has no value, step S1113 is executed; if it has a value, step S1114 is executed.
Step S1113: if the first detail function download return value has no value, the detail template has changed. Specifically, a return value without a value indicates that no fields were obtained from the detail page, so the detail template has changed.
Step S1114: if the first detail function download return value has a value, detect the title field and the content field. Specifically, when the return value has a value, further detection is required to determine whether the corresponding fields can be parsed.
Step S1115: when the title field and the content field can be parsed, the detail template has not changed.
Step S1116: when the title field and the content field cannot be parsed, the detail template has changed.
The above steps determine whether the detail template has changed by judging the first detail function download return value and the parsing results of the title field and the content field.
As an exemplary embodiment, the detail page detection further includes step S1117.
Step S1117: during the download of the first detail function, if a parsing exception occurs, parsing of the crawler script has failed.
Specifically, if a parsing exception occurs during the download of the first detail function, i.e. parsing cannot be performed, a problem has occurred in the crawler code and parsing of the crawler script has failed. Possible causes include non-standard or incorrect code; this is only an example and is not limiting.
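The detail page detection in steps S1111 to S1117 can be sketched as follows. The function check_detail_page is hypothetical, and the sketch assumes the detail function returns either None or a dict with "title" and "content" keys; the description only says the return value either has a value or does not, so the dict shape is an assumption made for illustration.

```python
from typing import Callable, Dict, Optional


def check_detail_page(parse_detail: Callable[[], Optional[Dict[str, str]]]) -> str:
    """Classify a detail page following steps S1111-S1117."""
    try:
        fields = parse_detail()                   # S1111: run the detail function
    except Exception:                             # S1117: parsing raised an error
        return "crawler script parsing failed"
    if fields is None:                            # S1112/S1113: no return value
        return "detail template changed"
    # S1114: both title and content must be present and non-empty.
    if fields.get("title") and fields.get("content"):
        return "detail template unchanged"        # S1115
    return "detail template changed"              # S1116


def broken_parser():
    """Hypothetical detail function whose xpath code is faulty."""
    raise ValueError("bad xpath expression")


status = check_detail_page(lambda: {"title": "Headline", "content": "Body"})
```

Catching the exception separately keeps a faulty crawler script distinct from a changed website template, which is the distinction step S1117 draws.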
As an exemplary embodiment, if the list link download return value is null, the list link request status code is not equal to the second preset status code, or the secondary list detail page request status code is not equal to the fourth preset status code, the method further includes steps S1121-S1124, as shown in fig. 5.
Step S1121: download the second detail function to obtain a second detail function download return value.
In this embodiment, the second detail function is the same as the first detail function. Specifically, the second detail function may be the detail function def part_detail_page(self, response=None, url=None), which downloads the detail link and parses fields such as the title, release time, body text, and source via xpath.
The second detail function download return value is used to judge whether the specific content in the detail page has been parsed. If the return value has no value (None), the specific content has not been parsed; the crawler code has no problem, and the system may have raised a false alarm. If the return value has a value (which may specifically be html source code), the specific content has been parsed, and it must be further judged whether the content parsed by xpath is correct.
Step S1122: judge whether the second detail function download return value is null. If it is null, step S1123 is executed; if it is not null, step S1124 is executed.
Step S1123: if the second detail function download return value is null, perform link filtering.
Specifically, when the second detail function download return value is null, the crawler code has no problem and the result may be a system false alarm, possibly caused by the website itself. Link filtering is performed so that the filtered links can be detected again after a period of time.
Step S1124: if the second detail function download return value is not null, the detail page link download has failed.
The above steps determine whether the detail page link download has failed by judging the second detail function download return value.
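Steps S1122 to S1124 can be sketched as below. The description does not say how filtered links are stored for later re-detection, so the recheck_queue list is an assumption, as is the function name handle_second_detail_result.

```python
from typing import List, Optional


def handle_second_detail_result(url: str,
                                result: Optional[str],
                                recheck_queue: List[str]) -> str:
    """Apply steps S1122-S1124 to one second-detail-function result."""
    if result is None:                    # S1122/S1123: possible false alarm
        recheck_queue.append(url)         # assumed: keep the link for a later re-check
        return "link filtered for later re-detection"
    return "detail page link download failed"   # S1124


# Usage with assumed data.
queue: List[str] = []
first = handle_second_detail_result("http://example.com/d1", None, queue)
second = handle_second_detail_result("http://example.com/d2", "<html></html>", queue)
```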
The following detailed description of the no-data detection is made with a specific example, as shown in fig. 6.
It can be seen in fig. 6 that the template change system is developed based on a framework, which is divided into two-layer templates and three-layer templates.
First, the configuration id to be detected is acquired and added to the detection queue, the code information of the configuration is acquired, the urls in start_url are retrieved, and a download is requested, reaching the step resp = self.download(url) in fig. 6. If resp is None, the result is 'list download failure'. Otherwise the next check is made: whether the status code resp.status_code is equal to 200. If it is not, the result is 'list page download status code (obtained according to the actual response)'. If it is, the next call url_list, callback, _ = pointer.part(resp) is made. If the length of the returned url_list is <= 0, the result is 'list page change'. If the length is greater than 0, it is judged whether callback has a value: if it has, the template is a three-layer template; if not, detail page parsing is entered directly.
If callback has no value, parsing of the detail page begins. A detail page url is requested with resp = self.download(url_list[0]), and it is checked whether resp contains data and whether its status code is 200. If both conditions are met, the returned resp object is passed to the method res = client_part_detail_page(resp, url) for testing. If no res is returned, the result is 'detail page change'. If res is returned, the title field and the content field are then checked: if both fields can be parsed, the configuration is normal and 'no template change has occurred'; otherwise the result is 'detail page template change'.
Returning to the case where callback has a value: resp = self.download [0]) is entered to check whether resp has returned data; if not, the result is 'list download failure'. If the condition is satisfied, the returned resp object is passed to the method res = client_detail_page(resp, url) for testing. If no res is returned, the result is 'detail page change'. If res is returned, the title field and the content field are then checked: if both fields can be parsed, the configuration is normal; otherwise the result is 'detail page template change'.
According to the framework, the method can detect and judge whether a website has changed or whether the problem lies in the spider file. This reduces manual review, removes the need to test and detect configurations manually one by one, lowers labor cost, and allows problems to be found and resolved in time.
As an exemplary embodiment, as shown in FIG. 7, the method further includes steps S14-S17.
Step S14: search for crawler scripts with data in the crawler script library.
As an exemplary embodiment, a crawler script with data is a script that can crawl data but cannot guarantee the correctness of the acquired data, so the acquired data needs to be checked to ensure the accuracy of the crawled data.
Step S15: store the configuration ids corresponding to all the found crawler scripts with data into a data crawler script database.
As an exemplary embodiment, the configuration ids corresponding to all the found crawler scripts with data are stored in the data crawler script database, so that detailed data detection can subsequently be performed on those crawler scripts.
Step S16: obtain the corresponding data source according to the configuration id in the data crawler script database.
As an exemplary embodiment, the data source is the actual data collected by the crawler. The collected data corresponding to a configuration id is found according to the configuration id in the data crawler script database, and that collected data is the data source. Specifically, after data collection is completed, the collected data and the configuration id form a one-to-one mapping, so the data source corresponding to a configuration id can be found through that configuration id.
Step S17: monitoring each field in the data source according to field verification information, wherein the field verification information comprises at least one of time error monitoring, field character length monitoring, title length monitoring, configuration grouping error monitoring, channel error monitoring, website name error monitoring, website domain name error monitoring, grouping error monitoring, field error monitoring and detail page URL monitoring.
As an exemplary embodiment, the data detection process and the specific monitored fields are shown in fig. 8. First, a data source is obtained and the required website information is screened out; then each field is monitored according to the field error rules:
1) Time error monitoring: when the news release time is greater than or equal to the current time, the time is judged to be wrong.
2) Field character length monitoring and title length monitoring: whether an error has occurred is judged according to length rules for the monitored channel, title, sitename, and the like.
3) Garbled text: garbled text is checked against keywords that indicate garbling.
4) Configuration grouping: the configuration group is determined according to whether the website name is Chinese and according to the domain name.
5) Website name correctness: the basic website information uploaded outside the platform is compared with the website information in the platform's internal code, using labels such as the website domain name, INFO_Flag, and website name.
6) An error is reported when the author and source are empty.
7) Field errors: newly added fields such as content_xml and list_page_url are missing.
8) Time filtering: it is judged whether the time filtering field is missing or whether the time filtering exceeds the required number of collection days, since missing or over-long time filtering affects the collection of historical data.
9) Detail page URL monitoring: the detail page url is checked (Chinese characters, special characters, and the like are excluded), since over-long splicing may occur in the code.
10) The view count and comment count of forum websites are relatively error-prone, and the framework default is -1, so it is detected whether the sum of the two fields is not equal to -1.
11) Grouping error monitoring: according to industry regulations, web pages are divided into several groups, such as 01 for news, 02 for posts, and 07 for videos, and whether the grouping is correct is determined according to the main content of the website.
The method thus automates detection of crawler data quality by monitoring multiple return values and checking multiple fields.
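A few of the field rules above can be sketched as a single validation function. This is an illustration under stated assumptions, not the patented rule set: the record keys publish_time, title, author, and source, the 200-character title limit, and the reading of rule 6 as "both author and source empty" are all assumptions; only content_xml and list_page_url are field names taken from the description.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def verify_record(record: Dict[str, Any], now: datetime,
                  max_title_len: int = 200) -> List[str]:
    """Return the list of rule violations found in one collected record."""
    errors: List[str] = []
    # Rule 1: a release time at or after the current time is wrong.
    if record["publish_time"] >= now:
        errors.append("time error")
    # Rule 2 (title part): the title must be non-empty and within the limit.
    if not (0 < len(record["title"]) <= max_title_len):
        errors.append("title length error")
    # Rule 6: assumed to mean that author and source are both empty.
    if not record.get("author") and not record.get("source"):
        errors.append("author and source empty")
    # Rule 7: newly added fields must be present.
    if "content_xml" not in record or "list_page_url" not in record:
        errors.append("missing field")
    return errors


# Usage with assumed sample data and a fixed reference time.
now = datetime(2020, 11, 13, 12, 0, 0)
good = {"publish_time": now - timedelta(days=1), "title": "Hello",
        "author": "a", "source": "", "content_xml": "<p/>",
        "list_page_url": "http://example.com/list"}
bad = {"publish_time": now + timedelta(hours=1), "title": "",
       "author": "", "source": ""}
good_errors = verify_record(good, now)
bad_errors = verify_record(bad, now)
```

Returning a list of violations rather than stopping at the first one lets a single pass report every failing field for a configuration id.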
As an exemplary embodiment, the method further comprises: acquiring a template change result and displaying it, so that results can be viewed conveniently, problems can be located accurately, and corrections can be made in time. A detailed display is shown in fig. 9.
As an exemplary embodiment, the method further comprises: acquiring a field detection result and displaying it, so that results can be viewed conveniently, problems can be located accurately, and corrections can be made in time. A detailed display is shown in fig. 10.
In this embodiment, a system for monitoring a change of a crawler template is further provided, and the system is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or a combination of software and hardware are also possible and contemplated.
This embodiment also provides a monitoring system for crawler template change, as shown in fig. 11, including: the device comprises a first processing module 1, a second processing module 2, a third processing module 3, a fourth processing module 4, a fifth processing module 5, a first judging module 6, a second judging module 7, a sixth processing module 8, a third judging module 9, a seventh processing module 10, a fourth judging module 11, an eighth processing module 12 and a ninth processing module 13.
A first processing module 1, configured to search for data-free crawler scripts in a crawler script library, where the crawler script library includes data-free crawler scripts, which cannot crawl data, and crawler scripts with data, which can crawl data, and each crawler script uniquely corresponds to a configuration id; the details are described with reference to step S1.
The second processing module 2 is used for storing the configuration ids corresponding to all the searched data-free crawler scripts into a data-free crawler script database; the details are described with reference to step S2.
The third processing module 3 is used for adding the configuration id in the data-free crawler script database into a detection queue for data-free detection; the details are described with reference to step S3.
The fourth processing module 4 is configured to obtain code information of a crawler script corresponding to the configuration id according to the configuration id; the details are described with reference to step S4.
A fifth processing module 5, configured to obtain a URL set in the crawler script according to the code information, and download each URL in the URL set in a traversal manner to obtain a result value of each URL download; the details are described with reference to step S5.
A first judging module 6, configured to judge whether the URL download result value is null; the details are described with reference to step S6.
A second judging module 7, configured to, if the URL download result value is not null, judge whether the URL request response status code is equal to a first preset status code; the details are described with reference to step S7.
A sixth processing module 8, configured to perform crawler analysis on the URL to obtain a URL detail link number if the URL request response status code is equal to the first preset status code; the details are described with reference to step S8.
A third judging module 9, configured to judge whether the number of URL detail links is greater than zero; the details are described with reference to step S9.
A seventh processing module 10, configured to indicate that the list page template is changed if the number of URL detail links is less than or equal to zero; the details are described with reference to step S10.
A fourth judging module 11, configured to judge whether a callback return value of a callback function in the three-layer template has a value if the number of URL detail links is greater than zero; the details are described with reference to step S11.
An eighth processing module 12, configured to perform two-layer template detection if the callback return value has no value, in which case the URL has no secondary link; the details are described with reference to step S12.
A ninth processing module 13, configured to perform three-layer template detection if the callback return value has a value, in which case the URL has a secondary link; the details are described with reference to step S13.
As an exemplary embodiment, the eighth processing module includes: a first processing unit, configured to download the list link if the callback return value has no value, to obtain a list link download return value, and the detailed content refers to step S1201; a first judgment unit, configured to judge whether the list link download return value is empty, where the detailed content refers to step S1202; a second determining unit, configured to determine whether the list link request status code is equal to a second preset status code if the list link download return value is not null, where the detailed content refers to step S1203; and a second processing unit, configured to enter a detail page for detection if the list link request status code is equal to the second preset status code, where the detailed content refers to step S1204.
Optionally, the ninth processing module includes: a third processing unit, configured to download a secondary list page link if the callback return value has a value, to obtain a secondary list page link download return value, where the detailed content refers to step S1301; a third determining unit, configured to determine whether the secondary list page link download return value is empty, where the detailed content refers to step S1302; a fourth determining unit, configured to determine whether the secondary list page request status code is equal to a third preset status code if the secondary list page link download return value is not null, where the detailed content refers to step S1303; a fourth processing unit, configured to enter secondary crawler analysis to obtain the number of detail links of the secondary list page if the secondary list page request status code is equal to the third preset status code, where the detailed content refers to step S1304; a fifth judging unit, configured to judge whether the number of detail links of the secondary list page is greater than zero, where the detailed content refers to step S1305; a fifth processing unit, configured to indicate that the secondary list page template is changed if the number of detail links of the secondary list page is less than or equal to zero, where the detailed content refers to step S1306; a sixth processing unit, configured to traverse and download the secondary list detail page links to obtain a secondary list detail page download return value if the number of detail links of the secondary list page is greater than zero, where the detailed content refers to step S1307; a sixth judging unit, configured to judge whether the secondary list detail page download return value is empty, where the detailed content refers to step S1308; a seventh processing unit, configured to determine whether the secondary list detail page request status code is equal to a fourth preset status code if the secondary list detail page download return value is not null, where the detailed content refers to step S1309; and an eighth processing unit, configured to enter the detail page for detection if the secondary list detail page request status code is equal to the fourth preset status code, where the detailed content refers to step S1310.
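The three-layer check (steps S1301 to S1310) adds a secondary list page between the list page and the detail pages. The sketch below is one possible reading of those steps; the `Response` type, the injected callables, the result labels, and the default status code of 200 for the third and fourth preset status codes are assumptions, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Response:
    """Hypothetical download result: HTTP status code plus page body."""
    status_code: int
    body: str


Downloader = Callable[[str], Optional[Response]]


def check_three_tier(
    secondary_list_url: str,
    download: Downloader,
    parse_detail_links: Callable[[Response], List[str]],
    detect_detail_page: Callable[[Response], str],
    list_status: int = 200,    # assumed "third preset status code"
    detail_status: int = 200,  # assumed "fourth preset status code"
) -> List[str]:
    """Three-layer template check; returns one result label per outcome."""
    response = download(secondary_list_url)            # S1301
    if response is None:                               # S1302
        return ["secondary_list_download_empty"]
    if response.status_code != list_status:            # S1303
        return ["secondary_list_unexpected_status"]
    detail_links = parse_detail_links(response)        # S1304
    if len(detail_links) <= 0:                         # S1305 / S1306
        return ["secondary_list_template_changed"]
    results = []
    for link in detail_links:                          # S1307: traverse detail links
        detail = download(link)
        if detail is None:                             # S1308
            results.append("detail_download_empty")
        elif detail.status_code != detail_status:      # S1309
            results.append("detail_unexpected_status")
        else:
            results.append(detect_detail_page(detail)) # S1310
    return results
```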
Optionally, the second processing unit or the eighth processing unit includes: a first processing subunit, configured to download the first detail function to obtain a download return value of the first detail function, where the detailed content refers to step S1111; a determining subunit, configured to determine whether the first detail function download return value has a value, where details refer to step S1112; a second processing subunit, configured to indicate that the detail template is changed if the first detail function download return value has no value, and refer to step S1113 for details; a third processing subunit, configured to detect a title field and a content field if the first detail function download return value has a value, and refer to the detailed content in step S1114; a fourth processing subunit, configured to indicate that the detail template is not changed when the title field and the content field can be resolved, and refer to step S1115 for details; a fifth processing subunit, configured to indicate that the detail template is changed when the title field and the content field cannot be resolved, where the detailed content is described in step S1116.
Optionally, the second processing unit or the eighth processing unit further includes: a sixth processing subunit, configured to indicate that parsing of the crawler script has failed if a parsing exception occurs while the first detail function is being downloaded, where the detailed content refers to step S1117.
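The detail page detection described by the subunits above (steps S1111 to S1117) can be sketched as a single function. The name `run_detail_function`, the dictionary keys `title` and `content`, and the returned labels are hypothetical; the sketch assumes the first detail function returns a mapping of parsed fields, or nothing when parsing yields no result.

```python
from typing import Callable, Mapping, Optional


def check_detail_template(
    detail_url: str,
    run_detail_function: Callable[[str], Optional[Mapping[str, str]]],
) -> str:
    """Classify one detail page as changed, unchanged, or unparseable."""
    try:
        parsed = run_detail_function(detail_url)        # S1111: run the first detail function
    except Exception:                                   # S1117: parsing raised an exception
        return "script_parse_failed"
    if not parsed:                                      # S1112 / S1113: no return value
        return "detail_template_changed"
    if parsed.get("title") and parsed.get("content"):   # S1114 / S1115: both fields resolved
        return "detail_template_unchanged"
    return "detail_template_changed"                    # S1116: a field could not be resolved
```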
Optionally, the system further comprises: a tenth processing module, configured to download the second detail function to obtain a second detail function download return value, where the detailed content refers to step S1121; a fifth judging module, configured to judge whether the second detail function download return value is empty, where the detailed content refers to step S1122; an eleventh processing module, configured to perform link filtering if the second detail function download return value is null, where the detailed content refers to step S1123; and a twelfth processing module, configured to indicate that the downloading of the detail page link fails if the second detail function download return value is not null, where the detailed content refers to step S1124.
Optionally, the system further comprises: a thirteenth processing module, configured to search a crawler script library for crawler scripts with data, where the detailed content refers to step S14; a fourteenth processing module, configured to store the configuration ids corresponding to all the found data-containing crawler scripts in the data-containing crawler script database, where the detailed content refers to step S15; a fifteenth processing module, configured to obtain a corresponding data source according to the configuration id in the data-containing crawler script database; and a sixteenth processing module, configured to monitor each field in the data source according to field verification information, where the field verification information includes at least one of time error monitoring, field character length monitoring, title length monitoring, configuration grouping error monitoring, channel error monitoring, website name error monitoring, website domain name error monitoring, grouping error monitoring, field error monitoring, and detail page URL monitoring.
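A few of the field checks listed above (title length, time error, detail page URL, website domain name) can be illustrated as follows. The passage does not state the concrete rules, so the record keys, the 200-character title limit, the one-day tolerance for future timestamps, and the ISO 8601 time format are all assumptions chosen for the sketch.

```python
from datetime import datetime, timedelta
from typing import Dict, List
from urllib.parse import urlparse

MAX_TITLE_LENGTH = 200  # hypothetical threshold, not taken from the patent


def verify_record(record: Dict[str, str], expected_domain: str, now: datetime) -> List[str]:
    """Return the names of the field checks that a crawled record fails."""
    problems = []

    # Title length monitoring: empty or overly long titles are suspicious.
    title = record.get("title", "")
    if not title or len(title) > MAX_TITLE_LENGTH:
        problems.append("title_length")

    # Time error monitoring: the timestamp must parse and must not lie in the future.
    # Assumes naive ISO 8601 timestamps; a mismatch with `now` is treated as an error.
    try:
        published = datetime.fromisoformat(record.get("publish_time", ""))
        if published > now + timedelta(days=1):
            problems.append("time_error")
    except (ValueError, TypeError):
        problems.append("time_error")

    # Detail page URL and website domain name monitoring.
    parsed = urlparse(record.get("url", ""))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("detail_url")
    elif not (parsed.netloc == expected_domain
              or parsed.netloc.endswith("." + expected_domain)):
        problems.append("website_domain")

    return problems
```

Returning a list of failed check names, rather than a single boolean, lets a monitoring job report which field of the data source went wrong.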
The monitoring system for crawler template alteration in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory executing one or more software or fixed programs, and/or other devices that may provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, as shown in fig. 12, where the electronic device includes one or more processors 141 and a memory 142, and one processor 141 is taken as an example in fig. 12.
The electronic device may further include: an input device 143 and an output device 144.
The processor 141, the memory 142, the input device 143, and the output device 144 may be connected by a bus or other means, and the bus connection is exemplified in fig. 12.
Processor 141 may be a Central Processing Unit (CPU). The Processor 141 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 142, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the monitoring method for crawler template changes in the embodiments of the present application. The processor 141 executes various functional applications and data processing of the server by running non-transitory software programs, instructions and modules stored in the memory 142, that is, implements the monitoring method for crawler template change of the above-described method embodiment.
The memory 142 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a processing device operated by the server, and the like. Further, the memory 142 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 142 optionally includes memory located remotely from processor 141, which may be connected to a network connection device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 143 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 144 may include a display device such as a display screen.
One or more modules are stored in memory 142 that, when executed by the one or more processors 141, perform the monitoring method for crawler template changes as shown in FIGS. 1-7.
It will be understood by those skilled in the art that all or part of the processes of the method according to the above embodiments may be implemented by instructing relevant hardware through a computer program, and the executed program may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the monitoring method for crawler template change as described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method for monitoring changes of a crawler template is characterized by comprising the following steps:
searching a crawler script library for crawler scripts without data, wherein the crawler script library comprises crawler scripts without data, which cannot crawl data, and crawler scripts with data, which can crawl data, and each crawler script uniquely corresponds to a configuration id;
storing the configuration id corresponding to all the searched crawler scripts without data into a crawler script database without data;
adding the configuration id in the data-free crawler script database into a detection queue for data-free detection;
obtaining code information of a crawler script corresponding to the configuration id according to the configuration id;
acquiring a URL set in the crawler script according to the code information, and downloading each URL in the URL set in a traversing manner to acquire a downloading result value of each URL;
judging whether the URL downloading result value is null or not;
if the URL downloading result value is not null, judging whether the URL request response status code is equal to a first preset status code or not;
if the URL request response status code is equal to the first preset status code, crawler analysis is carried out on the URL to obtain the number of URL detail links;
judging whether the number of the URL detail links is greater than zero;
if the number of the URL detail links is less than or equal to zero, indicating that the list page template is changed;
if the number of the URL detail links is larger than zero, judging whether the callback return value of the callback function in the three-layer template has a value or not;
if the callback return value has no value, the URL has no secondary link, and two-layer template detection is carried out;
and if the callback return value has a value, the URL has a secondary link, and three-layer template detection is carried out.
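The branching recited in claim 1, from downloading a URL to choosing between two-layer and three-layer detection, can be sketched as below. This is an illustrative reading only: the `Response` type, the injected callables, the result labels, and 200 as the "first preset status code" are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Response:
    """Hypothetical download result: HTTP status code plus page body."""
    status_code: int
    body: str


def check_list_page(
    url: str,
    download: Callable[[str], Optional[Response]],
    parse_detail_links: Callable[[Response], List[str]],
    callback: Callable[[Response], Optional[List[str]]],
    expected_status: int = 200,  # assumed "first preset status code"
) -> str:
    """Decide the outcome for one URL taken from a data-free crawler script."""
    response = download(url)
    if response is None:                            # download result value is null
        return "download_empty"
    if response.status_code != expected_status:     # status code differs from the preset
        return "unexpected_status"
    links = parse_detail_links(response)            # crawler analysis of the list page
    if len(links) <= 0:
        return "list_template_changed"              # no detail links: list template changed
    if callback(response):                          # callback has a value: secondary links exist
        return "three_tier_detection"
    return "two_tier_detection"                     # no secondary links
```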
2. The method for monitoring the change of the crawler template according to claim 1, wherein if the callback return value has no value, the URL has no secondary link, and the step of performing two-tier template detection includes:
if the callback return value has no value, downloading the list link to obtain a list link downloading return value;
judging whether the list link download return value is null or not;
if the list link download return value is not null, judging whether the list link request status code is equal to a second preset status code;
and if the list link request state code is equal to the second preset state code, entering a detail page for detection.
3. The method for monitoring the change of the crawler template according to claim 1, wherein if the callback return value has a value, the URL has a secondary link, and the step of performing three-tier template detection comprises:
if the callback return value has a value, downloading a secondary list page link to obtain a secondary list page link downloading return value;
judging whether the link downloading return value of the secondary list page is empty or not;
if the secondary list page link download return value is not null, judging whether the secondary list page request status code is equal to a third preset status code;
if the second-level list page request state code is equal to the third preset state code, entering second-level crawler analysis to obtain the number of detail links of the second-level list page;
judging whether the number of detail links of the secondary list page is greater than zero or not;
if the number of the detail links of the secondary list page is less than or equal to zero, indicating that the template of the secondary list page is changed;
if the number of the secondary list detail links is larger than zero, traversing and downloading the secondary list detail page links to obtain a secondary list detail page downloading return value;
judging whether the download return value of the secondary list detail page is empty or not;
if the download return value of the secondary list detail page is not null, judging whether the request status code of the secondary list detail page is equal to a fourth preset status code;
and if the second-level list detail page request status code is equal to a fourth preset status code, entering a detail page for detection.
4. The method for monitoring the change of the crawler template according to claim 2 or 3, wherein the step of entering the detail page for detection comprises:
downloading the first detail function to obtain a first detail function downloading return value;
judging whether the first detail function downloading return value has a value or not;
if the first detail function download return value has no value, the detail template is changed;
if the first detail function download return value has a value, detecting a title field and a content field;
when the title field and the content field can be analyzed, the detail template is not changed;
when the title field and the content field cannot be resolved, the detail template is changed.
5. The method for monitoring the change of the crawler template according to claim 2 or 3, wherein the step of entering the detail page for detection further comprises:
if a parsing exception occurs during the downloading of the first detail function, indicating that parsing of the crawler script has failed.
6. The method for monitoring changes to a crawler template according to claim 2, wherein if the list link download return value is null and/or the list link request status code is not equal to the second predetermined status code, further comprising:
downloading a second detail function to obtain a second detail function downloading return value;
judging whether the download return value of the second detail function is null or not;
if the download return value of the second detail function is null, performing link filtering;
and if the download return value of the second detail function is not null, indicating that the link download of the detail page fails.
7. The method for monitoring changes to a crawler template of claim 1, further comprising:
searching a crawler script with data in a crawler script library;
storing the configuration id corresponding to all the searched crawler scripts with data into a crawler script database with data;
obtaining a corresponding data source according to the configuration id in the crawler script database with data;
monitoring each field in the data source according to field verification information, wherein the field verification information comprises at least one of time error monitoring, field character length monitoring, title length monitoring, configuration grouping error monitoring, channel error monitoring, website name error monitoring, website domain name error monitoring, grouping error monitoring, field error monitoring and detail page URL monitoring.
8. A system for monitoring changes to a crawler template, comprising:
a first processing module, configured to search a crawler script library for crawler scripts without data, wherein the crawler script library comprises crawler scripts without data, which cannot crawl data, and crawler scripts with data, which can crawl data, and each crawler script uniquely corresponds to a configuration id;
the second processing module is used for storing the configuration ids corresponding to all the searched data-free crawler scripts into a data-free crawler script database;
the third processing module is used for adding the configuration id in the data-free crawler script database into a detection queue for data-free detection;
the fourth processing module is used for obtaining code information of the crawler script corresponding to the configuration id according to the configuration id;
the fifth processing module is used for obtaining a URL set in the crawler script according to the code information, and downloading each URL in the URL set in a traversing manner to obtain a downloading result value of each URL;
the first judgment module is used for judging whether the URL downloading result value is null or not;
the second judgment module is used for judging whether the URL request response status code is equal to the first preset status code or not if the URL downloading result value is not null;
the sixth processing module is used for performing crawler analysis on the URL to obtain the URL detail link quantity if the URL request response status code is equal to the first preset status code;
the third judging module is used for judging whether the number of the URL detail links is greater than zero;
the seventh processing module is used for indicating the change of the list page template if the number of the URL detail links is less than or equal to zero;
the fourth judging module is used for judging whether the callback return value of the callback function in the three-layer template has a value or not if the number of the URL detail links is greater than zero;
the eighth processing module is used for detecting the two-layer template if the callback return value has no value, and the URL has no secondary link;
and the ninth processing module is used for detecting the three-layer template if the callback return value has a value, and the URL has a secondary link.
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the method of monitoring for crawler template changes of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the method for monitoring crawler template changes according to any one of claims 1-7.
CN202011265722.1A 2020-11-13 2020-11-13 Crawler template change monitoring method and system, electronic equipment and storage medium Active CN112100083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011265722.1A CN112100083B (en) 2020-11-13 2020-11-13 Crawler template change monitoring method and system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112100083A true CN112100083A (en) 2020-12-18
CN112100083B CN112100083B (en) 2021-02-02

Family

ID=73785518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011265722.1A Active CN112100083B (en) 2020-11-13 2020-11-13 Crawler template change monitoring method and system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112100083B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101471818A (en) * 2007-12-24 2009-07-01 北京启明星辰信息技术股份有限公司 Detection method and system for malevolence injection script web page
CN103248625A (en) * 2013-04-27 2013-08-14 北京京东尚科信息技术有限公司 Monitoring method and system for abnormal operation of web crawler
CN104156665A (en) * 2014-07-22 2014-11-19 杭州安恒信息技术有限公司 Web page tampering monitoring method
CN105656707A (en) * 2014-11-18 2016-06-08 阿里巴巴集团控股有限公司 Method and system for testing web crawler
CN105975395A (en) * 2016-05-30 2016-09-28 深圳市华傲数据技术有限公司 Website state reconnaissance method and device
US20170068735A1 (en) * 2015-09-08 2017-03-09 MOLBASE (Shanghai) Biotechnology Co., Ltd . Task-crawling system and task-crawling method for distributed crawler system
CN109298987A (en) * 2017-07-25 2019-02-01 北京国双科技有限公司 A kind of method and device detecting web crawlers operating status
CN110147473A (en) * 2017-08-28 2019-08-20 北京国双科技有限公司 A kind of crawling method and device of crawler
CN111159514A (en) * 2018-11-07 2020-05-15 中移(苏州)软件技术有限公司 Method, device and equipment for detecting task effectiveness of web crawler and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051499A (en) * 2021-03-23 2021-06-29 北京智慧星光信息技术有限公司 Method and system for monitoring data acquisition amount, electronic equipment and storage medium
CN113051499B (en) * 2021-03-23 2023-11-21 北京智慧星光信息技术有限公司 Method, system, electronic equipment and storage medium for monitoring data acquisition quantity
CN113965555A (en) * 2021-10-21 2022-01-21 北京值得买科技股份有限公司 Method, device, equipment and storage medium for downloading parameterized crawler
CN113965555B (en) * 2021-10-21 2024-04-12 北京值得买科技股份有限公司 Parameterized crawler downloading method, parameterized crawler downloading device, parameterized crawler downloading equipment and storage medium

Also Published As

Publication number Publication date
CN112100083B (en) 2021-02-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant