CN106997353B - Method and device for monitoring webpage version change

Info

Publication number
CN106997353B
CN106997353B CN201610045870.XA CN201610045870A CN106997353B CN 106997353 B CN106997353 B CN 106997353B CN 201610045870 A CN201610045870 A CN 201610045870A CN 106997353 B CN106997353 B CN 106997353B
Authority
CN
China
Prior art keywords
website
web page
monitored
webpage
link address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610045870.XA
Other languages
Chinese (zh)
Other versions
CN106997353A (en)
Inventor
张祎博
兰光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201610045870.XA
Publication of CN106997353A
Application granted
Publication of CN106997353B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention discloses a method and a device for monitoring webpage version change, and relates to the technical field of the internet. A version change is detected by judging changes in the webpage link addresses and in the fixed structure identifiers, which improves the efficiency and accuracy of monitoring webpage version changes. The main technical scheme of the invention is as follows: detecting whether the webpage link address of the website to be monitored changes; if not, judging whether the number of webpage link addresses falls within a preset range; if not, determining that the web page of the website to be monitored has undergone a version change. The invention is mainly used for monitoring webpage version change.

Description

Method and device for monitoring webpage version change
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for monitoring webpage version change.
Background
In order to obtain data from internet websites in batches, the data can be crawled with web crawler technology. A crawling rule specifies, based on the HTML source code, that data located at a fixed node is crawled from pages of the same type on the same website; to obtain the data, the path of the page node that contains it must be specified in the crawling program. When updated content is crawled from pages of the same type on the same website, a fixed crawling rule only works as long as the page layout stays fixed. After a web page undergoes a version change, the crawling program must be modified before it can crawl correct data again, so monitoring webpage version changes is an important subject.
Currently, whether a web page has undergone a version change is usually determined by manually analyzing whether the crawled data is correct. However, a long time passes between the version change itself, the manual discovery that the crawled data is wrong, and the confirmation that a version change has occurred. Monitoring of version changes is therefore inefficient, the wrong data that was crawled has to be corrected by hand, and the efficiency and accuracy of crawling data are low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and a device for monitoring webpage version change, which mainly aim to improve the efficiency of monitoring webpage version changes and the efficiency and accuracy of crawling web page data.
According to the above technical scheme, the method for monitoring webpage version change provided by the invention comprises the following steps:
detecting whether the webpage link address of the website to be monitored changes;
if not, judging whether the number of the webpage link addresses meets a preset range or not;
if not, determining that the web page of the website to be monitored has version change.
According to the above technical scheme, the invention provides a device for monitoring webpage version change, which comprises:
the detection unit is used for detecting whether the webpage link address of the website to be monitored changes;
the judging unit is used for judging whether the number of the webpage link addresses accords with a preset range or not if the detecting unit detects that the webpage link addresses of the website to be monitored do not change;
and the determining unit is used for determining that the web page of the website to be monitored has version change if the judging unit judges that the number of the web page link addresses does not conform to the preset range.
The technical scheme provided by the embodiments of the invention has at least the following advantages:
The embodiments of the invention provide a method and a device for monitoring webpage version change. In the prior art, a version change is identified by manually analyzing whether the crawled data is correct. By contrast, the embodiments detect the webpage link addresses of the monitored website and the fixed structure identifiers of its web pages to determine whether a version change has occurred. If a change in the webpage link addresses or the fixed structure identifiers is detected, prompt information about the version change is output so that the crawling program can be revised. This avoids manual analysis and allows the crawling program to be revised as soon as the version change occurs, which improves both the efficiency of monitoring webpage version changes and the efficiency and accuracy of crawling web page data.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a method for monitoring webpage version change according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another method for monitoring webpage version change according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating a device for monitoring webpage version change according to an embodiment of the present invention;
FIG. 4 is a block diagram illustrating another device for monitoring webpage version change according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
An embodiment of the invention provides a method for monitoring webpage version change. As shown in FIG. 1, the method comprises the following steps:
101. Detect whether the webpage link address of the website to be monitored changes.
Whether a webpage link address has changed can be detected by checking whether its letters, symbols and other characters differ from those of the original webpage link address. Alternatively, the path information of the webpage link address can be extracted with XPath and then matched with a regular expression. XPath, the XML Path Language, is a language for locating parts of an XML document (XML itself being a subset of the Standard Generalized Markup Language). The embodiment of the invention is not particularly limited in this respect. For example, if the original webpage link address recorded for the monitored website is "www.abcde.aa.com", the current webpage link address "www.abcde.ac.com" is compared with it to detect whether the address has changed.
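The two detection options in step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the regular expression and the rule that a change means an original address is no longer present are all assumptions. Python's standard ElementTree is used because it supports a subset of XPath, but it requires well-formed markup, so real HTML pages would usually need a dedicated HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern for the link addresses of interest (not from the patent).
LINK_PATTERN = re.compile(r"^www\.abcde\.[a-z]+\.com$")


def extract_links(page_source: str, xpath: str = ".//a") -> list[str]:
    """Collect href values at the nodes selected by `xpath`, filtered by regex.

    ElementTree only understands a subset of XPath and well-formed markup.
    """
    root = ET.fromstring(page_source)
    hrefs = [node.get("href") for node in root.findall(xpath)]
    return [h for h in hrefs if h and LINK_PATTERN.match(h)]


def links_changed(original: list[str], current: list[str]) -> bool:
    """Assumed rule: a change means an original address is missing from the page."""
    return any(link not in current for link in original)
```

With the addresses from the example, comparing "www.abcde.aa.com" against a page that now links to "www.abcde.ac.com" reports a change, because the character-level comparison fails.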
102. If no change in the webpage link address of the website to be monitored is detected, judge whether the number of webpage link addresses falls within a preset range.
The preset range may be set according to the number of webpage link addresses that a web page can display, and should be close to the original number of webpage link addresses. For example, if the original number of webpage link addresses is 10, the preset range may be set to 8 to 12. Judging whether the number of webpage link addresses falls within the preset range makes it possible to discover immediately a version change that shows up as too few directory page link addresses or content page link addresses, for example when a small number of pages have expired, are under maintenance or have an incorrect structure, which improves the accuracy of the crawled web page data.
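The range check in step 102 amounts to a simple bounds test. The sketch below assumes the range is built symmetrically around the original count with a tolerance of 2, which reproduces the 8 to 12 example; the patent does not prescribe how the range is derived, so `preset_range` and its `tolerance` parameter are hypothetical.

```python
def preset_range(original_count: int, tolerance: int = 2) -> range:
    """Build an inclusive range around the original number of link addresses.

    The symmetric tolerance is an assumption made for illustration only.
    """
    low = max(0, original_count - tolerance)
    return range(low, original_count + tolerance + 1)


def count_in_range(link_count: int, allowed: range) -> bool:
    """Return True if the current number of link addresses is acceptable."""
    return link_count in allowed
```

For an original count of 10, `preset_range(10)` covers 8 to 12 inclusive, so a page showing 7 link addresses would be flagged.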
For the embodiment of the present invention, the step parallel to step 102 is: if a change in the webpage link address of the website to be monitored is detected, determine that the web page of the website to be monitored has undergone a version change.
103. If the number of webpage link addresses does not fall within the preset range, determine that the web page of the website to be monitored has undergone a version change.
Once a version change of the monitored website has been determined, the crawling program can be revised, which improves the accuracy of the crawled web page data.
For the embodiment of the present invention, a specific application scenario may be as follows, without being limited to it. In the website to be monitored, the original webpage link address is www.abcde.ac.com and there are 9 similar webpage link addresses. The current webpage link address www.abcde.ac.com is detected to be unchanged, and the other 8 webpage link addresses are also detected to be unchanged. The current number of webpage link addresses is 10 and the preset range is 8 to 11, so the number falls within the preset range and it is confirmed that the web page of the website to be monitored has not undergone a version change.
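Steps 101 to 103 of FIG. 1 can be combined into one decision function. The sketch below is one possible reading of the flow, under the assumption that an address counts as changed when an original address is no longer found; the function name and the URL strings used to exercise it are hypothetical.

```python
from typing import Sequence


def has_version_change(
    original_links: Sequence[str],
    current_links: Sequence[str],
    low: int,
    high: int,
) -> bool:
    """Decide whether the monitored page has undergone a version change."""
    # Step 101: any original address that is no longer present counts as a change.
    if any(link not in current_links for link in original_links):
        return True
    # Steps 102 and 103: addresses unchanged, so check the count against the range.
    return not (low <= len(current_links) <= high)
```

In the scenario above, unchanged addresses with a current count of 10 and a preset range of 8 to 11 yield no version change.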
The method for monitoring webpage version change provided by the embodiment of the invention first detects whether the webpage link address of the website to be monitored has changed; if not, it judges whether the number of webpage link addresses falls within a preset range; if not, it determines that the web page of the website to be monitored has undergone a version change. In the prior art, a version change is identified by manually analyzing whether the crawled data is correct. By contrast, the embodiment detects the webpage link addresses of the monitored website and the fixed structure identifiers of its web pages to determine whether a version change has occurred. If a change in the webpage link addresses or the fixed structure identifiers is detected, prompt information about the version change is output so that the crawling program can be revised. This avoids manual analysis and allows the crawling program to be revised as soon as the version change occurs, which improves both the efficiency of monitoring webpage version changes and the efficiency and accuracy of crawling web page data.
An embodiment of the invention provides another method for monitoring webpage version change. As shown in FIG. 2, the method comprises the following steps:
201. Obtain the webpage link address of the website to be monitored at a preset time interval.
The webpage link addresses of the monitored website comprise the directory page link addresses and the content page link addresses of the website. The preset time interval is set according to specific service requirements and may be shorter than the interval between two consecutive runs of the website crawling program, so that a version change of the monitored website is discovered before the crawling program runs again. The content page link addresses can be extracted from the directory page corresponding to a directory page link address in the website to be monitored. For example, a news website includes directory page link addresses for different news categories, such as international news, social news and financial news. The directory page for international news contains the titles of the news items in that category, and clicking a title opens the link address of the corresponding content page, where the specific content of the news item can be read. As another example, if the interval between two runs of the crawling program is 5 minutes, the preset time interval may be set to 1 minute, that is, the webpage link address of the website to be monitored is obtained every minute. Obtaining the webpage link address at the preset time interval makes it possible to discover a version change in time, before the crawling program is executed, which improves the efficiency of monitoring webpage version changes.
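The timing relationship in step 201 can be sketched as a small polling loop. The names below are hypothetical, and treating a polling interval that is not shorter than the crawl interval as an error is an assumption that is stricter than the text, which only says the interval may be shorter. The `sleep` function is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def validate_poll_interval(poll_s: float, crawl_s: float) -> float:
    """Check that monitoring runs more often than crawling (assumed policy)."""
    if poll_s <= 0 or poll_s >= crawl_s:
        raise ValueError("poll interval must be positive and shorter than the crawl interval")
    return poll_s


def monitor(
    check: Callable[[], bool],
    poll_s: float,
    max_rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `check` every `poll_s` seconds; stop at the first detected change."""
    for round_no in range(max_rounds):
        if check():
            return True
        if round_no < max_rounds - 1:
            sleep(poll_s)  # wait for the next monitoring round
    return False
```

With the figures from the example, `validate_poll_interval(60, 300)` accepts a 1 minute polling interval against a 5 minute crawl interval.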
202. Detect whether the webpage link address of the website to be monitored changes.
Whether a webpage link address has changed can be detected by checking whether its letters, symbols and other characters differ from those of the original webpage link address, or by extracting the path information of the webpage link address with XPath and matching it with a regular expression. The embodiment of the invention is not particularly limited in this respect.
For the embodiment of the present invention, step 202 further includes: configuring fixed structure identifiers for the content in the different areas of the content page corresponding to the content page link address. The fixed structure identifiers can be configured through XPath. The XPath expressions and the corresponding fixed structure identifier information are stored in a configuration file of the monitoring program and loaded when the monitoring program starts. When a page is checked, the monitoring program only needs to verify that the configured fixed structure identifier is present at the position specified by the XPath. Configuring fixed structure identifiers for the content in the different areas of the content page ensures that a version change can still be determined when only the fixed structure identifiers change, which improves the efficiency of monitoring webpage version changes.
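A configuration of this kind could look like the sketch below, which stores XPath expressions and the identifier text expected at each position, loads them at start-up and checks a page against them. The JSON layout, the XPath expressions and the class names in them are all hypothetical; ElementTree's XPath subset is used for illustration and assumes well-formed markup.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical configuration file content: XPath -> expected fixed structure identifier.
CONFIG_JSON = """
{
  ".//div[@class='post']/span[@class='label-time']": "posting time",
  ".//div[@class='post']/span[@class='label-content']": "posting content"
}
"""


def load_identifier_config(text: str) -> dict[str, str]:
    """Load the XPath-to-identifier mapping when the monitor starts."""
    return json.loads(text)


def identifiers_intact(page_source: str, config: dict[str, str]) -> bool:
    """Return True if every configured identifier sits at its configured position."""
    root = ET.fromstring(page_source)
    for xpath, expected in config.items():
        node = root.find(xpath)
        # A missing node or different label text both indicate a structural change.
        if node is None or (node.text or "").strip() != expected:
            return False
    return True
```

A page whose label elements have been renamed or moved fails the check, even if its link addresses and their number are unchanged.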
203. If no change in the webpage link address of the website to be monitored is detected, judge whether the number of webpage link addresses falls within a preset range.
The preset range may be set according to the number of webpage link addresses that a web page can display, and should be close to the original number of webpage link addresses.
For the embodiment of the present invention, the step parallel to step 203 is: if a change in the webpage link address of the website to be monitored is detected, output prompt information about the version change.
204. If the number of webpage link addresses does not fall within the preset range, determine that the web page of the website to be monitored has undergone a version change.
Once a version change of the monitored website has been determined, the crawling program can be revised, which improves the accuracy of the crawled web page data.
For the embodiment of the present invention, step 204 may specifically be: if the number of directory page link addresses does not fall within the preset range, determine that the web page of the website to be monitored has undergone a version change. Judging the number of directory page link addresses first means that, once a version change has been determined at the directory level, the number of content page link addresses no longer needs to be judged, which improves the efficiency of detecting version changes.
For the embodiment of the present invention, step 204 may also specifically be: if the number of content page link addresses falls within the preset range, extract the fixed structure identifiers of the web page corresponding to the content page link address in the website to be monitored, detect whether the fixed structure identifiers have changed, and if so, determine that the web page of the website to be monitored has undergone a version change. The fixed structure identifiers of a web page in the monitored website identify the content in the different areas of its content pages, and are the same across pages of the same type on the same website. A change includes a fixed structure identifier differing from the original one, as well as fixed structure identifiers being added or removed; the embodiment of the invention is not particularly limited in this respect. A change may be detected by checking whether the identifier found at the position given by the XPath is correct, or by checking whether the position specified by the XPath exists in the current web page at all. For example, in a content page of a forum website, as the number of replies grows, the posts, posting times and posting content in the page all change, but the label text "posts", "posting time" and "posting content" does not; these labels are the fixed structure identifiers of the page. Detecting whether the fixed structure identifiers change ensures that a version change that only shows up in the fixed structure identifiers is not overlooked, which improves the accuracy of monitoring webpage version changes.
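The three kinds of change named above (an identifier that differs from the original, an added identifier, a removed identifier) can be told apart with a dictionary comparison. The sketch assumes each snapshot maps a position, such as an XPath string, to the identifier text found there; the function names and the positions used to exercise it are hypothetical.

```python
def identifier_changes(
    original: dict[str, str], current: dict[str, str]
) -> dict[str, list[str]]:
    """Classify differences between two position-to-identifier snapshots."""
    shared = original.keys() & current.keys()
    return {
        "modified": sorted(p for p in shared if original[p] != current[p]),
        "added": sorted(current.keys() - original.keys()),
        "removed": sorted(original.keys() - current.keys()),
    }


def identifiers_changed(original: dict[str, str], current: dict[str, str]) -> bool:
    """Any modified, added or removed identifier counts as a version change."""
    return any(identifier_changes(original, current).values())
```

In the forum example, new replies change the post data but leave the snapshot of labels untouched, so no version change is reported.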
For the embodiment of the present invention, a specific application scenario may be as follows, without being limited to it. In a news website to be monitored, the directory page link address "mini.eastday.com.shehui" of the social news section is obtained at a preset time interval of 1 minute, and fixed structure identifiers are configured for the content page reached from the directory page at the content page link address "mini.eastday.com.shehui.20151225". The directory page link address and the content page link address are detected to be unchanged from those monitored last time. The number of directory page link addresses and content page link addresses in the web page is 20, which falls within the preset range of 18 to 22, so the directory page link address and the content page link address are confirmed to be unchanged. The fixed structure identifiers "content", "comment" and "click times" are then extracted from the content page corresponding to the content page link address; if a change in the fixed structure identifiers is detected, it is confirmed that the web page of the news website has undergone a version change.
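The order of checks in FIG. 2 can be sketched as a chain that stops at the first positive result, so the more expensive extraction of fixed structure identifiers only runs when the cheaper checks pass. The function name, the returned reason strings and the use of one shared range for directory and content page counts are assumptions made for illustration.

```python
from typing import Callable, Optional


def detect_version_change(
    links_changed: bool,
    directory_count: int,
    content_count: int,
    identifiers_changed: Callable[[], bool],
    allowed: range,
) -> Optional[str]:
    """Return a reason string for the version change, or None if none is found."""
    if links_changed:
        return "link address changed"
    if directory_count not in allowed:
        return "directory link count out of range"
    if content_count not in allowed:
        return "content link count out of range"
    # Only now extract and compare the fixed structure identifiers.
    if identifiers_changed():
        return "fixed structure identifier changed"
    return None
```

For the news scenario, unchanged addresses and counts of 20 within `range(18, 23)` lead to the identifier check, which then decides the outcome.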
The other method for monitoring webpage version change provided by the embodiment of the invention first detects whether the webpage link address of the website to be monitored has changed; if not, it judges whether the number of webpage link addresses falls within a preset range; if not, it determines that the web page of the website to be monitored has undergone a version change. In the prior art, a version change is identified by manually analyzing whether the crawled data is correct. By contrast, the embodiment detects the webpage link addresses of the monitored website and the fixed structure identifiers of its web pages to determine whether a version change has occurred. If a change in the webpage link addresses or the fixed structure identifiers is detected, prompt information about the version change is output so that the crawling program can be revised. This avoids manual analysis and allows the crawling program to be revised as soon as the version change occurs, which improves both the efficiency of monitoring webpage version changes and the efficiency and accuracy of crawling web page data.
The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method.
Further, as a specific implementation of the method shown in FIG. 1, an embodiment of the present invention provides a device for monitoring webpage version change. As shown in FIG. 3, the device may include: a detection unit 31, a judging unit 32, and a determining unit 33.
The detection unit 31 is configured to detect whether the webpage link address of the website to be monitored changes;
the judging unit 32 is configured to judge whether the number of webpage link addresses falls within a preset range if the detection unit 31 detects no change in the webpage link address of the website to be monitored;
the determining unit 33 is configured to determine that the web page of the website to be monitored has undergone a version change if the judging unit 32 judges that the number of webpage link addresses does not fall within the preset range.
The device for monitoring webpage version change provided by the embodiment of the invention first detects whether the webpage link address of the website to be monitored has changed; if not, it judges whether the number of webpage link addresses falls within a preset range; if not, it determines that the web page of the website to be monitored has undergone a version change. In the prior art, a version change is identified by manually analyzing whether the crawled data is correct. By contrast, the embodiment detects the webpage link addresses of the monitored website and the fixed structure identifiers of its web pages to determine whether a version change has occurred. If a change in the webpage link addresses or the fixed structure identifiers is detected, prompt information about the version change is output so that the crawling program can be revised. This avoids manual analysis and allows the crawling program to be revised as soon as the version change occurs, which improves both the efficiency of monitoring webpage version changes and the efficiency and accuracy of crawling web page data.
Further, as a specific implementation of the method shown in FIG. 2, an embodiment of the present invention provides another device for monitoring webpage version change. As shown in FIG. 4, the device may include: a detection unit 41, a judging unit 42, a determining unit 43, an obtaining unit 44, and a configuring unit 45.
The detection unit 41 is configured to detect whether the webpage link address of the website to be monitored changes;
the judging unit 42 is configured to judge whether the number of webpage link addresses falls within a preset range if the detection unit 41 detects no change in the webpage link address of the website to be monitored;
the determining unit 43 is configured to determine that the web page of the website to be monitored has undergone a version change if the judging unit 42 judges that the number of webpage link addresses does not fall within the preset range.
Further, the apparatus further comprises:
the obtaining unit 44 is configured to obtain a web page link address of a website to be monitored according to a preset time interval.
The determining unit 43 is specifically configured to determine that the web page of the website to be monitored has undergone a version change if the judging unit 42 judges that the number of directory page link addresses does not fall within the preset range; and/or
the determining unit 43 is specifically configured to, if the judging unit 42 judges that the number of content page link addresses falls within the preset range, extract the fixed structure identifiers of the web page corresponding to the content page link address in the website to be monitored, detect whether the fixed structure identifiers have changed, and if so, determine that the web page of the website to be monitored has undergone a version change.
Further, the apparatus further comprises:
a configuring unit 45, configured to configure fixed structure identifiers for the contents in different areas of the content page corresponding to the content page link address.
The other device for monitoring webpage version change provided by the embodiment of the invention first detects whether the webpage link address of the website to be monitored has changed; if not, it judges whether the number of webpage link addresses falls within a preset range; if not, it determines that the web page of the website to be monitored has undergone a version change. In the prior art, a version change is identified by manually analyzing whether the crawled data is correct. By contrast, the embodiment detects the webpage link addresses of the monitored website and the fixed structure identifiers of its web pages to determine whether a version change has occurred. If a change in the webpage link addresses or the fixed structure identifiers is detected, prompt information about the version change is output so that the crawling program can be revised. This avoids manual analysis and allows the crawling program to be revised as soon as the version change occurs, which improves both the efficiency of monitoring webpage version changes and the efficiency and accuracy of crawling web page data.
The device for monitoring webpage version change comprises a processor and a memory. The detection unit, the judging unit, the determining unit and the other units are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
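One way to picture units stored as program units is a small object that holds them as callables and wires them together. The class and function names below are hypothetical, and the default detection rule (an original address missing from the current page) is an assumption; this is a sketch of the unit structure, not the device itself.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class MonitoringDevice:
    """Holds the detection and judging units as replaceable program units."""

    detection_unit: Callable[[Sequence[str], Sequence[str]], bool]
    judging_unit: Callable[[int], bool]

    def determining_unit(self, original: Sequence[str], current: Sequence[str]) -> bool:
        """Return True if a version change is determined."""
        if self.detection_unit(original, current):
            return True
        return not self.judging_unit(len(current))


def default_device(low: int, high: int) -> MonitoringDevice:
    """Build a device with an assumed detection rule and an inclusive count range."""
    return MonitoringDevice(
        detection_unit=lambda original, current: any(a not in current for a in original),
        judging_unit=lambda count: low <= count <= high,
    )
```

Keeping the units as separate callables mirrors the block diagrams: each unit can be swapped, for instance to add the fixed structure identifier check, without touching the others.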
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and webpage version changes are monitored by adjusting the kernel parameters, which addresses the low efficiency and accuracy of monitoring webpage version changes.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: detecting whether the webpage link address of the website to be monitored changes; if not, judging whether the number of webpage link addresses falls within a preset range; if not, determining that the web page of the website to be monitored has undergone a version change.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A method for monitoring web page revision, characterized by comprising the following steps:
configuring fixed structure identifiers for the contents of different areas in the content page corresponding to a content page link address;
acquiring the web page link addresses of a website to be monitored at a preset time interval, wherein the preset time interval is shorter than the interval between two consecutive runs of the website crawling program;
detecting whether the web page link address of the website to be monitored has changed, wherein the detecting comprises: extracting path information of the web page link address through XPath, and matching it using a regular expression;
if the web page link address has not changed, judging whether the number of web page link addresses falls within a preset range;
if the number does not fall within the preset range, determining that the web page of the website to be monitored has undergone a version change;
if the number falls within the preset range, extracting the fixed structure identifier of the web page corresponding to the content page link address in the website to be monitored, and detecting whether the fixed structure identifier has changed;
and if the fixed structure identifier has changed, determining that the web page of the website to be monitored has undergone a version change.
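For illustration only, the decision flow recited in claim 1 might be sketched in Python as follows. The function name `detect_revision`, its parameters, and the sample URL pattern used with it are hypothetical and do not come from the patent. The claim does not state the outcome when a link address has changed; this sketch assumes that case is also reported as a revision.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Maps a content-page path to the fixed structure identifiers found on it.
StructureIds = Dict[str, Tuple[str, ...]]


def detect_revision(
    link_paths: List[str],
    link_pattern: Pattern[str],
    count_range: Tuple[int, int],
    baseline_ids: StructureIds,
    current_ids: StructureIds,
) -> bool:
    """Return True when the flow, as read here, reports a version change."""
    # Step 1: a link address counts as "changed" when its path no longer
    # matches the expected regular expression.
    if not all(link_pattern.fullmatch(path) for path in link_paths):
        # ASSUMPTION: claim 1 is silent on this branch; treated as a revision.
        return True

    # Step 2: the number of link addresses must fall within the preset range.
    low, high = count_range
    if not (low <= len(link_paths) <= high):
        return True

    # Step 3: compare the fixed structure identifiers with the baseline.
    return current_ids != baseline_ids
```

A caller would run this check at the preset polling interval, passing the paths collected on that tick together with the identifiers recorded when the monitor was configured.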
2. The method for monitoring web page revisions according to claim 1, wherein the web page link address of the website to be monitored comprises a directory page link address and a content page link address of the website.
3. The method for monitoring web page revisions according to claim 2, wherein determining that the web page of the website to be monitored has undergone a revision if the number of web page link addresses is judged not to fall within the preset range comprises:
if the number of directory page link addresses does not fall within the preset range, determining that the web page of the website to be monitored has undergone a version change; and/or
if the number of content page link addresses falls within the preset range, extracting the fixed structure identifier of the web page corresponding to the content page link address in the website to be monitored, detecting whether the fixed structure identifier has changed, and if so, determining that the web page of the website to be monitored has undergone a revision.
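As a non-authoritative sketch of how link paths could be extracted with XPath, matched with regular expressions, and split into directory and content pages as in claims 1 to 3, consider the Python example below. The URL patterns `DIRECTORY_RE` and `CONTENT_RE` and all function names are hypothetical. The standard-library `xml.etree.ElementTree` module supports only a subset of XPath and requires well-formed markup, so a real crawler would likely need a dedicated HTML parser.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple
from urllib.parse import urlparse

# Hypothetical URL patterns for the two kinds of pages named in claim 2.
DIRECTORY_RE = re.compile(r"^/list/\d+/?$")
CONTENT_RE = re.compile(r"^/article/\d+\.html$")


def extract_link_paths(xhtml: str) -> List[str]:
    """Collect the path component of every <a href="..."> in the page."""
    root = ET.fromstring(xhtml)
    return [urlparse(a.get("href")).path for a in root.findall(".//a[@href]")]


def classify_paths(paths: List[str]) -> Dict[str, List[str]]:
    """Group paths into directory pages, content pages, and everything else."""
    groups: Dict[str, List[str]] = {"directory": [], "content": [], "other": []}
    for path in paths:
        if DIRECTORY_RE.match(path):
            groups["directory"].append(path)
        elif CONTENT_RE.match(path):
            groups["content"].append(path)
        else:
            groups["other"].append(path)
    return groups


def directory_count_revised(paths: List[str], count_range: Tuple[int, int]) -> bool:
    """True when the directory-page count falls outside the preset range."""
    low, high = count_range
    count = len(classify_paths(paths)["directory"])
    return not (low <= count <= high)
```

The content-page group returned by `classify_paths` is the set of pages whose fixed structure identifiers would then be checked.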
4. A device for monitoring web page revision, comprising:
the configuration unit is used for configuring fixed structure identifiers for the contents in different areas in the content page corresponding to the content page link address;
the detection unit is used for detecting whether the web page link address of the website to be monitored has changed, wherein the detecting comprises: extracting path information of the web page link address through XPath, and matching it using a regular expression;
the judging unit is used for judging whether the number of the webpage link addresses accords with a preset range or not if the detecting unit detects that the webpage link addresses of the website to be monitored do not change;
if yes, extracting a fixed structure identifier of a webpage corresponding to the content page link address in the website to be monitored, and detecting whether the fixed structure identifier changes;
if so, determining that the web page of the website to be monitored has version change;
the determining unit is used for determining that the web page of the website to be monitored has version change if the judging unit judges that the number of the web page link addresses does not conform to the preset range;
wherein the apparatus further comprises:
and the acquisition unit is used for acquiring the webpage link address of the website to be monitored according to the preset time interval.
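One possible reading of the "fixed structure identifier" handled by the configuration unit and the judging unit is sketched below in Python. It assumes that each page area is configured as an XPath expression and that an identifier has "changed" when its expression no longer resolves in the fetched page; the patent claims do not confirm this interpretation. The mapping `REGION_XPATHS` and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical configuration: one XPath per area of the content page.
REGION_XPATHS: Dict[str, str] = {
    "title": ".//h1[@class='title']",
    "body": ".//div[@id='content']",
    "publish_time": ".//span[@class='time']",
}


def missing_regions(xhtml: str, region_xpaths: Dict[str, str]) -> List[str]:
    """Return the names of configured areas that no longer resolve."""
    root = ET.fromstring(xhtml)
    return [name for name, xpath in region_xpaths.items()
            if root.find(xpath) is None]


def structure_changed(xhtml: str, region_xpaths: Dict[str, str]) -> bool:
    """True when at least one configured area is missing from the page."""
    return bool(missing_regions(xhtml, region_xpaths))
```

Reporting which areas are missing, rather than only a yes/no result, makes it easier to update the crawler configuration after a revision is detected.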
5. The apparatus for monitoring web page revisions according to claim 4, wherein the web page link address of the website to be monitored comprises a directory page link address and a content page link address of the website.
6. The web page revision monitoring apparatus according to claim 5,
the determining unit is specifically configured to determine that the web page of the website to be monitored has undergone a revision if the judging unit judges that the number of directory page link addresses does not fall within the preset range; and/or
the determining unit is specifically configured to, if the judging unit judges that the number of content page link addresses falls within the preset range, extract the fixed structure identifier of the web page corresponding to the content page link address in the website to be monitored, detect whether the fixed structure identifier has changed, and if so, determine that the web page of the website to be monitored has undergone a revision.
7. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device where the storage medium is located is controlled to execute the method for monitoring web page revisions according to any one of claims 1 to 3.
8. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to execute the method for monitoring web page revisions according to any one of claims 1 to 3 when running.
CN201610045870.XA 2016-01-22 2016-01-22 Method and device for monitoring webpage version change Active CN106997353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610045870.XA CN106997353B (en) 2016-01-22 2016-01-22 Method and device for monitoring webpage version change

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610045870.XA CN106997353B (en) 2016-01-22 2016-01-22 Method and device for monitoring webpage version change

Publications (2)

Publication Number Publication Date
CN106997353A CN106997353A (en) 2017-08-01
CN106997353B true CN106997353B (en) 2021-08-10

Family

ID=59428435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610045870.XA Active CN106997353B (en) 2016-01-22 2016-01-22 Method and device for monitoring webpage version change

Country Status (1)

Country Link
CN (1) CN106997353B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717131B (en) * 2018-06-27 2022-07-05 北京国双科技有限公司 Page revising monitoring method and related system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375830A (en) * 2010-08-13 2012-03-14 富士通株式会社 Webpage updating judging method and device as well as website synchronization method and device
CN104166545A (en) * 2014-07-25 2014-11-26 北京搜狗科技发展有限公司 Webpage resource sniffing method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101369253B1 (en) * 2012-05-10 2014-03-06 주식회사 안랩 Apparatus and method for blocking information falsification of web page
CN104182426A (en) * 2013-05-28 2014-12-03 腾讯科技(深圳)有限公司 Display method and display device of update website content

Also Published As

Publication number Publication date
CN106997353A (en) 2017-08-01

Similar Documents

Publication Publication Date Title
CN108270629B (en) Website visitor behavior monitoring method and device
CN108304410B (en) Method and device for detecting abnormal access page and data analysis method
CN108256888B (en) Landing page acquisition method, website server and network advertisement monitoring system
CN107015986B (en) Method and device for crawling webpage by crawler
CN106033450B (en) Advertisement blocking method and device and browser
CN107045507B (en) Webpage crawling method and device
CN106802899B (en) Webpage text extraction method and device
CN110955846A (en) Propagation path diagram generation method and device
CN108874379B (en) Page processing method and device
CN106682044B (en) Data processing method and device
CN105354224B (en) The treating method and apparatus of knowledge data
CN109558548B (en) Method for eliminating CSS style redundancy and related product
CN108255891B (en) Method and device for judging webpage type
CN106657422B (en) Method, device and system for crawling website page and storage medium
CN109582883B (en) Column page determination method and device
CN106997353B (en) Method and device for monitoring webpage version change
CN110929188A (en) Method and device for rendering server page
CN108268775B (en) Web vulnerability detection method and device, electronic equipment and storage medium
CN115297042A (en) Method for detecting consistency of web pages under different networks and related equipment
CN109558549B (en) Method for eliminating CSS style redundancy and related product
CN110889051A (en) Page hyperlink detection method, device and equipment
CN111125087A (en) Data storage method and device
CN110717131B (en) Page revising monitoring method and related system
CN106776654B (en) Data searching method and device
CN103955548A (en) Method and device for rendering web page

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant