WO2016078479A1 - Method and device for monitoring web page changes - Google Patents

Method and device for monitoring web page changes

Info

Publication number
WO2016078479A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
different times
webpage
difference
loaded
Prior art date
Application number
PCT/CN2015/090969
Other languages
French (fr)
Chinese (zh)
Inventor
梁捷
张云龙
钟国英
刘洋
Original Assignee
广州市动景计算机科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市动景计算机科技有限公司
Publication of WO2016078479A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Definitions

  • the present invention relates to the field of mobile internet technologies, and in particular, to a webpage change monitoring method and apparatus.
  • the Internet is known for its rapid iterations. Web applications perform product releases and operational content updates several times a week. Therefore, enterprise web monitoring of products has become one of the focuses of corporate web management.
  • an object of the present invention is to provide a webpage change monitoring method and apparatus that find the differences in a webpage at different times by structurally recording and comparing its page data at those times, and that mark the differences found on screenshots of the webpage, which improves the accuracy of webpage difference comparison and makes webpage monitoring easier.
  • the webpage change monitoring method provided by the invention includes:
  • the process of recording page data loaded at different times on the same webpage into corresponding specific data structures includes:
  • serializing the element label, the element placeholder information, the element attributes and attribute values, and the hash value of the element, and storing them as a specific data structure.
  • the specified DOM nodes are either all DOM nodes in the page or the DOM nodes that remain after filtering all DOM nodes.
  • determining differences between page data after loading the same webpage at different times includes:
  • comparing, by means of the LCS algorithm, the specific data structures recorded at different times to determine the difference between the page data loaded by the same web page at different times.
  • the step of marking the difference respectively on the screenshot of the page at different moments comprises:
  • the difference is marked on a screenshot of the page at different times based on the type of difference and the location of the difference on the page.
  • the webpage change monitoring device includes:
  • a page data recording unit configured to separately record page data loaded by the same webpage at different times; wherein the page data loaded at different times of the same webpage is recorded as a corresponding specific data structure;
  • the page screenshot unit is configured to save a screenshot of a page loaded at different times on the same webpage
  • a difference determining unit configured to compare a specific data structure recorded at different moments, and determine a difference between page data after loading the same webpage at different times;
  • a difference marking unit for marking the differences on page screenshots at different times.
  • the page data recording unit includes:
  • a DOM node access module configured to access a specified DOM node and its child nodes in the page loaded at different times
  • an element information recording module configured to record the element style, element attribute information, element content, element placeholder information, and element label in the specified DOM node;
  • an element style splicing module configured to splice the element style into a string and compute a hash value for the string;
  • an element information storage module configured to serialize the element label, the element placeholder information, the element attributes and attribute values, and the hash value, and store them as a specific data structure.
  • the specified DOM nodes accessed by the DOM node access module are either all DOM nodes in the page or the DOM nodes that remain after filtering all DOM nodes.
  • the difference determining unit compares, by means of the LCS algorithm, the specific data structures recorded at different times to determine the difference between the page data loaded by the same webpage at different times.
  • the difference marking unit marks the difference on a page screenshot of different moments according to the type of the difference and the position of the difference on the page.
  • the present invention provides a computer readable medium having program code executable by a processor, the program code causing a processor to perform the following steps:
  • the method and device for monitoring webpage changes according to the present invention take a screenshot of the page loaded at different times for the same webpage, record the page data loaded at those times as a specific data structure, compare the specific data structures of any two moments to find the differences, and mark the differences on the screenshots of the two moments. This makes it possible to compare accurately the changes of the same webpage at different times, which is convenient for webpage monitoring.
  • FIG. 1 is a schematic flowchart of a webpage change monitoring method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a webpage snapshot storage according to an embodiment of the present invention.
  • FIG. 3 is a schematic flow chart of snapshot comparison according to an embodiment of the present invention.
  • FIGS. 4a to 4d are diagrams showing a difference presentation result according to an embodiment of the present invention.
  • FIG. 5 is a logical structural diagram of a webpage change monitoring apparatus according to an embodiment of the present invention.
  • FIG. 6 is a logical structural diagram of a specific embodiment of a webpage change monitoring apparatus according to an embodiment of the present invention.
  • FIG. 7 is a logical structural diagram of a device terminal according to an embodiment of the present invention.
  • the existing webpage comparison method is implemented based on the pixel comparison of the image after the screenshot, and the false positive rate is high.
  • the present invention records the data structure of the page data of the webpage as a specific data structure and, by comparing the differences between the specific data structures, marks which page data has been modified. The modified page data is the content of the webpage change, which reduces the false positive rate of webpage comparison.
  • the page data is a webpage element, that is, an element constituting the content of the webpage. Webpage elements include text, images, audio, animation, video, and more.
  • FIG. 1 shows a flow of a webpage change monitoring method according to an embodiment of the present invention.
  • a webpage change monitoring method provided by an embodiment of the present invention includes:
  • Step S110 Record page data loaded by the same webpage at different times, and save the screenshots of the same webpage at different times; wherein the page data loaded at different times of the same webpage is recorded as a corresponding specific data structure.
  • the same webpage refers to a webpage with the same URL
  • the page data refers to a webpage element.
  • the data structure of a webpage element is a DOM (Document Object Model) structure. Recording the page data loaded at different times for the same webpage as a corresponding specific data structure therefore means recording the DOM structure of the webpage elements as a specific data structure. This recording need not occur in any particular order.
  • the times at which the page data of the webpage is recorded and the times at which the page screenshots are taken correspond one to one.
  • for example, the page data of the webpage is recorded at a first moment and at a second moment, and screenshots of the page at the first moment and at the second moment are saved respectively.
  • web page elements include element styles, element attribute information, element content, element labels, and element placeholder information.
  • the present invention records the DOM structure of the webpage elements as a specific data structure in order to reduce the amount of computation when elements are compared. In this embodiment the specific data structure is a JSON structure (JavaScript Object Notation, a lightweight data exchange format), but the DOM structure of a webpage element can also be recorded as another specific data structure.
  • because elements in the JSON structure cannot be stored on the hard disk directly, they need to be serialized into a format that the hard disk can store and are then stored on the hard disk.
  • recording the DOM structure of the webpage elements as a JSON structure and serializing it for storage is called webpage snapshot storage. The elements stored on the hard disk are the snapshot data, whose content includes the element style hash value, element attribute information, element content, element label, and element placeholder information.
  • Step S120 Determine the difference between the page data after the same webpage is loaded at different times by comparing the specific data structures recorded at different moments.
  • Comparing the specific data structures recorded at different moments means finding the parts of the JSON structures of the webpage elements that differ between those moments, that is, comparing the snapshot data of different moments, and thereby determining the difference between the page data loaded by the same webpage at different times.
  • the snapshot data stored on the hard disk cannot be compared directly. Therefore, before snapshot data from different times is compared, it needs to be deserialized back into the specific data structure. The process of comparing the deserialized data is called snapshot comparison.
  • Differences between snapshot data at different times include new elements, deleted elements, style modifications, and text content changes.
  • the above four changes represent the differences between elements of the same web page at different times, respectively:
  • a new element indicates that an element has been added to the same webpage between the different times;
  • a deleted element indicates that an element has been removed from the same webpage between the different times;
  • a style modification indicates that no element has been added to or deleted from the same webpage between the different times, but an element style has changed;
  • a text content change indicates that only the text content of an element has changed in the same webpage between the different times.
  • Step S130 Mark the difference on the screenshot of the page at different times.
  • through the above steps, the differences of the webpage at different times can be obtained.
  • the page screenshots are used to show the differences visually. Specifically, the type of each difference and the position at which it occurs on the page can be marked on the page screenshot.
  • the screenshots of the pages at different moments are stitched together, and the differences between the page data at different moments, that is, the parts that differ between the elements of the same webpage at different times, are marked on the stitched screenshot.
  • different kinds of marks are used on the stitched page screenshot according to the type of difference.
  • each mark is placed at the position on the page screenshot that corresponds to the difference.
  • for example, if the page at the second moment has an added webpage element compared with the page at the first moment, the added element is marked with a colored mark on the screenshot of the second moment.
  • alternatively, the difference can be marked on the screenshot with a marker box (such as a dashed box).
  • marking the differences on the page screenshot shows the differing content on the screenshot; the embodiment of the present invention calls this difference presentation.
  • the above steps are data processing steps taken to implement the webpage change monitoring method provided by the embodiment of the present invention.
  • the main details of the implementation of the present invention are element information snapshot storage, snapshot comparison and difference presentation. These three aspects are respectively described in detail below.
  • FIG. 2 is a flowchart of a webpage snapshot storage according to an embodiment of the present invention. As shown in FIG. 2, the flow of webpage snapshot storage provided by the embodiment of the present invention includes the following steps:
  • Step S210 Access the webpage by using a command line browser.
  • the present invention browses the webpage with a command line browser and injects a script to control how the command line browser accesses the webpage.
  • the PhantomJS browser is preferably used in the embodiment of the present invention, but other command line browsers can also be used.
  • Step S220 Inject a script into the command line browser.
  • Step S230 Access the specified DOM node, and record element attribute information, element style, element label, element content, and element placeholder information.
  • accessing the specified DOM node generally refers to accessing the specified DOM node in the webpage as loaded at different moments.
  • the element attribute information includes the element's HTML attributes (such as id, class, etc.), that is, the attribute names and the attribute values;
  • element styles include background color, borders, shadows, etc.
  • the element placeholder information includes the X coordinate, the Y coordinate, the width, and the height of the element;
  • the element tag is the html tag name, such as body, div, h1, h2, etc.
  • the element content is a collection of child elements.
  • the specified DOM nodes are either all DOM nodes in the page or the DOM nodes that remain after filtering all DOM nodes. Since the elements of the DOM nodes are eventually presented on the page, content on the page can be filtered out by filtering DOM nodes.
  • the filtering of DOM nodes is implemented by having the script control which DOM nodes the command line browser accesses.
  • the DOM nodes accessed by the command line browser are the specified DOM nodes, and the DOM nodes not accessed by the command line browser are the filtered-out DOM nodes.
  • Step S240 Splice the element styles into a string, and compute a hash value for the string with the MD5 algorithm.
  • when storing the element information, the present invention splices the element styles in the element information into a string and then uses the MD5 algorithm (a message digest algorithm) to obtain digest information for this string (that is, to compute its hash value), yielding a 32-character string. Storing this string in the JSON structure saves storage space. If the element style changes, the string necessarily changes as well, and the element is marked as having a modified style during comparison.
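As a minimal sketch of step S240, the Python snippet below joins an element's styles into one string and takes its MD5 digest. The function name `style_hash`, the `name:value` joining format and the sorting of property names are assumptions made for illustration; the source only states that the styles are spliced into a string and hashed with MD5.

```python
import hashlib


def style_hash(styles):
    """Splice element styles into one string and return its MD5 hex digest."""
    # Sorting by property name is an assumption: it makes the spliced string
    # independent of the order in which the styles were collected.
    spliced = ";".join(f"{name}:{value}" for name, value in sorted(styles.items()))
    return hashlib.md5(spliced.encode("utf-8")).hexdigest()


before = style_hash({"background-color": "#fff", "border": "1px solid #000"})
after = style_hash({"background-color": "#fff", "border": "2px solid #000"})
print(len(before))      # 32 hexadecimal characters
print(before != after)  # True: a style change alters the digest
```

Only the digest needs to be kept in the snapshot, so comparing two snapshots for a style change reduces to one string comparison per element.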
  • Step S250 Serialize the element label, the element placeholder information, the element attributes and attribute values, and the hash value of the element into a specific data structure. The specific data structure is usually a JSON structure; that is, the element information is stored as a JSON structure.
  • Step S260 Determine whether the element has a child element; if yes, execute step S230; if not, execute step S270.
  • for a child element, the DOM node of the child element is accessed; the element attribute information, element style, element label, and element placeholder information are recorded; the element style is spliced into a string; the hash value of the string is computed with the MD5 algorithm; and the element attribute information, hash value, element label, and element placeholder information are then serialized into a JSON structure.
  • in this way, the specified DOM node and its child nodes in the page loaded at different times can be accessed, and the element label, element placeholder information, element style, element attributes, and attribute values of the elements in each node are recorded.
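The recursion in steps S230 to S260 can be sketched as follows. This is an illustration only: in the embodiment the traversal runs in a script injected into the command line browser, whereas here a hypothetical `Node` class stands in for a DOM node, and the JSON field names (`tag`, `rect`, `styleHash`, and so on) as well as the `keep` filter callback are assumptions.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node; not a browser API."""
    tag: str
    attributes: dict = field(default_factory=dict)
    styles: dict = field(default_factory=dict)
    rect: tuple = (0, 0, 0, 0)  # x, y, width, height
    text: str = ""
    children: list = field(default_factory=list)


def snapshot(node, keep=lambda n: True):
    """Record one element, then recurse into the children that pass the filter."""
    spliced = ";".join(f"{k}:{v}" for k, v in sorted(node.styles.items()))
    x, y, width, height = node.rect
    return {
        "tag": node.tag,
        "attributes": dict(node.attributes),
        "rect": {"x": x, "y": y, "width": width, "height": height},
        "styleHash": hashlib.md5(spliced.encode("utf-8")).hexdigest(),
        "text": node.text,
        "children": [snapshot(c, keep) for c in node.children if keep(c)],
    }


page = Node("body", rect=(0, 0, 320, 480), children=[
    Node("h1", {"id": "title"}, {"font-style": "normal"}, (10, 10, 300, 40), "Hello"),
    Node("div", {"class": "ad"}, {}, (10, 60, 300, 100), "random banner"),
])
# Filter out a region with random content, then serialize the JSON structure.
record = snapshot(page, keep=lambda n: n.attributes.get("class") != "ad")
serialized = json.dumps(record)
```

Filtering at traversal time means the excluded region never enters the snapshot, so it cannot produce differences later.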
  • Step S270 Store the data of the specific data structure in the file system; if the specific data structure is a JSON structure, store the JSON data in the file system.
  • the element information of the obtained JSON structure is stored in the file system, and the element information in the file system is a JSON structure.
  • file system refers to the file system of the user's operating system.
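A minimal sketch of step S270 and of the later read-back, assuming one JSON file per snapshot time. The file naming scheme and the helper names `save_snapshot` and `load_snapshot` are hypothetical; the source only says that the serialized JSON structure is stored in the file system and deserialized before comparison.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(directory, timestamp, record):
    """Serialize a snapshot record and store it in the file system."""
    path = Path(directory) / f"snapshot-{timestamp}.json"
    path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    return path


def load_snapshot(path):
    """Deserialize stored snapshot data back into a JSON structure."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as directory:
    record = {"tag": "body", "attributes": {}, "children": []}
    stored = save_snapshot(directory, "t1", record)
    restored = load_snapshot(stored)
    print(restored == record)  # True: the round trip preserves the structure
```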
  • the page screenshot can be implemented in various ways, which are not described in detail here.
  • the above steps S210-S260 are the data processing steps taken to implement webpage snapshot storage. Taking snapshots of the webpage elements makes it possible to monitor and compare the historical changes of the same webpage, and also to mask specified content on the webpage.
  • for example, areas of random content on the webpage can be excluded, which improves the flexibility of webpage change monitoring.
  • in addition, the amount of data stored for webpage snapshots can be reduced, which reduces the amount of data for snapshot comparison and difference presentation and improves the efficiency of webpage change monitoring.
  • FIG. 3 shows a flow of snapshot comparison according to an embodiment of the present invention. As shown in FIG. 3, the flow of snapshot comparison provided by the embodiment of the present invention includes the following steps:
  • Step S310 Input two historical time points, and read two sets of snapshot data according to the two historical time points.
  • the following takes time t1 and time t2 as the historical time points and compares the snapshot data at time t1 with the snapshot data at time t2, where time t1 is farther from the current time and time t2 is closer to the current time.
  • reading the snapshot data means reading the element information of the JSON structure. Since the element information of the JSON structure was serialized and stored as snapshot data, the snapshot data must be deserialized to recover the element information of the JSON structure before it can be read.
  • Step S320 Determine whether the two element styles are consistent. If they are consistent, perform step S340; if not, perform step S330.
  • Step S330 Record element style modification difference.
  • the LCS algorithm is the Longest Common Subsequence, which is a prior art and will not be described in detail.
  • the longest common subsequence, by element label and element attributes, of the child elements of the two elements is the set of unaltered child elements shared by the two sets of snapshot data at time t1 and time t2, that is, the unaltered part of the page screenshots at time t1 and time t2.
  • computing the longest common subsequence of the element labels and element attributes of the two sets of child elements serves to determine whether elements of the page have been deleted, added, or modified.
  • Step S360 Mark the difference content of the two sub-elements according to the longest common sub-sequence in which the element label and the element attribute of the two sub-elements are consistent.
  • if a child element in the longest common subsequence is a text child element, it is checked whether its text content has changed: if so, a text content change is recorded; if not, the text content is unchanged. For the other child elements, those in the snapshot data at time t1 that do not belong to the common subsequence are marked as deleted elements, and those in the snapshot data at time t2 that do not belong to the common subsequence are marked as new elements.
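The comparison of child elements can be sketched with a textbook dynamic-programming LCS. This is one possible reading of steps S320 to S360, not the patented implementation: the element key (label plus attributes), the snapshot field names `tag`, `attributes`, `styleHash` and `text`, and the difference labels are assumptions carried over for illustration.

```python
def lcs_pairs(old, new, key):
    """Return index pairs (i, j) forming a longest common subsequence."""
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if key(old[i]) == key(new[j]):
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if key(old[i]) == key(new[j]):
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def element_key(element):
    """Elements match when their label and attributes are identical."""
    return (element["tag"], tuple(sorted(element["attributes"].items())))


def diff_children(old, new):
    """Classify child elements of the t1 (old) and t2 (new) snapshots."""
    pairs = lcs_pairs(old, new, element_key)
    kept_old = {i for i, _ in pairs}
    kept_new = {j for _, j in pairs}
    diffs = [("deleted", e) for i, e in enumerate(old) if i not in kept_old]
    diffs += [("added", e) for j, e in enumerate(new) if j not in kept_new]
    for i, j in pairs:
        if old[i].get("styleHash") != new[j].get("styleHash"):
            diffs.append(("style_modified", new[j]))
        if old[i].get("text", "") != new[j].get("text", ""):
            diffs.append(("text_changed", new[j]))
    return diffs


t1 = [{"tag": "h1", "attributes": {}, "text": "Hello"},
      {"tag": "p", "attributes": {}, "text": "hello world"}]
t2 = [{"tag": "p", "attributes": {}, "text": "hello world!"},
      {"tag": "div", "attributes": {}, "text": "i am new here"}]
print([kind for kind, _ in diff_children(t1, t2)])
# ['deleted', 'added', 'text_changed']
```

A full comparison would apply `diff_children` recursively to the children of each matched pair; that recursion is omitted here to keep the sketch short.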
  • element modification includes modification of element content and modification of element style, where the element content is the string in a text child element.
  • webpage changes are divided into three categories: element addition, element deletion, and element modification. If the visual content of a webpage changes, the change must fall into one of these three categories.
  • element addition, element deletion, and element modification are mutually exclusive categories: if an element is new, it cannot also be deleted or modified, and if an element is deleted, it cannot also be new or modified. Deleted elements are marked in the page screenshot at time t1, and new elements are marked in the page screenshot at time t2.
  • modified elements are marked in the page screenshot at time t2.
  • Step S370 Output a set of elements of all differences.
  • the LCS algorithm can compare the specific data structures recorded at different times to determine the difference between the page data loaded by the same webpage at different times.
  • the difference between the structure, the style, and the content at the historical time point of the webpage can be obtained by the above-mentioned web page snapshot storage and snapshot comparison stage.
  • the differences are marked on the page screenshots of the different times according to the type of each difference and its position on the page. Since the stored webpage snapshot data records the placeholder information (coordinates, width, and height) of all elements, and a screenshot of the page was saved at the same time, it is possible to stitch the screenshots of the two time points together and mark the three kinds of webpage differences (new elements, deleted elements, and modified elements) on the stitched screenshot.
  • the three differences can be marked on the screenshot by different colors, or three differences can be marked on the screenshot by other means. For example, mark the difference on the screenshot by a marker box (such as a dashed box).
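The placement of marks on a stitched screenshot can be sketched as a coordinate calculation. The sketch assumes the two screenshots are stitched side by side, with the t1 screenshot on the left; the function name `stitched_marks`, the `gap` parameter and the color choices are hypothetical. Drawing the boxes onto the image is left out, since the source does not name an imaging library.

```python
# Colors are an assumption; the source only says different colors may be used.
MARK_COLORS = {"added": "green", "deleted": "red", "modified": "orange"}


def stitched_marks(differences, left_width, gap=0):
    """Map each difference to a box on the side-by-side stitched screenshot.

    `differences` is a list of (kind, rect) pairs, where rect holds the
    element placeholder information recorded in the snapshot.
    """
    marks = []
    for kind, rect in differences:
        # Deleted elements exist only in the t1 (left) screenshot; new and
        # modified elements are marked on the t2 (right) screenshot.
        x_offset = 0 if kind == "deleted" else left_width + gap
        marks.append({
            "kind": kind,
            "color": MARK_COLORS[kind],
            "box": (rect["x"] + x_offset, rect["y"], rect["width"], rect["height"]),
        })
    return marks


marks = stitched_marks(
    [("deleted", {"x": 10, "y": 20, "width": 100, "height": 30}),
     ("added", {"x": 10, "y": 60, "width": 100, "height": 30})],
    left_width=320,
    gap=10,
)
print([m["box"] for m in marks])
# [(10, 20, 100, 30), (340, 60, 100, 30)]
```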
  • FIGS. 4a to 4d show the results of difference presentation according to the embodiment of the present invention. In each figure, the left side is the page screenshot at time t1 and the right side is the page screenshot at time t2.
  • FIG. 4a shows the presentation result for a new element: the new element of the webpage, "i am new here", is marked with a dashed box on the right side of the figure.
  • FIG. 4b shows the presentation result for a deleted element: the deleted element of the webpage, "Hello", is marked with a dashed box on the left side of the figure.
  • FIG. 4c shows the presentation result for a modified element style: the element whose style was modified is marked with a dashed box on the right side of the figure. The style of "hello world" was modified by italicizing the font and adding an underline.
  • FIG. 4d shows the presentation result for modified element text: the element whose text was modified is marked with a dashed box on the right side of the figure.
  • FIG. 5 shows a logical structure of a webpage change monitoring apparatus according to an embodiment of the present invention.
  • the webpage change monitoring apparatus 500 includes a page screenshot unit 510, a page data recording unit 520, a difference determining unit 530, and a difference marking unit 540.
  • the page screenshot unit 510 is configured to save a screenshot of a page loaded at different times on the same webpage.
  • the page data recording unit 520 is configured to separately record page data loaded by the same web page at different times; wherein the page data loaded at different times of the same web page is recorded as a corresponding specific data structure.
  • the difference determining unit 530 is configured to compare the specific data structures recorded at different time points, and determine the difference between the page data after the same web page is loaded at different times.
  • the difference marking unit 540 is used to mark the difference on the screenshot of the page at different times.
  • FIG. 6 shows a logical structure of a specific embodiment of a webpage change monitoring apparatus according to an embodiment of the present invention.
  • the page data recording unit 520 includes a DOM node access module 521, an element information recording module 522, an element style splicing module 523, and an element information storage module 524.
  • the DOM node accessing module 521 is configured to access the specified DOM node and its child nodes in the webpage loaded at different moments;
  • the element information recording module 522 is configured to record the element style, the element attribute information, the element content, the element placeholder information, and the element label in the specified DOM node;
  • the element style splicing module 523 is used to splice the recorded element style into a string and compute a hash value for the string;
  • the element information storage module 524 is used to serialize the element label, the element placeholder information, the element attributes and attribute values, and the hash value, and store them as a specific data structure.
  • the specific data structure usually adopts a JSON structure.
  • the specified DOM nodes accessed by the DOM node access module are either all DOM nodes in the page or the DOM nodes that remain after filtering all DOM nodes.
  • the difference determining unit 530 compares, by means of the LCS algorithm, the specific data structures recorded at different times to determine the difference between the page data after the same web page is loaded at different times.
  • the difference marking unit 540 marks the difference on the page screenshot at different times according to the type of the difference and the position of the difference on the page.
  • the present invention further provides a device terminal.
  • the device terminal 700 includes a file system 710 for storing page screenshots and snapshot data, and a webpage change monitoring device 500.
  • the webpage change monitoring device includes:
  • the page screenshot unit is configured to save a screenshot of a page loaded at different times on the same webpage
  • a page data recording unit configured to separately record page data loaded by the same webpage at different times; wherein the page data loaded at different times of the same webpage is recorded as a corresponding specific data structure;
  • a difference determining unit configured to compare a specific data structure recorded at different moments, and determine a difference between page data after loading the same webpage at different times;
  • a difference marking unit for marking the difference on a screenshot of the page at different times.
  • the webpage change monitoring device has the structure described in FIG. 6, and the details are not described herein.
  • the present invention provides a computer readable medium having program code executable by a processor, the program code causing the processor to perform the following steps:
  • the above describes in detail the webpage change monitoring method and apparatus provided by the present invention. By taking screenshots of the webpage content at different time points and recording the element information of the webpage at those time points as a specific data structure, snapshot data of the same webpage at different time points is recorded. The snapshot data at any two time points is compared to find the differences, and the differences between the two sets of snapshot data are marked on the screenshots of the two time points, so that changes to the same webpage can be accurately monitored and compared.
  • the method according to the invention can also be implemented as a computer program executed by a processor (such as a CPU) in the terminal and stored in the memory of the terminal.
  • the above-described functions defined in the method of the present invention are performed when the computer program is executed by the processor.
  • the method steps and system units described above may also be implemented with a controller and a computer readable storage device for storing a computer program that causes the controller to implement the steps or unit functions described above.

Abstract

A method and a device for monitoring web page changes are provided. The method comprises: separately recording loaded page data for a same web page at different times, and taking and saving screen captures of the loaded pages of said same web page at different times; recording as specific data structures the data structures of the loaded page data for said same web page at different times; by comparing the specific data structures recorded at different times, determining differences between loaded page data for said same web page at different times; and marking each of said differences on the page screen captures from different times. Use of the method and device for monitoring web page changes allows for the accurate monitoring and comparison of changes that have occurred to a same web page at different times.

Description

Webpage change monitoring method and device
The present application claims priority to Chinese Patent Application No. 201410652444.3, entitled "Webpage change monitoring method and device" and filed with the Chinese Patent Office on November 17, 2014, the entire contents of which are incorporated herein by reference.
技术领域Technical field
本发明涉及移动互联网技术领域,更为具体地,涉及一种网页变化监控方法及装置。The present invention relates to the field of mobile internet technologies, and in particular, to a webpage change monitoring method and apparatus.
背景技术Background technique
互联网以快速迭代著称,web应用会每周进行多次产品发布及运营内容更新,因此,企业对产品进行网页监控成为企业网页管理的重点之一。The Internet is known for its rapid iterations. Web applications perform product releases and operational content updates several times a week. Therefore, enterprise web monitoring of products has become one of the focuses of corporate web management.
目前,大多数企业对产品进行页面监控及对比的方法都是基于页面截图后的图片像素对比实现的,其误报率高。因此,对网页的历史修改做快照,并对两次历史快照间的差异进行对比,标记差异位置,成为企业对产品进行监控的迫切需求。At present, most companies' methods of page monitoring and comparison of products are based on the comparison of image pixels after screenshots, and the false positive rate is high. Therefore, taking snapshots of historical changes of web pages and comparing the differences between the two historical snapshots, marking the difference locations, has become an urgent need for enterprises to monitor products.
因此,如何能够准确地监控、对比同一网页的变化成为当前企业网页监控的主要问题。Therefore, how to accurately monitor and compare the changes of the same web page has become the main problem of current corporate webpage monitoring.
Summary of the invention
In view of the above problems, an object of the present invention is to provide a webpage change monitoring method and device that record the page data of a web page at different times in a structured form and compare the records to find the differences between the web page at different times, and that mark the differences found on screenshots of the web page, thereby improving the accuracy of webpage difference comparison and making webpage monitoring more convenient.
The webpage change monitoring method provided by the present invention includes:
separately recording the page data of the same web page after it is loaded at different times, and taking and saving screenshots of the pages of the same web page loaded at different times, wherein the page data of the same web page loaded at different times is recorded as a corresponding specific data structure;
determining, by comparing the specific data structures recorded at different times, the differences between the page data of the same web page loaded at different times;
marking the differences on the page screenshots taken at the respective times.
Optionally, the process of recording the page data of the same web page loaded at different times as a corresponding specific data structure includes:
accessing the specified DOM nodes and their child nodes in the pages loaded at different times, recording the element tag, element placeholder information, element style, element attributes and attribute values of the element in each node, concatenating the element style into a string, and computing a hash value of the string;
serializing the element tag, element placeholder information, element attributes and attribute values, and the hash value of the element, and storing them as the specific data structure.
Optionally, the specified nodes are all DOM nodes in the page, or the DOM nodes that remain after all DOM nodes have been filtered.
Optionally, determining the differences between the page data of the same web page loaded at different times by comparing the specific data structures recorded at different times includes:
comparing the specific data structures recorded at different times according to the LCS algorithm to determine the differences between the page data of the same web page loaded at different times.
Optionally, marking the differences on the page screenshots taken at the respective times includes:
marking each difference on the page screenshots taken at the respective times according to the type of the difference and the position of the difference on the page.
The webpage change monitoring device provided by the present invention includes:
a page data recording unit, configured to separately record the page data of the same web page after it is loaded at different times, wherein the page data of the same web page loaded at different times is recorded as a corresponding specific data structure;
a page screenshot unit, configured to take and save screenshots of the pages of the same web page loaded at different times;
a difference determining unit, configured to compare the specific data structures recorded at different times and determine the differences between the page data of the same web page loaded at different times;
a difference marking unit, configured to mark the differences on the page screenshots taken at the respective times.
Optionally, the page data recording unit includes:
a DOM node access module, configured to access the specified DOM nodes and their child nodes in the pages loaded at different times;
an element information recording module, configured to record the element style, element attribute information, element content, element placeholder information and element tag in the specified DOM nodes;
an element style concatenation module, configured to concatenate the element style into a string and compute a hash value of the string;
an element information storage module, configured to serialize the element tag, element placeholder information, element attributes and attribute values, and the hash value of the element, and store them as the specific data structure.
Optionally, the specified nodes accessed by the DOM node access module are all DOM nodes in the page, or the DOM nodes that remain after all DOM nodes have been filtered.
Optionally, the difference determining unit compares the specific data structures recorded at different times according to the LCS algorithm to determine the differences between the page data of the same web page loaded at different times.
Optionally, the difference marking unit marks each difference on the page screenshots taken at the respective times according to the type of the difference and the position of the difference on the page.
The present invention further provides a computer-readable medium carrying processor-executable program code, the program code causing a processor to perform the following steps:
separately recording the page data of the same web page after it is loaded at different times, and taking and saving screenshots of the pages of the same web page loaded at different times, wherein the page data of the same web page loaded at different times is recorded as a corresponding specific data structure;
determining, by comparing the specific data structures recorded at different times, the differences between the page data of the same web page loaded at different times;
marking the differences on the page screenshots taken at the respective times.
The webpage change monitoring method and device provided by the present invention take screenshots of the pages of the same web page loaded at different times, record the page data of the same web page loaded at different times as specific data structures, compare the specific data structures of any two times to find the parts that differ, and mark those parts on the screenshots of the two times. Changes to the same web page at different times can thus be compared accurately, which facilitates webpage monitoring.
To achieve the above and related objects, one or more aspects of the present invention include the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings set forth certain illustrative aspects of the invention in detail. These aspects, however, indicate only some of the various ways in which the principles of the invention may be employed. Furthermore, the invention is intended to cover all such aspects and their equivalents.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a webpage change monitoring method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of web page snapshot storage according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of snapshot comparison according to an embodiment of the present invention;
FIGS. 4a to 4d are diagrams of difference presentation results according to an embodiment of the present invention;
FIG. 5 is a logical structure diagram of a webpage change monitoring device according to an embodiment of the present invention;
FIG. 6 is a logical structure diagram of a specific implementation of the webpage change monitoring device according to an embodiment of the present invention;
FIG. 7 is a logical structure diagram of a device terminal according to an embodiment of the present invention.
Throughout the drawings, the same reference numerals indicate similar or corresponding features or functions.
Detailed description
Various aspects of the present disclosure are described below. It should be understood that the teachings herein may be embodied in a wide variety of forms and that any specific structure, function, or both disclosed herein is merely representative. Based on the teachings herein, a person skilled in the art should appreciate that an aspect disclosed herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, a device may be implemented or a method practiced using any number of the aspects set forth herein. In addition, such a device may be implemented or such a method practiced using other structures or functions, or structures and functions in addition to or other than one or more of the aspects set forth herein. Furthermore, any aspect described herein may include at least one element of a claim.
For changes to the same web page, existing webpage comparison methods compare the pixels of page screenshots and have a high false positive rate. To address this, the present invention records the page data of the web page as a specific data structure and, by comparing the differences between such structures, marks which page data has been modified. The modified page data is the content of the web page that changed, which lowers the false positive rate of webpage comparison.
Here, page data means web page elements, that is, the elements that make up the content of a web page, including text, images, audio, animation, video, and so on.
Specific embodiments of the present invention are described in detail below with reference to the drawings.
FIG. 1 shows the flow of a webpage change monitoring method according to an embodiment of the present invention.
As shown in FIG. 1, the webpage change monitoring method provided by the embodiment of the present invention includes:
Step S110: Separately record the page data of the same web page after it is loaded at different times, and take and save screenshots of the pages of the same web page loaded at different times, wherein the page data of the same web page loaded at different times is recorded as a corresponding specific data structure.
Here, the same web page means a web page with the same URL, and page data means web page elements, whose data structure is the DOM (Document Object Model) structure. Recording the page data of the same web page loaded at different times as a corresponding specific data structure therefore means recording the DOM structure of the web page elements as a specific data structure. Recording the DOM structure as a specific data structure and taking the page screenshot may be performed in either order.
The times at which the page data of the web page is recorded and the times at which the page screenshots are taken correspond one to one. For example, the page data of the web page is recorded at a first time and at a second time, and a screenshot of the web page is taken and saved at the first time and at the second time respectively.
In addition, a web page element includes element style, element attribute information, element content, element tag and element placeholder information.
Because the DOM structure carries a large amount of element data, comparing elements directly on it requires too much computation. The present invention therefore records the DOM structure of the web page elements as a specific data structure to reduce the computation needed for element comparison. In the embodiment of the present invention the specific data structure is a JSON structure (JavaScript Object Notation, a lightweight data-interchange format), but the DOM structure of the web page elements may also be recorded as another specific data structure.
Since the elements of the JSON structure cannot be stored on the hard disk directly, they must be serialized into a format that the hard disk can store and then stored on the hard disk. The embodiment of the present invention calls the process of recording the DOM structure of the web page elements as a JSON structure and storing it in serialized form web page snapshot storage. The elements stored on the hard disk are the snapshot data, which include the hash value of the element style, the element attribute information, the element content, the element tag and the element placeholder information.
Step S120: Determine, by comparing the specific data structures recorded at different times, the differences between the page data of the same web page loaded at different times.
Comparing the specific data structures recorded at different times means finding the parts in which the JSON-structured web page elements of different times differ, that is, comparing the snapshot data of different times, and thereby determining the differences between the page data of the same web page loaded at different times.
Because the snapshot data stored on the hard disk cannot be compared directly, the snapshot data of the different times must be deserialized into the specific data structure before their differences are compared. The embodiment of the present invention calls the process of comparing the snapshot data of different times snapshot comparison.
The differences between the snapshot data of different times are of four kinds: added elements, deleted elements, style modifications and text content changes. They describe how the elements of the same web page differ between times, as follows:
an added element means that the web page has gained an element between the two times;
a deleted element means that the web page has lost an element between the two times;
a style modification means that no element has been added to or deleted from the web page between the two times, but the style of an element has changed;
a text content change means that only the text content of an element of the web page has changed between the two times.
Step S130: Mark the differences on the page screenshots taken at the respective times.
Comparing the web page data structures recorded at different times yields the differences between the web page at those times. The page screenshots are used to present the differences visually. Specifically, the type of each difference and the position at which it occurs on the page can be marked on the page screenshots.
To make the page screenshots from different times easier to compare, they are stitched together, and the differences between the page data at different times are marked on the stitched screenshot; that is, the parts in which the elements of the same web page differ between times are marked on the stitched screenshot. Marking can be done in many ways. In one of them, each difference is marked on the stitched screenshot in a color chosen by its type, at the position where the difference occurs on the page screenshot. For example, if the page at the second time has one more web page element than the page at the first time, a colored mark is placed at the position of the added element on the screenshot corresponding to the second time. Alternatively, differences can be marked on the screenshot with a marker box (such as a dashed box).
Marking the differences on the page screenshots amounts to presenting the differing content on them, which the embodiment of the present invention calls difference presentation.
The above are the data processing steps of the webpage change monitoring method provided by the embodiment of the present invention. The main implementation details lie in the snapshot storage of element information, snapshot comparison and difference presentation, each of which is described in detail below.
I. Web page snapshot storage
FIG. 2 shows the flow of web page snapshot storage according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:
Step S210: Access the web page with a command-line browser.
Because the elements of the web page must be operated on while the page is being accessed, the present invention browses the web page with a command-line browser, which is controlled to access the page by a script injected into it. The embodiment of the present invention preferably uses the PhantomJS browser, but other command-line browsers may also be used.
Step S220: Inject a script into the command-line browser.
After the web page has finished loading, a script for operating on the elements of the web page is injected into the command-line browser.
Step S230: Access the specified DOM nodes and record the element attribute information, element style, element tag, element content and element placeholder information. Accessing the specified DOM nodes generally means accessing the specified DOM nodes in the web page as loaded at each of the different times.
Element attribute information covers the element's HTML attributes, such as id and class: the attributes themselves, their names and their values;
element style includes background color, border, drop shadow, and so on;
element placeholder information includes the X coordinate, Y coordinate, width and height of the element;
the element tag is the HTML tag name, such as body, div, h1 or h2;
element content is the collection of child elements.
The specified DOM nodes are either all DOM nodes in the page or the DOM nodes that remain after all DOM nodes have been filtered. Because the elements of a DOM node are ultimately rendered on the page, content on the page can be masked by filtering out the corresponding DOM node. The script tells the command-line browser which DOM nodes to access: the nodes the browser accesses are the specified DOM nodes, and the nodes it does not access are the filtered-out DOM nodes. Filtering out DOM nodes that need not be accessed excludes regions of random content on the web page and makes webpage change monitoring more flexible. It also reduces the amount of data stored for web page snapshots and the amount of data handled by snapshot comparison and difference presentation, which makes webpage change monitoring more efficient.
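The filtering described above can be sketched as follows. This is a minimal illustration, not the patent's script: it assumes the page has already been reduced to nested dictionaries with hypothetical keys `tag`, `attrs` and `children`, and that nodes are excluded by their `id` attribute; the function name `filter_tree` and the id `random-banner` are likewise made up for the example.

```python
from typing import Optional, Set


def filter_tree(node: dict, skip_ids: Set[str]) -> Optional[dict]:
    """Return a copy of the tree without nodes (and subtrees) whose id is in skip_ids."""
    if node.get("attrs", {}).get("id") in skip_ids:
        return None  # filtered out: neither recorded nor compared later
    kept = []
    for child in node.get("children", []):
        filtered = filter_tree(child, skip_ids)
        if filtered is not None:
            kept.append(filtered)
    # Copy the node so the input tree is left untouched.
    return {**node, "children": kept}


# Hypothetical page: one stable region and one region with random content.
page = {
    "tag": "body",
    "attrs": {},
    "children": [
        {"tag": "div", "attrs": {"id": "content"}, "children": []},
        {"tag": "div", "attrs": {"id": "random-banner"}, "children": []},
    ],
}
monitored = filter_tree(page, {"random-banner"})
print([child["attrs"]["id"] for child in monitored["children"]])
```

Only the `content` region survives the filter, so the banner would never reach the stored snapshot or the later comparison.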
Step S240: Concatenate the element style into a single string and compute a hash value of the string with the MD5 algorithm.
A page contains a large number of elements, and recording the complete information of every element would take too much storage. When storing element information, the present invention therefore concatenates the element style within it into a single string and applies the MD5 algorithm (a message digest algorithm) to that string to obtain its digest, that is, its hash value. The result is a 32-byte string (the 32-character hexadecimal form of the digest) that can be stored in the JSON structure, which saves storage space. If the element style changes, the string changes as well, so the comparison can flag the element's style as modified.
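As a rough illustration of the style hashing step, the sketch below joins a dictionary of style properties into one string and hashes it with Python's `hashlib.md5`. The `name:value;` layout and the sorting of property names are assumptions made here for determinism; the patent does not say how the style string is assembled.

```python
import hashlib


def style_hash(styles: dict) -> str:
    """Concatenate style properties into one string and return its MD5 hex digest."""
    # Sorting by property name is an assumption of this sketch: it makes the
    # string independent of the order in which properties were enumerated.
    joined = ";".join(f"{name}:{value}" for name, value in sorted(styles.items()))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


before = style_hash({"background-color": "#fff", "border": "1px solid #000"})
after = style_hash({"background-color": "#eee", "border": "1px solid #000"})
print(len(before), before != after)
```

The digest is always 32 hexadecimal characters regardless of how many style properties the element has, which is what keeps the stored snapshot small.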
Step S250: Serialize the element tag, element placeholder information, element attributes and attribute values, and the hash value of the element, and store them as the specific data structure, which is usually a JSON structure. In other words, the element information is stored as a JSON structure.
Step S260: Determine whether the element has child elements; if so, perform step S230 on them; if not, perform step S270.
If the element has child elements, the DOM nodes of the child elements are accessed and their element attribute information, element style, element tag and element placeholder information are recorded. The element style is concatenated into a string, the digest of the string is computed with the MD5 algorithm to obtain its hash value, and the element attribute information, hash value, element tag and element placeholder information are then serialized and stored as a JSON structure.
Steps S230 to S260 thus access the specified DOM nodes and their child nodes in the pages loaded at different times, record the element tag, element placeholder information, element style, element attributes and attribute values of the element in each node, concatenate the element style into a string, compute a hash value of the string, and serialize the element tag, element placeholder information, element attributes and attribute values, and the hash value of the element for storage as the specific data structure.
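The recursion over a node and its children in steps S230 to S260 might look like the following sketch. It works on a simplified in-memory tree rather than a live DOM, and every key name (`tag`, `box`, `attrs`, `styles`, `style_hash`, `children`) is an assumption chosen for the example, not a format defined by the patent.

```python
import hashlib


def record(node: dict) -> dict:
    """Build the snapshot entry for one node, then recurse into its children."""
    # The full style map is replaced by a fixed-size MD5 digest.
    style_text = ";".join(f"{k}:{v}" for k, v in sorted(node.get("styles", {}).items()))
    return {
        "tag": node["tag"],
        "box": node.get("box"),  # x, y, width, height
        "attrs": node.get("attrs", {}),
        "style_hash": hashlib.md5(style_text.encode("utf-8")).hexdigest(),
        "children": [record(child) for child in node.get("children", [])],
    }


# Hypothetical input, standing in for what an injected script would read.
page = {
    "tag": "body",
    "box": {"x": 0, "y": 0, "width": 320, "height": 480},
    "styles": {"background-color": "#fff"},
    "children": [
        {
            "tag": "h1",
            "box": {"x": 0, "y": 0, "width": 320, "height": 40},
            "attrs": {"class": "title"},
            "styles": {"color": "#333"},
        }
    ],
}
snap = record(page)
print(snap["children"][0]["tag"], len(snap["style_hash"]))
```

The resulting structure contains only plain dictionaries, lists, strings and numbers, so it can be serialized as JSON without further conversion.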
Step S270: Store the data of the specific data structure in the file system; if the specific data structure is a JSON structure, the JSON data is stored in the file system.
After all nodes have been traversed, the JSON-structured element information obtained is stored in the file system, where it is kept as a JSON structure.
The file system here is the file system of the user's operating system.
Page screenshots can be taken in many ways, which the present invention does not describe in detail.
Steps S210 to S260 above are the data processing steps taken to implement web page snapshot storage. They store snapshots of the web page elements, which makes it possible to monitor and compare the historical changes of the same web page and also to mask specified content on the web page. Masking specified content excludes regions of random content on the web page and makes webpage change monitoring more flexible. It also reduces the amount of data stored for web page snapshots and the amount of data handled by snapshot comparison and difference presentation, which makes webpage change monitoring more efficient.
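A small round trip through the file system can be sketched with Python's standard `json` module. The file naming scheme `snapshot-<label>.json` and the helper names `save_snapshot` and `load_snapshot` are hypothetical; the patent only states that the JSON data is written to the file system and deserialized again before comparison.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(snapshot: dict, directory: Path, label: str) -> Path:
    """Serialize a snapshot to JSON text and write it to a file named after the label."""
    path = directory / f"snapshot-{label}.json"
    path.write_text(json.dumps(snapshot, ensure_ascii=False, sort_keys=True), encoding="utf-8")
    return path


def load_snapshot(path: Path) -> dict:
    """Read a stored snapshot and deserialize it back into nested dicts and lists."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    original = {
        "tag": "body",
        "box": {"x": 0, "y": 0, "width": 320, "height": 480},
        "children": [],
    }
    stored = save_snapshot(original, Path(tmp), "t1")
    restored = load_snapshot(stored)

print(restored == original)
```

Labelling each file with the time at which the snapshot was taken is one simple way to let a later comparison look up the two snapshots it needs.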
II. Snapshot comparison
Once web page snapshots have been stored, the content of the web page at different times must be compared whenever the content changes, that is, the web page snapshots are compared. FIG. 3 shows the flow of snapshot comparison according to an embodiment of the present invention. As shown in FIG. 3, the flow includes the following steps:
Step S310: Input two historical time points and read the two corresponding sets of snapshot data.
In the following, times t1 and t2 serve as the historical time points, and the snapshot data at t1 is compared with the snapshot data at t2. Time t1 is the earlier of the two (further from the current time) and t2 the later (closer to the current time).
Reading the snapshot data means reading the JSON-structured element information. Because that information was serialized when it was stored as snapshot data, the snapshot data must first be deserialized to recover the JSON-structured element information before it can be read.
Step S320: Determine whether the styles of the two elements are identical; if so, perform step S340; if not, perform step S330.
The element styles in the element information, that is, the 32-byte strings obtained earlier, are compared first. If the strings in the two JSON structures are the same, the element style has not been modified; if they differ, the element style has been modified.
Step S330: Record the element style modification as a difference.
Step S340: Determine whether the two elements have child elements; if so, perform step S350; if not, perform step S370.
Step S350: Use the LCS algorithm to find the longest common subsequence of the two elements' child elements, treating child elements as equal when their element tags and element attributes are identical.
The LCS algorithm, that is, the Longest Common Subsequence algorithm, is prior art and is not described in detail in the present invention.
The longest common subsequence of child elements with identical element tags and element attributes is the set of unchanged child elements shared by the snapshot data at t1 and at t2, that is, the part that is unchanged between the page screenshots at t1 and t2.
The longest common subsequence is computed in order to determine whether elements have been deleted from the web page, added to it, or modified.
Step S360: Mark the differing content of the two sets of child elements according to the longest common subsequence of child elements with identical element tags and element attributes.
If a child element in the longest common subsequence is a text child element, it is checked whether its text content has changed: if it has, the text content is marked as changed; if not, the text content is unchanged. For the other child elements, those in the snapshot data at t1 that are not part of the common subsequence are marked as deleted elements, and those in the snapshot data at t2 that are not part of the common subsequence are marked as added elements.
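The comparison in steps S320 to S360 can be approximated by the sketch below: a textbook dynamic-programming LCS over the child lists, keyed by tag and attributes, followed by classification of unmatched and matched children. The snapshot layout (keys `tag`, `attrs`, `style_hash`, `text`, `children`, and the pseudo-tag `#text` for text nodes) and the difference labels are assumptions of this example rather than the patent's actual format.

```python
def node_key(node):
    """Children count as 'the same element' when tag and attributes match."""
    return (node["tag"], tuple(sorted(node.get("attrs", {}).items())))


def lcs_pairs(a, b):
    """Index pairs (i, j) forming a longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if node_key(a[i]) == node_key(b[j]):
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if node_key(a[i]) == node_key(b[j]):
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def diff(old, new, out=None):
    """Collect (kind, node) differences between two matched snapshot nodes."""
    out = [] if out is None else out
    if old.get("style_hash") != new.get("style_hash"):
        out.append(("style_changed", new))
    if old["tag"] == "#text" and old.get("text") != new.get("text"):
        out.append(("text_changed", new))
    old_kids, new_kids = old.get("children", []), new.get("children", [])
    pairs = lcs_pairs(old_kids, new_kids)
    kept_old = {i for i, _ in pairs}
    kept_new = {j for _, j in pairs}
    out.extend(("removed", k) for i, k in enumerate(old_kids) if i not in kept_old)
    out.extend(("added", k) for j, k in enumerate(new_kids) if j not in kept_new)
    for i, j in pairs:  # matched children are compared recursively
        diff(old_kids[i], new_kids[j], out)
    return out


t1 = {"tag": "div", "attrs": {"id": "root"}, "style_hash": "a", "children": [
    {"tag": "h1", "style_hash": "h", "children": [{"tag": "#text", "text": "Hello"}]},
    {"tag": "p", "attrs": {"class": "old"}, "style_hash": "p"},
]}
t2 = {"tag": "div", "attrs": {"id": "root"}, "style_hash": "a", "children": [
    {"tag": "h1", "style_hash": "h2", "children": [{"tag": "#text", "text": "Hello, world"}]},
    {"tag": "img", "attrs": {"src": "a.png"}, "style_hash": "i"},
]}
kinds = [kind for kind, _ in diff(t1, t2)]
print(kinds)  # ['removed', 'added', 'style_changed', 'text_changed']
```

In this sample the `p` element is reported as removed, the `img` as added, and the matched `h1` as having both a style change and a text change in its text child.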
Element modification covers changes to element content and changes to element style; element content is the string in a text child element.
Changes to a web page fall into three categories, each corresponding to one case. In other words, if the visible content of a web page changes, the change must be one of these three cases:
1. Some elements were deleted, corresponding to deleted elements;
2. Some elements were added, corresponding to added elements;
3. Some elements that were neither deleted nor added have changed, either in element content or in element style.
The three categories (element addition, element deletion, and element modification) are mutually exclusive. Content and style modifications may occur together, in no fixed order, but an added element cannot also be deleted or modified, and a deleted element cannot also be added or modified. Deleted elements are marked on the page screenshot taken at time t1; added elements and modified elements are marked on the page screenshot taken at time t2.
Step S370: Output the set of all differing elements.
The set of all changed elements is returned. Through the steps above, the specific data structures recorded at different times are compared using the LCS algorithm to determine the differences between the page data of the same web page loaded at different times.
III. Difference Display
The snapshot storage and snapshot comparison stages described above yield the differences in structure, style, and content of a web page between historical points in time. To present these differences, each one is marked on the page screenshots taken at the different times according to its type and its position on the page. Because the stored snapshot data records the placement information (coordinates, width, and height) of every element together with the page screenshot taken at that time, the page screenshots from the two points in time can be stitched together and the three kinds of difference (added elements, deleted elements, and modified elements) marked on the stitched image. Specifically, the three kinds of difference can be marked in different colors, or in some other way, for example with a marker box such as a dashed box.
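The placement of marks on the stitched image can be sketched as follows. This is an illustration under stated assumptions, not the patent's code: it assumes the two screenshots are stitched side by side with t1 on the left, the `Box` type and `place_marks` function are hypothetical names, and the color table is an arbitrary choice, since the text only says that different colors may be used.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Placement information recorded for an element (coordinates, width, height)."""
    x: int
    y: int
    width: int
    height: int


# Assumed color scheme; any three distinguishable styles would do.
COLORS = {"deleted": "red", "added": "green", "modified": "orange"}


def place_marks(diffs, left_width):
    """Map (kind, Box) differences onto a side-by-side image of the t1 and t2 screenshots."""
    marks = []
    for kind, box in diffs:
        if kind not in COLORS:
            raise ValueError(f"unknown difference kind: {kind}")
        # Deleted elements exist only at t1 (left half); added and modified
        # elements are drawn on the t2 screenshot (right half).
        x_offset = 0 if kind == "deleted" else left_width
        marks.append((COLORS[kind], box._replace(x=box.x + x_offset)))
    return marks
```

The returned list of color and rectangle pairs could then be handed to any drawing library to render the boxes on the stitched image.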
FIGS. 4a to 4d show difference display results according to an embodiment of the present invention. In each figure, the left side corresponds to the page screenshot at time t1 and the right side to the page screenshot at time t2. FIG. 4a shows the result for an added element: a dashed box on the right marks the added element, "i am new here". FIG. 4b shows the result for a deleted element: a dashed box on the left marks the deleted element, "你好". FIG. 4c shows the result for a modified element style: a dashed box on the right marks the element whose style changed, namely "hello world", whose font was italicized and underlined. FIG. 4d shows the result for modified element text: a dashed box on the right marks the changed text, which was originally "百度" and "新浪" and became "百-度" and "新-浪".
The above describes in detail the web page change monitoring method provided by the embodiments of the present invention. Storing a web page snapshot at a given time produces a page screenshot image file (PNG or JPEG) and a JSON file (the data structure that records element information), and both files are saved to the computer's hard disk. When snapshots are compared, the JSON files saved at different times are compared, and any differences found are marked on the two screenshots.
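A minimal sketch of this storage step is shown below. The file naming scheme (`<name>.json` next to `<name>.png`) and the function names are assumptions made for illustration; the text only states that one image file and one JSON file are written per snapshot.

```python
import json
from pathlib import Path


def save_snapshot(directory, name, elements, screenshot_bytes):
    """Write <name>.json and <name>.png side by side (assumed naming scheme)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    json_path = directory / f"{name}.json"
    png_path = directory / f"{name}.png"
    json_path.write_text(
        json.dumps(elements, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    png_path.write_bytes(screenshot_bytes)
    return json_path, png_path


def load_elements(json_path):
    """Read back the element records of one snapshot."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


def snapshots_differ(json_path_t1, json_path_t2):
    """Cheap whole-file check to run before a detailed element-level comparison."""
    return load_elements(json_path_t1) != load_elements(json_path_t2)
```

Comparing the parsed JSON rather than the raw file bytes keeps the check insensitive to whitespace or key-order differences between writers.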
Corresponding to the web page change monitoring method described above, the present invention provides a web page change monitoring apparatus. FIG. 5 shows the logical structure of a web page change monitoring apparatus according to an embodiment of the present invention.
As shown in FIG. 5, the web page change monitoring apparatus 500 provided by the embodiment of the present invention includes a page screenshot unit 510, a page data recording unit 520, a difference determining unit 530, and a difference marking unit 540.
The page screenshot unit 510 takes and saves screenshots of the page after the same web page is loaded at different times.
The page data recording unit 520 records the page data of the same web page after it is loaded at different times, recording the page data from each time as a corresponding specific data structure.
The difference determining unit 530 compares the specific data structures recorded at different times and determines the differences between the page data of the same web page loaded at different times.
The difference marking unit 540 marks the differences on the page screenshots taken at the different times.
FIG. 6 shows the logical structure of a specific implementation of the web page change monitoring apparatus according to an embodiment of the present invention. As shown in FIG. 6, the page data recording unit 520 includes a DOM node access module 521, an element information recording module 522, an element style concatenation module 523, and an element information storage module 524.
The DOM node access module 521 accesses the designated DOM nodes and their child nodes in the web page loaded at different times. The element information recording module 522 records the element style, element attribute information, element content, element placement information, and element tag of each DOM node. The element style concatenation module 523 concatenates the recorded element style into a string and computes a hash value of that string. The element information storage module 524 serializes each element's tag, placement information, attributes and attribute values, and hash value, and stores the result as the specific data structure, which is typically JSON.
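The work of modules 523 and 524 can be sketched as below, under several labeled assumptions: the text does not name a hash function, so SHA-1 is used here only as an example; sorting the style properties before joining is an added choice so that property order does not affect the hash; and the field names of the record (`tag`, `box`, `attrs`, `styleHash`, `text`) are hypothetical, with `text` included because the comparison stage needs the content of text elements.

```python
import hashlib
import json


def style_hash(styles):
    """Concatenate style properties into one string and hash it.

    SHA-1 and the sorted "name:value;" layout are assumptions of this sketch.
    """
    joined = ";".join(f"{name}:{styles[name]}" for name in sorted(styles))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def element_record(tag, box, attrs, styles, text=""):
    """Build one JSON-serializable record for an element (hypothetical field names)."""
    x, y, width, height = box
    return {
        "tag": tag,
        "box": {"x": x, "y": y, "width": width, "height": height},
        "attrs": attrs,
        "styleHash": style_hash(styles),
        "text": text,
    }


record = element_record(
    "span", (0, 0, 80, 20), {"class": "title"},
    {"font-style": "italic", "text-decoration": "underline"}, "hello world",
)
serialized = json.dumps(record, ensure_ascii=False)
```

Storing only the hash keeps each record small: a style change between two snapshots shows up as a mismatch of two short strings, without comparing every computed property.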
In addition, the designated nodes accessed by the DOM node access module are either all DOM nodes in the page or the DOM nodes that remain after all DOM nodes have been filtered.
In addition, the difference determining unit 530 compares the specific data structures recorded at different times using the LCS algorithm to determine the differences between the page data of the same web page loaded at different times.
Furthermore, the difference marking unit 540 marks each difference on the page screenshots taken at the different times according to the type of the difference and its position on the page.
The present invention correspondingly provides a device terminal. Referring to FIG. 7, the device terminal 700 includes a file system 710 for storing page screenshots and snapshot data, and a web page change monitoring apparatus 500. The web page change monitoring apparatus includes:
a page screenshot unit, configured to take and save screenshots of the page after the same web page is loaded at different times;
a page data recording unit, configured to record the page data of the same web page after it is loaded at different times, wherein the page data from each time is recorded as a corresponding specific data structure;
a difference determining unit, configured to compare the specific data structures recorded at different times and determine the differences between the page data of the same web page loaded at different times;
a difference marking unit, configured to mark the differences on the page screenshots taken at the different times.
The web page change monitoring apparatus has the structure described in FIG. 6; see the description above for details, which are not repeated here.
Corresponding to the web page change monitoring method described above, the present invention provides a computer-readable medium having processor-executable program code, the program code causing the processor to perform the following steps:
recording the page data of the same web page after it is loaded at different times, and taking and saving screenshots of the page after the same web page is loaded at different times, wherein the page data from each time is recorded as a corresponding specific data structure;
determining the differences between the page data of the same web page loaded at different times by comparing the specific data structures recorded at different times;
marking the differences on the page screenshots taken at the different times.
The above describes in detail the web page change monitoring method and apparatus provided by the present invention. Screenshots are taken of the web page content at different points in time, and the element information of the web page at those points is recorded as a specific data structure; that is, snapshot data of the same web page is recorded at different points in time. The snapshot data from any two points in time is then compared to find the differing parts, and those parts are marked on the screenshots from the two points in time, so that changes to the same web page can be monitored and compared accurately.
In addition, the method according to the present invention may be implemented as a computer program executed by a processor (such as a CPU) in a terminal and stored in the memory of the terminal. When the computer program is executed by the processor, it performs the functions defined in the method of the present invention.
In addition, the method steps and system units described above may be implemented using a controller and a computer-readable storage device that stores a computer program causing the controller to implement those steps or unit functions.
Those skilled in the art will also appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with this disclosure may be implemented as electronic hardware, computer software, or a combination of both. To illustrate this interchangeability of hardware and software clearly, the functions of the various illustrative components, blocks, modules, circuits, and steps have been described in general terms. Whether such functionality is implemented as software or as hardware depends on the particular application and on the design constraints imposed on the overall system. Those skilled in the art may implement the described functions in various ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present invention.
Although the foregoing disclosure shows exemplary embodiments of the present invention, it should be noted that various changes and modifications may be made without departing from the scope of the invention as defined by the claims. The functions, steps, and/or actions of the method claims according to the embodiments described herein need not be performed in any particular order. Furthermore, although elements of the invention may be described or claimed in the singular, the plural is also contemplated unless limitation to the singular is explicitly stated.
Although the embodiments of the present invention have been described above with reference to the drawings, those skilled in the art should understand that various improvements may be made to those embodiments without departing from the content of the present invention. The scope of protection of the present invention should therefore be determined by the content of the appended claims.

Claims (11)

  1. A web page change monitoring method, characterized by comprising:
    recording the page data of the same web page after it is loaded at different times, and taking and saving screenshots of the page after the same web page is loaded at different times, wherein the page data from each time is recorded as a corresponding specific data structure;
    determining the differences between the page data of the same web page loaded at different times by comparing the specific data structures recorded at different times;
    marking the differences on the page screenshots taken at the different times.
  2. The web page change monitoring method according to claim 1, characterized in that recording the page data of the same web page loaded at different times as a corresponding specific data structure comprises:
    accessing the designated DOM nodes and their child nodes in the page loaded at different times, recording the element tag, element placement information, element style, and element attributes and attribute values of the elements in each node, concatenating the element style into a string, and computing a hash value of the string;
    serializing the element tag, element placement information, element attributes and attribute values of the element, and the hash value, and storing the result as the specific data structure.
  3. The web page change monitoring method according to claim 2, characterized in that the designated nodes are all DOM nodes in the page or the DOM nodes that remain after all DOM nodes have been filtered.
  4. The web page change monitoring method according to claim 1, characterized in that determining the differences between the page data of the same web page loaded at different times by comparing the specific data structures recorded at different times comprises:
    comparing the specific data structures recorded at different times using the LCS algorithm to determine the differences between the page data of the same web page loaded at different times.
  5. The web page change monitoring method according to claim 1, characterized in that marking the differences on the page screenshots taken at the different times comprises:
    marking each difference on the page screenshots taken at the different times according to the type of the difference and the position of the difference on the page.
  6. A web page change monitoring apparatus, characterized by comprising:
    a page data recording unit, configured to record the page data of the same web page after it is loaded at different times, wherein the page data from each time is recorded as a corresponding specific data structure;
    a page screenshot unit, configured to take and save screenshots of the page after the same web page is loaded at different times;
    a difference determining unit, configured to compare the specific data structures recorded at different times and determine the differences between the page data of the same web page loaded at different times;
    a difference marking unit, configured to mark the differences on the page screenshots taken at the different times.
  7. The web page change monitoring apparatus according to claim 6, characterized in that
    the page data recording unit comprises:
    a DOM node access module, configured to access the designated DOM nodes and their child nodes in the page loaded at different times;
    an element information recording module, configured to record the element style, element attribute information, element content, element placement information, and element tag in the designated DOM nodes;
    an element style concatenation module, configured to concatenate the element style into a string and compute a hash value of the string;
    an element information storage module, configured to serialize the element tag, element placement information, element attributes and attribute values of the element, and the hash value, and store the result as the specific data structure.
  8. The web page change monitoring apparatus according to claim 7, characterized in that the designated nodes accessed by the DOM node access module are all DOM nodes in the page or the DOM nodes that remain after all DOM nodes have been filtered.
  9. The web page change monitoring apparatus according to claim 6, characterized in that the difference determining unit compares the specific data structures recorded at different times using the LCS algorithm to determine the differences between the page data of the same web page loaded at different times.
  10. The web page change monitoring apparatus according to claim 6, characterized in that the difference marking unit marks each difference on the page screenshots taken at the different times according to the type of the difference and the position of the difference on the page.
  11. A computer-readable medium having processor-executable program code, characterized in that the program code causes a processor to perform the following steps:
    recording the page data of the same web page after it is loaded at different times, and taking and saving screenshots of the page after the same web page is loaded at different times, wherein the page data from each time is recorded as a corresponding specific data structure;
    determining the differences between the page data of the same web page loaded at different times by comparing the specific data structures recorded at different times;
    marking the differences on the page screenshots taken at the different times.
PCT/CN2015/090969 2014-11-17 2015-09-28 Method and device for monitoring web page changes WO2016078479A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410652444.3 2014-11-17
CN201410652444.3A CN105630843B (en) 2014-11-17 2014-11-17 Web evolution monitoring method and device

Publications (1)

Publication Number Publication Date
WO2016078479A1 true WO2016078479A1 (en) 2016-05-26

Family

ID=56013260

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/090969 WO2016078479A1 (en) 2014-11-17 2015-09-28 Method and device for monitoring web page changes

Country Status (2)

Country Link
CN (1) CN105630843B (en)
WO (1) WO2016078479A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446118A (en) * 2016-09-19 2017-02-22 中国南方电网有限责任公司信息中心 Method for automatically generating page change template
CN107870914B (en) * 2016-09-23 2020-07-31 北京京东尚科信息技术有限公司 Method and device for preventing page from being tampered
CN108073828B (en) * 2016-11-16 2022-02-18 阿里巴巴集团控股有限公司 Webpage tamper-proofing method, device and system
CN108335164A (en) * 2017-01-20 2018-07-27 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment for realizing shopping at network
CN106960058B (en) * 2017-04-05 2021-01-12 金电联行(北京)信息技术有限公司 Webpage structure change detection method and system
CN108880921B (en) * 2017-05-11 2021-07-02 腾讯科技(北京)有限公司 Webpage monitoring method and device, storage medium and server
CN108595304B (en) * 2018-04-19 2022-12-27 腾讯科技(深圳)有限公司 Webpage monitoring method and device
CN110865843B (en) * 2018-08-09 2024-03-26 阿里巴巴集团控股有限公司 Page backtracking, information backup and problem solving method, system and equipment
CN109408780A (en) * 2018-09-07 2019-03-01 山东中磁视讯股份有限公司 A kind of method that Excel file is converted to JSON file
CN111898047B (en) * 2018-10-31 2024-03-29 创新先进技术有限公司 Method and device for conducting blockchain certification on webpage through webpage monitoring
CN109299352B (en) * 2018-11-14 2022-02-01 百度在线网络技术(北京)有限公司 Method and device for updating website data in search engine and search engine
CN110046072A (en) * 2019-03-13 2019-07-23 平安城市建设科技(深圳)有限公司 Monitoring method, device, terminal and the readable storage medium storing program for executing of the page
CN109978626A (en) * 2019-03-29 2019-07-05 上海幻电信息科技有限公司 Web advertisement change monitoring method, apparatus and storage medium
CN111581672A (en) * 2020-05-14 2020-08-25 杭州安恒信息技术股份有限公司 Method, system, computer device and readable storage medium for webpage tampering detection
CN112035315A (en) * 2020-07-31 2020-12-04 重庆锐云科技有限公司 Webpage data monitoring method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101207524A (en) * 2006-12-22 2008-06-25 上海亿动信息技术有限公司 Method and system for supervising broadcast of web advertisement
CN101782914A (en) * 2009-06-23 2010-07-21 北京搜狗科技发展有限公司 Method and system for prompting web page information
US20120173966A1 (en) * 2006-06-30 2012-07-05 Tea Leaf Technology, Inc. Method and apparatus for intelligent capture of document object model events
CN103885960A (en) * 2012-12-20 2014-06-25 上海明想电子科技有限公司 Method for monitoring webpage change

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1435782A (en) * 2002-01-31 2003-08-13 百度在线网络技术(北京)有限公司 Method for recording and analysis of information over network by snap shot mode
CN103246678B (en) * 2012-02-13 2018-04-27 深圳市世纪光速信息技术有限公司 A kind of web page content preview method and apparatus
CN103544213B (en) * 2013-09-16 2016-10-12 青岛英网资讯股份有限公司 Web site contents updates method of determination and evaluation and system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309461A (en) * 2019-07-04 2019-10-08 郑州悉知信息科技股份有限公司 Webpage representation method and apparatus
CN110309461B (en) * 2019-07-04 2023-10-27 郑州悉知信息科技股份有限公司 Page display method and device
CN110795676A (en) * 2019-10-31 2020-02-14 北京知道创宇信息技术股份有限公司 Website monitoring method and device, electronic equipment and storage medium
CN111061633A (en) * 2019-12-05 2020-04-24 北京达佳互联信息技术有限公司 Method, device, terminal and medium for detecting first screen time of webpage
CN111061633B (en) * 2019-12-05 2024-04-30 北京达佳互联信息技术有限公司 Webpage first screen time detection method, device, terminal and medium
CN111538658A (en) * 2020-04-20 2020-08-14 卓望数码技术(深圳)有限公司 Automatic testing method for interface loading duration
WO2022018492A1 (en) * 2020-07-22 2022-01-27 Content Square SAS System and method for detecting changes in webpages and generating metric correlations therefrom
US11561962B2 (en) 2020-07-22 2023-01-24 Content Square SAS System and method for detecting changes in webpages and generating metric correlations therefrom
CN113987318A (en) * 2021-11-01 2022-01-28 盐城金堤科技有限公司 Page monitoring method, device, equipment and computer storage medium
CN113987318B (en) * 2021-11-01 2024-03-12 盐城天眼察微科技有限公司 Page monitoring method, device, equipment and computer storage medium
CN115544969A (en) * 2022-11-29 2022-12-30 明度智云(浙江)科技有限公司 Page comparison method, equipment and medium based on hypertext markup language
CN115544969B (en) * 2022-11-29 2023-03-21 明度智云(浙江)科技有限公司 Page comparison method, equipment and medium based on hypertext markup language

Also Published As

Publication number Publication date
CN105630843A (en) 2016-06-01
CN105630843B (en) 2019-04-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15860591

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15860591

Country of ref document: EP

Kind code of ref document: A1