CN111079043A - Key content positioning method - Google Patents



Publication number
CN111079043A
Authority
CN
China
Prior art keywords
content
content block
page
web page
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911236209.7A
Other languages
Chinese (zh)
Other versions
CN111079043B (en)
Inventor
易超
徐经纬
张舒汇
贺赞贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shulide Technology Co Ltd
Original Assignee
Beijing Shulide Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shulide Technology Co Ltd
Priority to CN201911236209.7A
Publication of CN111079043A
Application granted
Publication of CN111079043B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566: URL specific, e.g. using aliases, detecting broken or misspelled links
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302: Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Document Processing Apparatus (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a key content positioning method comprising a Web page dynamic evolution monitoring stage and a key content positioning stage. In the monitoring stage, the Web page to be monitored is acquired periodically; the node corresponding to the key content specified in the preset monitoring configuration information is located in the Document Object Model (DOM) tree of the Web page, and whether the Web page has changed is determined from that node. In the key content positioning stage, the Web page content acquired in the monitoring stage is stored as historical data of the old version page for use in subsequent positioning; features of the old version page's key content, such as text and structure, are extracted and fused, and the content block and then the key content are located in turn in the new version page. Finally, the mapping of the key content between the new and old version pages and the change in the element positioning mode are presented visually, assisting the developer in repairing the existing system integration scheme.

Description

Key content positioning method
Technical Field
The invention relates to the technical field of Web application, in particular to a key content positioning method.
Background
The page structure of a Web application often changes dynamically because of upgrades, bug fixes, user experience optimization, and so on. Such changes can affect the interfacing between associated systems. For example, when two systems interface directly through embedded pages, a change in one system can make the other unavailable; likewise, in public opinion monitoring based on data collection, a page structure change may prevent the correct public opinion content from being acquired.
Demand for interfacing between Web applications is growing, and integration at the presentation layer of Web applications is currently a comparatively efficient approach because it is low-cost and non-intrusive. However, Web applications often evolve dynamically, and the resulting page structure changes may render existing integration schemes ineffective. Moreover, because Web application changes are unpredictable, there is no effective early warning for them, and they cannot be detected in time. How to discover Web application page changes actively and promptly, relocate the key content after a change, and assist the developer in making repairs so that the system keeps operating correctly has therefore become a problem that must be considered when integrating Web applications.
This problem divides naturally into two steps. The first is change monitoring: the Web page of interest is checked periodically to determine whether its page structure has changed. The second is content positioning: after a change is detected, the required key content is located in the new version page according to the features of the key content. These two steps face the following challenges: 1) Web pages change frequently and in varied ways, and some Web pages can be reached only after a series of preliminary operations such as login and clicks, which makes change detection difficult; 2) the features of the key content are difficult to obtain directly from the HTML code of the Web page and require some inference and computation; 3) the key content is generally text data in the Web page, which usually corresponds to leaf nodes of the DOM tree, so relatively few features can be extracted, which makes positioning the key content challenging.
Disclosure of Invention
The present invention provides a method for locating key content to overcome the above technical problems.
In order to solve the above problems, the present invention discloses a method for positioning key content, comprising:
periodically acquiring a Web page to be monitored;
positioning a corresponding node in a Document Object Model (DOM) tree of the Web page according to key content in preset monitoring configuration information;
when the corresponding node cannot be positioned, determining that the Web page changes;
when the corresponding node is positioned, obtaining a current content block containing key content in the monitoring configuration information from the Web page, comparing the title of the current content block with the title of an initial content block obtained from the Web page in an initialization monitoring task, and determining whether the Web page is changed;
storing the changed Web page as historical version page data;
acquiring all historical content blocks of the historical version page data, and extracting structural features and text features of each historical content block;
integrating the structural features and text features of all historical content blocks, sequentially positioning the historical target content blocks and the key contents in the historical target content blocks, and determining the first features of the historical target content blocks;
aiming at a newly monitored changed new version Web page, acquiring all subtrees of a DOM tree of the new version Web page, taking each subtree as a content block to be matched, and respectively extracting the characteristics of each content block to be matched;
traversing all the content blocks to be matched, performing similarity calculation on the characteristics of each content block to be matched and the first characteristics of the historical target content blocks, and positioning the current target content block;
extracting second characteristics of key contents in the historical target content block and characteristics of the key contents in the current target content block;
and establishing a mapping of the key content between the new and old version pages according to the edit distance between the DOM subtrees corresponding to the second features of the key content in the historical target content block and to the features of the key content in the current target content block, and positioning the final key content in the current target content block.
Compared with the prior art, the invention has the following advantages:
The invention provides a key content positioning method comprising a Web page dynamic evolution monitoring stage and a key content positioning stage. For the monitoring stage, it provides a Web page change monitoring method that is based on the DOM tree structure and combines login state keeping, title identification, and semantic and structural similarity: a developer or operation and maintenance personnel registers a page change monitoring task by configuring the URL of the Web page to be monitored and the key content to be acquired, and starts monitoring the page. For the key content positioning stage, it provides a progressive page key content positioning technique that fuses multi-modal features: by extracting and fusing features such as the text and structure of the key content in the old version page, the content block and then the key content are located in turn in the new version page. Finally, the mapping of the key content between the new and old version pages and the change in the element positioning mode are presented visually, assisting the developer in repairing the existing system integration scheme.
Drawings
FIG. 1 is a flow chart illustrating the steps of a method for locating key content according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of one example of a key content to content block relationship;
FIG. 3a is a schematic illustration of an old version page of a Web application;
FIG. 3b is a schematic illustration of a page of a new version after a change of a certain Web application;
FIG. 3c is a diagram of a DOM tree of an old version page of a Web application;
FIG. 3d is a diagram of a DOM tree of a page of a new version of a Web application;
FIG. 4 is an exemplary diagram of nuances between content block list items;
FIG. 5 is a schematic diagram of the overall architecture of the Web page change monitoring system;
FIG. 6a is a schematic diagram of an example of a Chinese informed netpage p_1;
FIG. 6b is a schematic diagram of an example of the Chinese Notification Page p_7;
FIG. 6c is a schematic diagram of an example of the Chinese Notification Page p_17;
FIG. 7a is a schematic diagram comparing the structures of the Chinese knowledge homepage p_1 and p_7;
FIG. 7b is a schematic diagram comparing the structures of the Chinese knowledge homepage p_1 and p_17;
FIG. 7c is a schematic diagram of a Chinese Notification Web login and home page interface;
FIG. 7d is a schematic diagram of the operation log of the Chinese Hopkinson network system;
FIG. 7e shows the key content positioning result.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Because Web applications are complex and diverse, the invention faces the following three challenges in monitoring the dynamic evolution of Web pages and positioning key content:
(1) Complexity of the Web application itself
One aspect is the complexity of the application itself. A Web application typically includes multiple pages, some of which cannot be accessed directly via a URL and require a series of preliminary operations, such as login and multiple clicks. For example, to reach a page that needs change monitoring on a government service website, a social credit code must first be entered on the first page and the login button clicked, and then a "new" button must be clicked on the second page before the target page is finally reached. This makes the dynamic evolution monitoring of the page difficult.
The other aspect is the complexity of Web page changes. Web pages change frequently and in varied ways; among these changes, the present invention is concerned with changes in the DOM tree structure or layout of the page. Structural changes can further be divided into two types according to whether they break the interfacing between Web applications: if a page structure change causes the existing Web application interfacing to fail, it is called a relevant change; otherwise, it is an irrelevant change. The present invention is concerned only with relevant changes. Changes in the DOM tree structure of a Web page are usually easy to observe, but it is difficult to distinguish relevant from irrelevant changes on that basis, since both alter the structure of the page. How to distinguish relevant changes from irrelevant ones accurately is therefore a problem to be solved in page dynamic evolution monitoring.
(2) Concealment of features
To locate the key content in the changed page, the invention needs to find features of the key content that remain consistent before and after the change. For a Web page, the most directly accessible resource is usually its HTML code. However, HTML code relates only to how the page is displayed and may differ greatly between the new and old versions of the page, so information that can be read directly from it, such as tags and attributes, generally does not meet this requirement. On the other hand, key content usually carries certain semantic information, such as a title or suggestive words, and has a certain presentation form, for example a combination of elements of various types (pictures, text, hyperlinks, and so on) arranged according to some rule. Such information is more likely to remain consistent between the old and new versions of the page and therefore meets the requirement. However, it generally cannot be obtained directly and requires some computation, which makes acquiring the features difficult.
(3) Sparsity of features
Generally, the key content to be acquired is text displayed in the page, such as the title and author of an article. Such text usually corresponds to leaf nodes in the page's DOM tree, whose structure is simple and which carry relatively little information, so the prior art cannot extract enough features from it, and the key content is difficult to locate in a changed page.
In view of the above challenges, FIG. 1 shows a flowchart of the steps of a key content positioning method according to an embodiment of the present invention, which may specifically include the following steps:
step S101, periodically acquiring a Web page to be monitored;
to acquire the Web page periodically, the embodiment of the present invention first establishes a monitoring task for the Web page, which specifically includes the following steps:
step 1: receiving the monitoring configuration information, input by a user, for the Web page to be monitored;
step 2: generating, according to the monitoring configuration information, a monitoring task that monitors page changes of the Web page to be monitored.
In various embodiments of the present invention, the monitoring configuration information may be configured by operation and maintenance personnel and includes the URL of the Web page to be monitored. If the monitored Web application requires login, a login operation is also performed at this point to record the login state information.
Each Web page has a corresponding data interface in the system where it resides, and in that system each data interface has its own configuration page. The configuration for page change monitoring in the embodiment of the present invention can therefore be added to the original configuration page as an extension; it may also be placed on a separate configuration page. With the extension approach, a source page monitoring switch on the configuration page turns change monitoring of the source page corresponding to the API on or off. When the function is turned on, the configuration items for page change monitoring appear, including the URL of the source page, the XPath of the key content in the page, and the monitoring frequency (which determines the period used by the present invention). The pre-login configuration implemented by the embodiment of the invention comprises the URL of the system login interface and a pre-login button; the user only needs to configure the URL of the login interface and click the pre-login button. After the configuration is completed, clicking the save button at the upper right completes the monitoring configuration of the Web page.
In step 2, generating the monitoring task means modeling the user's configuration information as a corresponding monitoring task object; the monitoring task is modeled with a MonitorTask class. After the monitoring task is generated, it must be stored so that it can be read later. MongoDB can be used as the storage scheme for monitoring tasks, with all tasks stored in one Collection. MongoDB is a NoSQL database without schema restrictions, so the data structure can be adjusted conveniently. After the task is stored, it must be scheduled and executed periodically according to the configuration. For example, timed scheduling of the monitoring task can be implemented with the scheduled-task framework provided by SpringBoot.
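As an illustration of how the configuration items might be modeled as a task object, the following is a minimal Python sketch. Only the class name MonitorTask comes from the description; every field name, the to_document method, and the is_due helper are hypothetical, and no real database or scheduler is used.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class MonitorTask:
    """One page-change monitoring task built from the user's configuration.

    All field names are illustrative assumptions, not taken from the patent.
    """
    url: str                       # URL of the source page to monitor
    key_content_xpaths: List[str]  # XPath of each piece of key content
    interval_seconds: int          # monitoring frequency (period between runs)
    login_url: str = ""            # pre-login URL; empty when no login is needed
    cookies: Dict[str, str] = field(default_factory=dict)  # recorded login state

    def to_document(self) -> dict:
        """Return a plain dict that a document store could persist."""
        return asdict(self)


def is_due(task: MonitorTask, last_run: float, now: float) -> bool:
    """Return True when at least one full period has elapsed since the last run."""
    return now - last_run >= task.interval_seconds
```

A scheduler would call is_due on each stored task and execute the ones that return True; the plain-dict form is what a schema-free store would receive.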
In a specific implementation, the monitoring task is first executed once to obtain the initial Web page content, which serves as the basis for subsequently judging whether the page has changed. Thereafter, the task is executed periodically under the scheduling of the monitoring method. Specifically, initializing the monitoring task includes:
step 3: acquiring initial page information of the Web page;
step 4: obtaining, according to the key content in the preset monitoring configuration information, an initial content block in the initial page information that contains the key content. Based on observation and analysis of Web application pages, the inventor found that Web pages usually present different content in blocks. Accordingly, in the embodiments of the present invention each such block is called a content block. The content within one content block generally shares the same semantic context and expresses similar semantics, and the key content to be acquired from a page is mostly concentrated in one or a few content blocks. Introducing content blocks allows the embodiment of the invention, when monitoring the dynamic evolution of a Web page, to combine the structure of the content block with its semantic information (such as the content block title) and so judge page changes more accurately. FIG. 2 is an example of the relationship between key content and content blocks.
The page in step 3 is obtained mainly through an HTTP request, which can be sent using the RestTemplate provided by SpringBoot and the Apache HTTP library.
During actual monitoring, some Web applications restrict user access to a certain extent, and certain pages can be accessed only after login. For such pages hidden behind a login operation, if login is skipped and the content is requested directly through the URL of the page, the access control mechanism of the system redirects to the login page, so the required page content cannot be acquired and it cannot be checked whether the page has changed. How to handle scenarios that require login is therefore a problem the method must consider. To address it, in a preferred embodiment of the present invention, the monitoring configuration information includes the user's login information and the Cookie information used to verify that login information.
Step 1 then further comprises the following steps:
receiving a login operation of the user on the Web page, and acquiring the login information of the user;
sending the login information to the server corresponding to the Web page;
receiving the Cookie information that the server returns for the login information and that is used to verify the login information;
in this way, for a Web page that can only be accessed after logging in, step S101 may further include: periodically sending the Cookie information, in the form of an HTTP request header, together with the request for the Web page to the server corresponding to the Web page; and receiving the Web page returned by the server in response to the request. The login state can thus be maintained through heartbeat-based periodic Session refreshing, so that the Web page can be acquired as reliably as possible and the subsequent processing of the Web page can proceed.
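Sending the saved Cookie in an HTTP request header can be sketched with Python's standard library as follows. This is only an illustration under stated assumptions: the function names, the example cookie names, and the use of urllib instead of the RestTemplate mentioned in the description are all choices made here, and fetch_page performs real network I/O when called.

```python
import urllib.request
from typing import Dict


def build_page_request(url: str, cookies: Dict[str, str]) -> urllib.request.Request:
    """Build a GET request that carries the saved login Cookie in its header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    # Only attach the header when there is login state to send.
    headers = {"Cookie": cookie_header} if cookie_header else {}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_page(url: str, cookies: Dict[str, str], timeout: float = 10.0) -> str:
    """Send the request and return the page body as text (network I/O)."""
    request = build_page_request(url, cookies)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_page on every monitoring period doubles as the heartbeat: each request presents the Cookie again, which gives the server the chance to keep the Session alive.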
Step S102, positioning a corresponding node in a Document Object Model (DOM) tree of the Web page according to key content in preset monitoring configuration information;
First, all subtrees of the DOM tree of the new version page acquired by the current monitoring task are obtained, each subtree is taken as a content block to be matched, and the features of each are extracted. Then, according to the key content in the preset monitoring configuration information, similarity matching is performed over the set of content blocks in the new version page, thereby locating the content block that contains the key content. That content corresponds to the node referred to in step S102. Step S102 relies on existing tree node positioning techniques. Tree node positioning, that is, locating a required node in a given tree structure, is generally done either by node attributes or by node path. XPath is path-based: the tags of all nodes on the path from the root node to the target node are listed in order, each tag is annotated with the node's position among its same-level siblings having the same tag, and the resulting segments are joined with '/' to give the XPath of the node.
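The path-based construction of an XPath described above can be sketched as follows. The sketch assumes well-formed markup parsed with Python's xml.etree.ElementTree as a stand-in for a browser DOM; the function name build_xpath is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_xpath(root: ET.Element, target: ET.Element) -> str:
    """Return the positional XPath of `target` inside the tree rooted at `root`."""
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parent_of: Dict[ET.Element, ET.Element] = {
        child: parent for parent in root.iter() for child in parent
    }
    segments: List[str] = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [sib for sib in parent if sib.tag == node.tag]
        # XPath positions are 1-based and count only siblings with the same tag.
        position = next(i for i, sib in enumerate(same_tag, start=1) if sib is node)
        segments.append(f"{node.tag}[{position}]")
        node = parent
    segments.append(root.tag)
    return "/" + "/".join(reversed(segments))
```

Because every segment is purely positional, inserting a sibling with the same tag ahead of the target shifts the index, which is one way an unchanged XPath can end up pointing at a different node.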
Step S103, when the corresponding node cannot be positioned, determining that the Web page changes;
since the key content is set based on the initial page information of the Web page, when the system cannot find the corresponding node, the page is determined to have changed.
However, if relevant change is judged only by whether the specified key content node can be located in the page, a wrong node may be located without being noticed. To solve this problem, the present invention starts from the title of the content block. The problem is explained in detail through an example.
Take the old and new version pages of a Web application shown in FIG. 3 as an example. FIG. 3a is the old version page, and suppose the key content is the entry list under "city and county dynamic". FIG. 3c is the DOM tree of that page, from which it can be seen that the XPath of the "city and county dynamic" entry list is //*[@id='newsrt1_1']/ul/li/a. FIG. 3b is the new version page after the change: two tabs, "national file" and "province file", have been added before the three existing tabs "general headline", "city and county dynamic", and "national news". FIG. 3d is the DOM tree of the changed page; if the key content were still located with the earlier XPath, the entry list of "province file" would actually be located. This is clearly a relevant change, but it cannot be detected by checking only whether the specified key content can be located in the page.
To solve this problem, the inventor observed that the titles of the two content blocks are clearly different: the title of the content block in the old version page is "city and county dynamic", whereas that of the content block located in the new version page is "province file". Therefore, if the title corresponding to a content block can be obtained, comparing the titles of the content blocks found in the two pages reveals that "city and county dynamic" and "province file" are inconsistent, and the page is judged to have undergone a relevant change. However, such title information is usually not marked explicitly in the HTML code of the page, so the embodiment of the present invention proposes the following method:
when the corresponding node is located, whether a wrong node has been located is judged as follows:
step S104, obtaining a current content block containing the key content in the Web page, comparing the title of the current content block with the title of an initial content block obtained from the Web page in the initialization monitoring task, and determining whether the Web page has changed;
specifically, the monitoring configuration information includes the HyperText Markup Language (HTML) code of the Web page and the XML Path Language (XPath) expression corresponding to the current content block;
the method for obtaining the title of the current content block comprises the following steps:
step 5: parsing the HTML code into the corresponding DOM tree;
step 6: extracting the current content block CB from the DOM tree according to the XPath corresponding to the current content block;
step 7: querying the list CBList of sibling nodes similar to CB;
step 8: obtaining the subscript i of CB in CBList;
step 9: assigning the current content block CB to a loop variable curNode and looping until the title of the current content block is found, where each iteration proceeds as follows:
in each iteration, first taking the leftmost text node (TextNode) of curNode as the candidate title node candidate, and obtaining its text content;
judging, according to the preset title features, whether the text content qualifies as the title of the current content block;
if so, searching for the list candidates of sibling nodes similar to candidate, and returning the text content of candidates[i] as the title of the current content block;
if not, taking the parent node of curNode in the DOM tree as the new curNode, thereby expanding the search range, and continuing the loop. Further, when the search range exceeds a preset stop condition, the loop exits; if the title of the current content block has not been found in the loop, an empty string is returned. The preset stop condition here is that the search range has expanded to the entire Web page.
Steps 5 to 9 are one feasible way to obtain the title of the current content block and are based on the following findings of the inventor: the layout of a Web page is typically "title + content", and by layout convention the title node of a piece of content is usually the first child of its parent node or the leftmost child of one of its ancestor nodes, most often with a tag from h1 to h6 (first- to sixth-level headings). A node whose content is "more" or similar often appears near the title node. In addition, the title usually does not exceed 10 Chinese characters and contains no punctuation or digits.
Based on these findings, the reason for finding the list of sibling nodes similar to CB is that the "title + content" presentation of a Web page takes two forms: in the first, a title is followed directly by its content; in the second, a list of titles comes first, followed by a list holding the content for each title. In the second form it is necessary to determine which item in the title list is the title of the specified content block. By finding the sibling nodes similar to the current content block, the embodiment of the present invention handles both layouts in the same way.
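The title search of steps 5 to 9 can be sketched in Python as below. Several points are assumptions made for this sketch and are not stated in the description: the page is well-formed markup parsed with xml.etree.ElementTree, "similar" siblings are taken to be siblings with the same tag, text stored in an element's tail is ignored, and the candidate itself is returned when the two sibling lists do not line up. All function names are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def looks_like_title(text: str, max_len: int = 10) -> bool:
    """Title heuristic: short, with no punctuation and no digits."""
    text = text.strip()
    if not text or len(text) > max_len:
        return False
    return not any(ch.isdigit() or unicodedata.category(ch).startswith("P") for ch in text)


def leftmost_text_element(node: ET.Element) -> Optional[ET.Element]:
    """Return the first element, in document order, that carries non-blank text."""
    for elem in node.iter():
        if elem.text and elem.text.strip():
            return elem
    return None


def similar_siblings(node: ET.Element, parents: Dict[ET.Element, ET.Element]) -> List[ET.Element]:
    """Assumption: siblings count as 'similar' when they share the node's tag."""
    parent = parents.get(node)
    return [node] if parent is None else [sib for sib in parent if sib.tag == node.tag]


def find_title(root: ET.Element, content_block: ET.Element) -> str:
    """Walk upward from the content block until a plausible title is found."""
    parents = {child: parent for parent in root.iter() for child in parent}
    cb_list = similar_siblings(content_block, parents)                 # step 7
    index = next(i for i, sib in enumerate(cb_list) if sib is content_block)  # step 8
    cur: Optional[ET.Element] = content_block                          # step 9
    while cur is not None:          # None means the whole page has been searched
        candidate = leftmost_text_element(cur)
        if candidate is not None and looks_like_title(candidate.text or ""):
            candidates = similar_siblings(candidate, parents)
            # Assumption: fall back to the candidate when the lists do not line up.
            chosen = candidates[index] if index < len(candidates) else candidate
            return (chosen.text or "").strip()
        cur = parents.get(cur)      # widen the search range to the parent node
    return ""
```

With a single heading followed by its content, the sibling lists each hold one item and the heading is returned; with a tab list followed by one panel per tab, the index of the panel selects the matching tab.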
In the embodiment of the present invention, determining whether the Web page has changed according to the comparison result may include the following cases:
A: if both titles are non-empty and equal, determining that no relevant change has occurred in the Web page;
B: if both titles are non-empty and not equal, determining that a wrong node has been located, and outputting a result indicating that a relevant change was detected;
C: if both titles are empty, calculating the semantic similarity between the current content block and the initial content block and the structural similarity of their DOM subtrees, comparing the semantic similarity with a preset semantic threshold and the structural similarity with a preset structural threshold, and determining whether the page has changed according to the comparison results.
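The three cases can be summarized in a small decision function. This Python sketch takes the two similarity scores as precomputed inputs. Three points are assumptions made here because the description leaves them open: the default threshold values, reporting a change in case C when either similarity falls below its threshold, and treating a title that is present on one side only as a change.

```python
def detect_relevant_change(
    initial_title: str,
    current_title: str,
    semantic_similarity: float,
    structural_similarity: float,
    semantic_threshold: float = 0.5,    # hypothetical default
    structural_threshold: float = 0.5,  # hypothetical default
) -> bool:
    """Return True when a relevant change of the Web page is detected."""
    if initial_title and current_title:
        # Case A (equal titles, no change) and case B (different titles, change).
        return initial_title != current_title
    if not initial_title and not current_title:
        # Case C: no titles available, so fall back to the two similarity scores.
        # Assumption: either score dropping below its threshold counts as a change.
        return (semantic_similarity < semantic_threshold
                or structural_similarity < structural_threshold)
    # Assumption: a title that appeared or disappeared is treated as a change.
    return True
```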
In actual monitoring, when a content block without a title is encountered, it cannot be determined whether an error node is located, and it cannot be further determined whether the Web page is changed. Therefore, the embodiment of the invention provides a judgment method as in the case C, and the judgment is carried out by combining the semantic similarity and the structural similarity of the text of the content block.
In a specific implementation, the step of calculating a semantic similarity between the current content block and the initial content block, comparing the semantic similarity with a preset semantic threshold, and determining whether the Web page changes according to a comparison result may include:
c1: respectively extracting all text information in the current content block and the initial content block;
c2: calculating the similarity between all text information of the two content blocks;
c3: comparing the semantic similarity with a preset semantic threshold;
c4: when the semantic similarity is lower than the preset semantic threshold, determining that the Web page has changed, and outputting a result indicating that a relevant change is detected.
There are many mature methods for calculating semantic similarity of texts in content blocks, for example, the semantic similarity can be calculated by using some trained corpus models, which are not described in detail herein. It should be noted that, according to different settings of the preset semantic threshold, the determination result of the change of the Web page is different, and C4 is an example of the present invention. That is, in another possible implementation case, C4 may also be: and when the semantic similarity is higher than a preset semantic threshold value, determining that the Web page changes.
In case C, in an embodiment of the present invention, the structural similarity of the content blocks may be calculated based on the edit distance or the alignment distance between DOM subtrees. The structural similarity is compared with a preset structural threshold, and whether the Web page has changed is determined from the comparison result. As with the semantic threshold, the determination depends on how the preset structural threshold is set; reference may be made to the determination in C4.
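The decision among cases a, b and c can be sketched as follows. This is only an illustrative sketch: the function name `page_changed`, the default threshold values, and the choice to report a change when either similarity falls below its threshold are assumptions, since the text leaves the thresholds and their combination open. The handling of a title that is empty on only one side is likewise not specified in the text and is treated as a change here.

```python
from typing import Optional


def page_changed(initial_title: Optional[str], current_title: Optional[str],
                 semantic_sim: float, structural_sim: float,
                 semantic_threshold: float = 0.8,
                 structural_threshold: float = 0.8) -> bool:
    """Return True when a relevant change should be reported."""
    if initial_title and current_title:
        # Cases a and b: both titles are present, so equal titles mean
        # no relevant change and unequal titles mean a wrong node.
        return initial_title != current_title
    if not initial_title and not current_title:
        # Case c: no titles, so fall back to the two similarity checks
        # (assumed rule: either one below its threshold signals a change).
        return (semantic_sim < semantic_threshold
                or structural_sim < structural_threshold)
    # Only one title is empty: not covered by the text, assumed to be a change.
    return True
```

Under these assumptions, two untitled blocks with similarities 0.9 and 0.9 are reported as unchanged, while lowering either value below 0.8 reports a change.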
The edit distance of the tree is derived from the edit distance of the sequence, that is, one tree is changed into another tree through operations of adding, deleting, changing and the like. The smaller the edit distance, the higher the similarity between trees.
Formally, for a rooted tree T, if each node of T is assigned a symbol from a finite alphabet Σ, the tree is referred to as a labeled tree. Further, if a left-to-right order is given among the siblings of each node in T, the tree is called an ordered tree. The operations performed on a labeled tree are defined as follows:
relabel: modify the label of a node in T;
delete: delete a non-root node and make the children of the deleted node children of its parent node;
insert: the inverse of delete, that is, insert a node as a child of an existing node, making the inserted node the parent of a consecutive subsequence of that node's children.
The edit distance of trees can be subdivided into different subproblems according to the constraints placed on the operations: the first is the ordinary tree edit distance, which places no restriction on the operations; the second is the tree alignment distance, in which all insert operations must precede the delete operations.
Let T be a rooted tree, let T(v) denote the subtree of T rooted at node v, and let θ denote the empty tree. A set of trees is a forest, denoted F; F is an ordered forest if the order of its trees is given, and F(v) denotes the forest consisting of the subtrees of node v. The labels of the nodes in T are drawn from a finite alphabet Σ; $\lambda \notin \Sigma$ is a special blank symbol, and $\Sigma_\lambda = \Sigma \cup \{\lambda\}$. $\gamma : (\Sigma_\lambda \times \Sigma_\lambda) \setminus \{(\lambda, \lambda)\} \to \mathbb{R}$ is a distance function on pairs of labels that satisfies the triangle inequality.
If each of the above operations on a tree is assigned a cost (overhead), the problem is to find a sequence of operations that transforms one tree into another at minimum cost. Only the basic algorithm is described below as an example; the improved algorithms follow a similar idea.
Formally, let $(l_1 \to l_2)$ denote an edit operation on a tree, where $(l_1, l_2) \in (\Sigma_\lambda \times \Sigma_\lambda) \setminus \{(\lambda, \lambda)\}$. Then $l_2 = \lambda$ denotes the deletion of a node, $l_1 = \lambda$ denotes the insertion of a node, and otherwise the operation is a relabel. The cost (overhead) of a single edit operation is $\gamma(l_1 \to l_2) = \gamma(l_1, l_2)$, and the cost of an entire edit sequence $S = s_1, \dots, s_k$ is
$$\gamma(S) = \sum_{i=1}^{k} \gamma(s_i).$$
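Thus, the edit distance $\delta(T_1, T_2)$ between trees can be defined as $\delta(T_1, T_2) = \min\{\gamma(S) \mid S \text{ is a sequence of edit operations converting } T_1 \text{ into } T_2\}$. This definition extends readily to forests: $\delta(F_1, F_2)$ denotes the edit distance between forests $F_1$ and $F_2$. In this setting, the root node of any tree may be deleted, and several trees may be merged by adding a new root node. Let $F - v$ denote the forest obtained by deleting node v from forest F, and $F - T(v)$ the forest obtained by deleting the subtree rooted at v from F. Following the idea of dynamic programming, with v and w the rightmost roots of $F_1$ and $F_2$ respectively, the following recursions can be derived:
$$\delta(\theta, \theta) = 0$$
$$\delta(F_1, \theta) = \delta(F_1 - v, \theta) + \gamma(v \to \lambda)$$
$$\delta(\theta, F_2) = \delta(\theta, F_2 - w) + \gamma(\lambda \to w)$$
$$\delta(F_1, F_2) = \min \begin{cases} \delta(F_1 - v, F_2) + \gamma(v \to \lambda) \\ \delta(F_1, F_2 - w) + \gamma(\lambda \to w) \\ \delta(F_1(v), F_2(w)) + \delta(F_1 - T_1(v), F_2 - T_2(w)) + \gamma(v \to w) \end{cases}$$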
From this, $\delta(F_1, F_2)$ can be calculated, and thus $\delta(T_1, T_2)$. The algorithm above is given only to illustrate the idea; its complexity is $O(|F_1|^2 |F_2|^2)$. The complexity of algorithms for this problem can be as low as $O(|T_1||T_2|)$. These ideas are prior art; they are introduced here only to aid understanding by those skilled in the art and are not described further.
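The recursion on forests can be illustrated with a short memoized sketch. It is not the implementation of the embodiment: the tuple representation of trees, the unit cost function `gamma`, and the names `forest_dist` and `tree_dist` are assumptions made for illustration, and v and w are taken to be the rightmost roots of the two forests.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees; the empty tuple plays the role of θ.


def gamma(a, b):
    """Assumed unit cost: 0 for equal labels, otherwise 1 (None stands for λ)."""
    return 0 if a == b else 1


@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    if not f1 and not f2:
        return 0
    if not f2:
        v_label, v_children = f1[-1]
        # F1 - v: remove the rightmost root, its children take its place.
        return forest_dist(f1[:-1] + v_children, ()) + gamma(v_label, None)
    if not f1:
        w_label, w_children = f2[-1]
        return forest_dist((), f2[:-1] + w_children) + gamma(None, w_label)
    (v_label, v_children), (w_label, w_children) = f1[-1], f2[-1]
    return min(
        # delete v
        forest_dist(f1[:-1] + v_children, f2) + gamma(v_label, None),
        # insert w
        forest_dist(f1, f2[:-1] + w_children) + gamma(None, w_label),
        # map v to w: match their child forests and the remaining forests
        forest_dist(v_children, w_children)
        + forest_dist(f1[:-1], f2[:-1])
        + gamma(v_label, w_label),
    )


def tree_dist(t1, t2):
    return forest_dist((t1,), (t2,))
```

With unit costs, removing one list item from a two-item list costs exactly one delete operation, which is why list length changes alone would register as structural change without the list compression step described later.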
Preferably, the edit distance of the tree is used as the calculation method. The step of calculating the structural similarity of the DOM subtrees of the current content block and the initial content block may comprise: arranging the tags in the DOM subtrees of the current content block and of the initial content block, in order, into character strings; and calculating the structural similarity according to the edit distance between the character strings of the two content blocks, in combination with the key content.
During actual monitoring, the list elements contained in the content block bring some special problems to the calculation of the structural similarity of the content block, and the most important problem is that the number of the list elements is not fixed. The change of the number of the list elements before and after can cause the addition and deletion of nodes in the content block, thereby influencing the structure of the content block. However, the inventors believe that such variations should not be taken into account when calculating structural similarity, since typically developers will treat this part as a list in its entirety, rather than focusing on individual list items. Therefore, the addition and deletion of list items generally does not invalidate the processing of the list. To solve this problem, calculating the structural similarity according to the editing distance of the character string between two content blocks and the key content may be processed by the following steps:
arranging, in order, the tags in the respective DOM subtrees of the compressed current content block and the compressed initial content block into character strings;
wherein the step of compressing a content block comprises:
copying the content block CB to a list-item-compressed content block CCB, removing all child nodes from the child node list children of the CCB, and initializing a new child node list cchildren as an empty list;
performing a depth-first traversal of the content block, recursively applying the list item compression operation to each child node of the content block in turn to obtain a compressed child node cchild;
searching cchildren for a child node whose structure is similar to that of cchild;
when no child node with a structure similar to cchild is found, adding the compressed cchild to cchildren;
assigning cchildren as the children of the CCB and returning the CCB.
The embodiment of the invention provides a compression algorithm for list elements in a content block, which compresses similar list items in a bottom-up manner and finally retains only one item as the description of the list item structure. The final output is the DOM subtree of the compressed content block, on which subsequent structural checks are based, thereby eliminating the effect of variations in the number of list items.
It should be noted that the embodiment of the present invention searches for similar nodes rather than identical nodes, for the following reason: many Web pages use a special structure for certain list items in order to emphasize important content. As shown in fig. 4, for the most recent content the page adds an additional sup tag inside the a tag of the list item, whereas relatively older list items do not have this tag. Such hint nodes are generally unrelated to the actual content and are not part of the key content, so these subtle structural differences do not affect the docking between systems. If only list item nodes with identical structure were compressed, then when a newly acquired page happens to contain no new content, the structure of the content block could be mistakenly judged to have changed. To handle this situation, the algorithm looks for similar nodes instead of identical nodes. As for measuring node similarity, since the differences between similar nodes are small, the algorithm arranges the tags in the DOM subtree of each node into a character string in pre-order traversal order, and then judges the similarity of the nodes from the edit distance between the character strings, in combination with the specified key content.
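The list item compression and the similarity test on pre-order tag sequences can be sketched together as follows. The `Node` class, the normalization of the edit distance into a similarity score, and the threshold of 0.6 are assumptions for illustration; the text does not give a threshold, and the part of the judgment that involves the specified key content is omitted here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def preorder_tags(node: Node) -> List[str]:
    """Tags of the subtree in pre-order traversal order."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(preorder_tags(child))
    return tags


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def similar(a: Node, b: Node, threshold: float = 0.6) -> bool:
    """Assumed similarity rule: 1 - distance / longer length >= threshold."""
    ta, tb = preorder_tags(a), preorder_tags(b)
    return 1 - edit_distance(ta, tb) / max(len(ta), len(tb)) >= threshold


def compress(cb: Node) -> Node:
    """Bottom-up list item compression: keep one of each group of similar children."""
    cchildren: List[Node] = []
    for child in cb.children:
        cchild = compress(child)
        if not any(similar(cchild, kept) for kept in cchildren):
            cchildren.append(cchild)
    return Node(cb.tag, cchildren)
```

In this sketch a list item containing an extra sup tag is still merged with its plain siblings, so adding or removing list items leaves the compressed structure unchanged.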
In summary, steps S101 to S104 help developers promptly and proactively discover relevant changes in the page structure of the target Web application, and thereby determine whether the existing inter-system docking has failed. This overcomes the difficulty that pages in existing Web applications often cannot be accessed directly through a URL and instead require a series of pre-operations, such as login or multiple clicks, before their content can be acquired.
Next, how to locate the key content in the changed page and assist the developer in repairing the integrated docking between the Web applications is a problem that needs to be solved by the present invention. With continued reference to fig. 1, the method may further include the following steps:
Step S105, storing the changed Web page as historical version page data;
To facilitate development work after a page change is detected, in the embodiment of the present invention, whenever it is determined that the Web page has no relevant change, the content of the Web page acquired by the monitoring task is stored as historical version page data of the Web page. The historical version page data stored in the embodiment of the invention can serve as the basis for the subsequent positioning of the key content.
Step S106, obtaining all historical content blocks of the historical version page data, and extracting structural features and text features of each historical content block;
Step S107, integrating the structural features and the text features of all the historical content blocks, sequentially positioning the historical target content block and the key content in the historical target content block, and determining the first features of the historical target content block;
The positioning of the key content must rely on features of the key content that remain invariant. However, the key content is usually text information in a page; its structure is simple and it carries relatively little feature information. On the other hand, the key content is usually distributed in a concentrated way within the Web page and is contained in a relatively larger content block, which generally has a richer structure from which richer features can be obtained. Therefore, the present invention divides the key content positioning process into two steps: positioning the content block, and positioning the key content within the content block. This step extracts the features of the content block, including structural features, text features, and the like.
In this embodiment of the present invention, the structural feature may be obtained from a structure of a DOM subtree corresponding to each history content block, including:
The ratio of the height of the DOM subtree to the height of the DOM tree corresponding to the entire page; the height of the DOM subtree is the length of the longest path from the root node of the DOM subtree corresponding to the historical content block to a leaf node of that subtree.
The ratio of the number of nodes of the DOM sub-tree to the number of nodes of the DOM tree corresponding to the entire page; the number of nodes only considers the tag nodes and not the text nodes.
The ratio of picture nodes in the DOM sub-tree; i.e. the ratio of the number of nodes tagged with img in the DOM subtree to the total number of nodes of the DOM subtree.
The ratio of text nodes in the DOM sub-tree; i.e. the ratio of the number of nodes labeled p in the DOM subtree to the total number of nodes of the DOM subtree.
The ratio of hyperlink nodes in the DOM sub-tree; i.e. the ratio of the number of nodes labeled a in the DOM subtree to the total number of nodes of the DOM subtree.
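The five structural features listed above can be computed with a few lines of code. The sketch below assumes a minimal `Node` class holding only tag nodes (text nodes are not represented, in line with the node-count rule), and the feature names in the returned dictionary are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def height(node: Node) -> int:
    """Length (in edges) of the longest path from the node to a leaf."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children)


def count_nodes(node: Node, tag: Optional[str] = None) -> int:
    """Number of tag nodes in the subtree, optionally restricted to one tag."""
    own = 1 if tag is None or node.tag == tag else 0
    return own + sum(count_nodes(child, tag) for child in node.children)


def structural_features(block: Node, page: Node) -> Dict[str, float]:
    n = count_nodes(block)
    return {
        # max(..., 1) guards against a page consisting of a single node
        "height_ratio": height(block) / max(height(page), 1),
        "node_ratio": n / count_nodes(page),
        "img_ratio": count_nodes(block, "img") / n,
        "text_ratio": count_nodes(block, "p") / n,
        "link_ratio": count_nodes(block, "a") / n,
    }
```

Because every feature is a ratio, the values stay comparable between pages of different sizes, which fits the assumption that the basic structure of a content block changes little across versions.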
The text features are obtained from HTML code of each historical content block, including:
a title of the content block;
text contained in the content block;
a ratio of a text length contained by the content block to a total text length of the page;
a common prefix of a hyperlink in a content block.
The above features are all for a single content block, and when performing the content block feature extraction, the embodiment of the present invention usually already saves a plurality of history pages, so there are a plurality of history content blocks. The above-mentioned features can be extracted separately for each historical content block, and then further the features of all the content blocks are integrated for positioning of the final content block (i.e. the historical target content block).
In specific implementation, the features are integrated as follows. For the numerical features, including the various ratios, the method of the embodiment of the invention integrates the feature values by averaging them. For the titles of the content blocks, since no relevant change has occurred among the historical content blocks, the titles are necessarily identical, so no further processing is needed. For the text contained in the content blocks, the embodiment of the invention computes the integrated feature by concatenating the text of the individual content blocks. Finally, for the common prefix of the hyperlinks in a content block, if the feature is equal across all content blocks, its value is used directly; otherwise the feature is set to an empty string. In addition to the integrated features calculated directly from the features of each content block, when all content blocks are considered together the embodiment of the invention can further obtain the following feature: the common text content of the content blocks, that is, text content that appears at the same position in each content block and occurs more than once.
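A minimal sketch of this integration step is given below, assuming each historical content block has already been reduced to a dictionary of features. The key names (`title`, `text`, `link_prefix`) and the helper `link_prefix` are hypothetical; the character-wise common prefix computed by `os.path.commonprefix` is one possible reading of "common prefix of a hyperlink".

```python
import os
from statistics import mean
from typing import Dict, List, Union

Feature = Union[float, str]


def link_prefix(hrefs: List[str]) -> str:
    """Character-wise common prefix of the hyperlinks in one content block."""
    return os.path.commonprefix(hrefs)


def integrate(blocks: List[Dict[str, Feature]]) -> Dict[str, Feature]:
    """Combine the features of all historical content blocks into one set."""
    first = blocks[0]
    combined: Dict[str, Feature] = {}
    for key, value in first.items():
        values = [block[key] for block in blocks]
        if isinstance(value, (int, float)):
            combined[key] = mean(values)      # numerical features: average
        elif key == "title":
            combined[key] = value             # titles are identical by assumption
        elif key == "text":
            combined[key] = " ".join(values)  # texts: concatenate
        elif key == "link_prefix":
            # keep the prefix only if every block agrees on it
            combined[key] = value if all(v == value for v in values) else ""
    return combined
```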
The selection of the above features is briefly explained here. The structural features are selected on the assumption that the basic structure of a content block does not change greatly before and after a change and retains a certain similarity. Among the text features, the title of a content block is directly related to the semantics the content block expresses, and can remain consistent before and after a change. Hyperlinks are usually tied to the file organization structure of the Web application, which generally does not change frequently, so the contents of the hyperlinks have a certain consistency before and after a page change. Common text generally consists of words appearing in the page template and usually reveals the semantics of data items, such as the header of a table, which indicates the meaning of each column of data; it is therefore stable under change.
The extraction of the common text feature is further described below; it can be combined with the change detection process and updated with each detection. As mentioned for case C of key content change detection, the similarity between content blocks can be calculated by the tree edit distance. During that calculation, a mapping between the nodes of the two content blocks is obtained. For each pair of corresponding nodes, if the nodes are leaf nodes containing text, the text in each node is extracted, yielding a pair of text strings. The two text strings are then segmented into words and stop words are removed, after which the longest common subsequence of the remaining word sequences is calculated to obtain the common text part. Each time a new Web page is obtained, its common text part with the initial Web page is calculated according to this process and marked; the text that is marked every time is the common text feature of the content block.
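The common text part of one pair of text strings can be sketched with a standard longest common subsequence computation over word sequences. Whitespace tokenization and the small `STOP_WORDS` set are assumptions for illustration; Chinese pages would need a real word segmenter and stop word list.

```python
from typing import List

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def tokens(text: str) -> List[str]:
    """Assumed segmentation: lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def lcs(a: List[str], b: List[str]) -> List[str]:
    """Longest common subsequence of two word sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to recover one longest common subsequence.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def common_text(old: str, new: str) -> List[str]:
    return lcs(tokens(old), tokens(new))
```

For a template phrase followed by a changing date, only the template words survive, which matches the observation that common text usually comes from the page template.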
Step S108, aiming at the newly monitored changed new version Web page, acquiring all subtrees of a DOM tree of the new version Web page, taking each subtree as a content block to be matched, and respectively extracting the characteristics of each content block to be matched;
in this step, for the extraction of the features of the content blocks of the new version Web page, reference may be made to the description of the extraction of the features of the historical content blocks in steps S106 to S107, which is not repeated herein.
Step S109, traversing all the content blocks to be matched, performing similarity calculation on the characteristics of each content block to be matched and the first characteristics of the historical target content block, and positioning the current target content block;
After the features of each content block to be matched in the new version Web page have been extracted, matching between the content blocks of the new and old version Web pages begins: each content block to be matched in the new version Web page is matched against the historical target content block of the old version Web page, and the similarity is calculated from the integrated features and the features of the content block being examined. In specific implementation, the matching of content blocks is modeled as a search: the new version page is regarded as a set of content blocks to be searched (to be matched), from which the content block most similar to the historical target content block must be found. In the matching algorithm, the set of content blocks to be searched is first extracted from the changed page; the content blocks in the set are then traversed, and the similarity of each to the target block is calculated. If the content block currently being traversed is better than the best content block found so far, the algorithm updates the best result. After all content blocks to be searched have been traversed, the algorithm returns the best content block obtained.
In the above algorithm, the rules for extracting the set of content blocks to be searched from the page are as follows: 1) If the title in the integrated features of the target content block is not empty, the title node is first found in the new version page according to the title content, and this node is added to the set of content blocks to be searched. The parent of the node is then examined: if the node is the first child of its parent, the parent is added to the set of content blocks to be searched, and the same check is applied to the parent and its own parent in turn; otherwise the search stops, and the set of content blocks to be searched is complete. 2) If the title in the integrated features of the target content block is empty, all subtrees of the DOM tree of the new version page are added directly to the set of content blocks to be searched.
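The candidate extraction rules and the search loop can be sketched as follows. The `Node` class, the exact-match title lookup in `find_title`, and the `similarity` callback are assumptions for illustration; returning an empty set when the title is not found in the new page is also an assumption, since the text does not cover that situation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False, repr=False)
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def all_subtrees(root: Node) -> List[Node]:
    nodes = [root]
    for child in root.children:
        nodes.extend(all_subtrees(child))
    return nodes


def find_title(root: Node, title: str) -> Optional[Node]:
    for node in all_subtrees(root):
        if node.text.strip() == title:
            return node
    return None


def candidate_blocks(root: Node, title: str) -> List[Node]:
    if not title:
        return all_subtrees(root)           # rule 2: every subtree is a candidate
    node = find_title(root, title)
    if node is None:
        return []                           # assumed behaviour, not in the text
    candidates = [node]
    # rule 1: climb while the current node is the first child of its parent
    while node.parent is not None and node.parent.children[0] is node:
        node = node.parent
        candidates.append(node)
    return candidates


def best_match(candidates: List[Node],
               similarity: Callable[[Node], float]) -> Optional[Node]:
    best, best_score = None, float("-inf")
    for block in candidates:
        score = similarity(block)
        if score > best_score:              # keep the best block seen so far
            best, best_score = block, score
    return best
```

When a title is available, the candidate set shrinks from every subtree of the page to a short chain of ancestors, which keeps the subsequent similarity calculation cheap.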
The calculation of the similarity depends on the features of the content blocks. However, the feature values are of various types, including numerical features and text features, so it is difficult to calculate the similarity directly, and the features must first be preprocessed. The general approach is to convert all kinds of features into numerical form to build feature vectors, and then to calculate the cosine similarity of the feature vectors. If the converted feature vector of the target content block is $\langle a_1, a_2, \dots, a_n \rangle$ and the feature vector of the content block to be examined is $\langle b_1, b_2, \dots, b_n \rangle$, then the similarity between the two content blocks is defined as:
$$\mathrm{sim} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
In the embodiment of the invention, the features are preprocessed as follows. For numerical features, the value of the feature is used directly without processing. For the text features of the content blocks, the text of each content block to be searched and of the target content block is regarded as a single document, all the documents form a corpus, and each document is then converted into a vector through TF-IDF. For the hyperlink feature, if the common hyperlink prefix of the content block to be searched is equal to that of the target content block, the content block should receive a higher similarity; the feature is therefore set to 1 if the prefix is equal to the target prefix and to 0 otherwise. The common text feature is processed in a similar way to the hyperlink feature: if the content block to be searched contains the text, the corresponding position of the vector is set to 1; otherwise it is set to 0. Once all the features have been converted in this way, a feature vector that can be used for similarity calculation is obtained.
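The preprocessing and the cosine similarity can be illustrated with a self-contained sketch. The smoothed TF-IDF weighting, whitespace tokenization, and the function names `tfidf_vectors` and `feature_vector` are assumptions; the text only states that TF-IDF is used, not which variant.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def tfidf_vectors(docs: List[str]) -> List[List[float]]:
    """One TF-IDF vector per document over the shared vocabulary (assumed smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    df = Counter(w for doc in tokenized for w in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        size = max(len(doc), 1)
        vectors.append([(tf[w] / size) * (math.log((1 + n) / (1 + df[w])) + 1)
                        for w in vocab])
    return vectors


def feature_vector(numeric: Sequence[float], text_vec: Sequence[float],
                   prefix: str, target_prefix: str,
                   text: str, common_words: Sequence[str]) -> List[float]:
    words = set(text.lower().split())
    vec = list(numeric) + list(text_vec)
    vec.append(1.0 if prefix == target_prefix else 0.0)       # hyperlink feature
    vec.extend(1.0 if w in words else 0.0 for w in common_words)  # common text
    return vec
```

A usage pattern would be to compute `tfidf_vectors` once over the target text and all candidate texts, build one `feature_vector` per block, and rank candidates by `cosine` against the target vector.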
And then positioning the current target content block in the new version Web page according to the feature vector which can be used for calculating the similarity.
Step S110, extracting the second characteristics of the key contents in the historical target content block and the characteristics of the key contents in the current target content block;
After the current target content block is located, the key content contained in it needs to be further mapped and located. For the idea of feature extraction for the key content in the target content block, reference may be made to the description of feature extraction for the historical content blocks in steps S106 to S107. Specifically, a similar recursive-descent approach is adopted, by analogy with the content block positioning process: if the matched content block is regarded as the whole Web page, then the key content to be located plays the role of the content block located previously. In other words, the content block is "zoomed in", and the same matching process can be used to locate the key content inside it. First, the features of the key content are extracted, including its tag, attributes, relative position in the content block, and so on.
After locating a content block containing key content, the key content needs to be relocated inside the content block, since there may also be differences in the structure inside the content block. As mentioned earlier, the key content basically corresponds to the leaf nodes of the page DOM tree, so the extraction of its features is more based on the tags of the nodes where the data is located and some characteristics of its text content, i.e. the second features of the key content in the historical target content block are obtained from the leaf nodes of the page DOM tree corresponding to the key content in the historical target content block, and specifically includes: a label of the node; the length of the node text; a data mode of the node text; the id and class attributes of the node.
Step S111, establishing a mapping relation of the key content between the new and old version pages according to the edit distance between the DOM subtrees corresponding to the second features of the key content in the historical target content block and to the features of the key content in the current target content block, and positioning the final key content in the current target content block.
The matching of the key content differs from the matching of content blocks: for the key content, the mapping relation between key content nodes is calculated using the edit distance of the tree. The tag of the node, the length of the node text, the data pattern of the node text, and the id and class attributes of the node are used to calculate the matching cost of a node pair, namely the γ function mentioned in case C of the embodiment of the invention. As before, the features are first preprocessed:
processing the node tag feature: 10 common HTML tags are selected, all other tags are grouped into one additional category to give 11 categories, and the feature is converted into an 11-dimensional vector by one-hot encoding;
processing the length feature of the node text: the length value is used directly as one dimension of the final feature vector;
processing the data pattern feature of the node text: the embodiment of the invention predefines the common formats of information such as dates, mailboxes, telephone numbers and temperatures as regular expressions, judges from the result of regular expression matching whether the text content belongs to a given information category, and finally converts the feature into a vector by one-hot encoding;
processing the id and class features: these are classified according to whether they are equal to the id and class of the target content.
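The four preprocessing rules can be sketched as a single feature function. The particular ten tags in `COMMON_TAGS`, the regular expressions in `PATTERNS`, and the extra "other" category of the pattern vector are all hypothetical choices; the text says only that ten common tags and several common formats are predefined.

```python
import re
from typing import List

# Hypothetical selection of 10 common tags; any other tag falls into an 11th category.
COMMON_TAGS = ["a", "span", "p", "td", "li", "div", "h1", "h2", "strong", "em"]

# Hypothetical patterns for illustration; real formats would need broader coverage.
PATTERNS = {
    "date": re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$"),
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "phone": re.compile(r"^\+?\d[\d\- ]{6,14}\d$"),
    "temperature": re.compile(r"^-?\d+(\.\d+)?\s?°[CF]$"),
}


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def tag_vector(tag: str) -> List[float]:
    index = COMMON_TAGS.index(tag) if tag in COMMON_TAGS else len(COMMON_TAGS)
    return one_hot(index, len(COMMON_TAGS) + 1)


def pattern_vector(text: str) -> List[float]:
    names = list(PATTERNS)
    for i, name in enumerate(names):
        if PATTERNS[name].match(text.strip()):
            return one_hot(i, len(names) + 1)
    return one_hot(len(names), len(names) + 1)  # no predefined pattern matched


def node_features(tag: str, text: str, node_id: str, node_class: str,
                  target_id: str, target_class: str) -> List[float]:
    return (tag_vector(tag)
            + [float(len(text))]
            + pattern_vector(text)
            + [float(node_id == target_id), float(node_class == target_class)])
```

A cost function for node pairs could then be derived from the distance between two such vectors, although the text does not specify how the γ function is computed from them.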
After the second features of the key content in the historical target content block and the features of the key content in the current target content block have been preprocessed, the mapping relation of the key content between the new and old version pages is established according to the edit distance between the DOM subtrees corresponding to these features. The minimum edit distance yields a mapping between the nodes of the two trees, and this mapping is used as the mapping between the final key content, so that the final key content is located in the current target content block.
In summary, steps S105 to S111 provide a step-by-step page key content positioning technique that integrates multi-modal features. By extracting and integrating features such as the text and structure of the key content in the old version of the page, the content block and then the key content are located in turn in the new version of the page. Finally, the mapping between the key content in the new and old versions of the page, and the change in the element locators of the key content, are presented visually, so as to assist developers in repairing the existing system integration scheme.
To put the monitoring method into practical use, the monitoring process of the embodiment of the invention can be packaged as a service in a Web application system that provides friendly user interaction. The overall architecture of the system is shown in fig. 5 and is mainly divided into a front end and a back end. The back end comprises the monitoring task management module mentioned above, which is further divided into a monitoring task storage module and a monitoring task scheduling module, together with a page acquisition module, a page storage module, a page change detection module, and a key content positioning module. In addition, a change notification module and a system state self-checking module are added. The change notification module is responsible for sending a page change notification to developers or system operation and maintenance personnel after a change has been detected and the key content positioning process has completed, so that they learn of the page change in time and can respond to it. The system state self-checking module checks the running state of the system; since the system may be integrated with other application systems as a microservice, checking its state is important. The back-end interfaces include functional interfaces and interfaces that expose system functions externally; the back-end interfaces of the embodiment of the invention are shown in table 1.
TABLE 1
Interface name | Description
Monitoring task registration interface | Receives the user's monitoring configuration and generates a page monitoring task
Monitoring list acquisition interface | Obtains all currently registered page monitoring tasks
Monitoring task modification interface | Modifies the configuration of an existing page monitoring task
Monitoring task deletion interface | Deletes a page monitoring task
Monitoring task start/stop interface | Pauses or restarts a page monitoring task
Monitoring result acquisition interface | Obtains the result of a page monitoring task
System running state acquisition interface | Obtains the current running state of the whole system
The front end of the system mainly comprises a monitoring task configuration interface, a monitoring task management interface, a monitoring result display interface, and a system running state management interface. The monitoring task configuration interface lets the user configure page monitoring, and therefore includes input controls for the URL of the target page, the XPath of the key content, and the monitoring frequency. In addition, for a target page that can be accessed only after login, the interface provides an entry for the user to pre-log in to the system, and the initial login state information is stored in cooperation with the back end. After completing the configuration, the user submits it through the submit control of the interface, and the page monitoring configuration is sent to the back end to register a new page monitoring task.
The monitoring task management interface shows the user the list of registered page monitoring tasks and provides controls for life cycle management of the monitoring tasks, such as editing, starting/stopping, and deleting. The interface also includes an entry that opens the monitoring task configuration interface for configuring a new monitoring task. The status of each monitoring task is also shown briefly in the interface, for example whether a change in the page has been detected, so that the user can learn the current monitoring result. The interface lets the user search for monitoring tasks by monitoring state, target page, and the like, so that a given monitoring task can be found quickly. The interface also includes an entry to the monitoring result display interface, where the user can obtain more detailed results of change detection and key content positioning.
The monitoring result display interface is used for displaying the monitoring result of the target page in detail, detecting page change and positioning key content, visually displaying the corresponding relation between the content blocks of the new version and the old version and the key content to a user, and assisting the user in processing the page change in the follow-up process.
The system running state management interface is used for displaying the self-checking state of the system and helping a user to know the running state of the current system.
For the monitoring task configuration, monitoring task management, and monitoring result display interfaces, a specific implementation may work as follows. After entering the system, the user first arrives at the monitoring task management interface, which lists all registered monitoring tasks. The "status" column of each task shows its current monitoring status, and the "operations" column contains the controls for the task. An "add" button at the upper right of the interface opens the monitoring task configuration interface. The configuration interface contains an input box for the URL of the target Web page and input boxes for the key content XPaths, to which entries can be added dynamically. It also contains a switch indicating whether the target system requires login. If the switch is on, the user can click a "pre-login" button; the system then opens a new interface and jumps to the target Web application, where the user logs in, and the login information is recorded and stored in the configuration of the monitoring task. After configuration is complete, the user clicks the "add" button to add the configured monitoring task, which then appears in the task list. When a monitoring task detects a change in the Web page and completes the key content positioning process, the "status" column of the task changes to abnormal, and clicking the status opens the monitoring result display interface. The monitoring result display interface visually shows the mapping between content blocks and key content in the new and old versions of the page, together with the XPath of the key content in the new version.
Next, specific examples are used to verify the effect of the key content positioning method according to the embodiment of the present invention.
First, the method of the embodiment of the invention is applied to several representative real-world cases of Web application change, to detect the change and position the key content. The results show that the method can detect the change of the Web page and locate the required key content in the changed page, which demonstrates its effectiveness. Then the accuracy of the change detection and content positioning processes is verified on a larger page data set from 18 websites covering common Web page types. The results show that the method detects Web page changes and positions key content with high accuracy.
First, case study: a website X in China.
Taking the home page of a website X in China as an example, this case verifies the change detection and content block positioning processes, and involves the login processing, title recognition, and feature extraction methods provided by the embodiment of the invention.
The experimental Web page data all come from the Web Archive website, whose captured pages do not include pages that can be accessed only after logging in to a Web system. Therefore, to simulate a Web system that requires login and so verify the login handling proposed herein, the experiment uses the login interface of website X together with historical data of its home page to simulate a "login version" of the website X system, in which the home page can be accessed only after login. The simulated system contains 17 crawled versions of the website X home page, identified as p_1 to p_17, where p_1 to p_16 are pages of the same version at different times and p_17 is a page after a version change. Successive accesses to the system return the contents of p_1, p_2, ..., p_17 in turn. In this example, the Web page change monitoring system implemented according to the embodiment of the present invention performs change monitoring and key content positioning on the simulated website X system.
Figs. 6a, 6b, and 6c show three representative home pages of website X, corresponding to p_1, p_7, and p_17 respectively, where "web-aware dynamic" is the data to be acquired in this example, that is, the key content of the page. p_1 and p_7 have a similar structure but differ in the number of dynamic entries, so p_7 has 4 more li nodes than p_1, as shown in fig. 7a; as described earlier, the embodiment of the present invention does not regard this kind of structural change as a relevant change. p_1 and p_17, by contrast, differ considerably in structure, and the relative position of the key content "web-aware dynamic" in the page also changes. As shown in fig. 7b, the content block cannot be located in p_17 using the XPath of "web-aware dynamic" in p_1, so the page has undergone a relevant change. The expected result for this example is therefore that the system reports CHANGED after acquiring and checking p_17 and then locates the content block and key content in p_17; before that, every check of the Web page should report NO_DIFFERENCE.
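The first step of this check, whether the configured path still locates a node, can be sketched as follows. This is an illustration only: the three page snippets are invented stand-ins for p_1, p_7, and p_17, the path is hypothetical, and Python's xml.etree.ElementTree (which supports only a subset of XPath and requires well-formed markup) stands in for a real HTML parser. When the node is located, the method still goes on to compare content block titles, which this sketch does not cover.

```python
import xml.etree.ElementTree as ET

# Invented stand-ins for three page versions (not the real pages).
P1 = ("<html><body><div id='news'><ul>"
      "<li><a>a</a></li><li><a>b</a></li>"
      "</ul></div></body></html>")
P7 = ("<html><body><div id='news'><ul>"
      "<li><a>a</a></li><li><a>b</a></li><li><a>c</a></li><li><a>d</a></li>"
      "</ul></div></body></html>")
P17 = ("<html><body><div class='main'><div class='col'><ul>"
       "<li><a>a</a></li>"
       "</ul></div></div></body></html>")

# Hypothetical path to the key content, relative to the root html element.
KEY_CONTENT_PATH = "./body/div[@id='news']/ul/li[1]/a"


def key_content_located(page_source: str, path: str) -> bool:
    """Return True if the path still resolves to a node in this page version."""
    root = ET.fromstring(page_source)
    return root.find(path) is not None


for name, page in (("p_1", P1), ("p_7", P7), ("p_17", P17)):
    located = key_content_located(page, KEY_CONTENT_PATH)
    # A page where the node cannot be located is reported as CHANGED.
    print(name, "located" if located else "CHANGED")
```

Extra list items, as in the second snippet, leave the path resolvable; the restructured third snippet does not.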
First, consider the handling of system login. This example uses an existing platform Y to wrap the login operation of the system as a service and generate a corresponding login interface; platform Y is likewise used to generate a data interface for the home page shown after login, and this data interface returns the complete page content directly. After a single call to the login interface, the related Session information is managed by platform Y; subsequent calls to the home page interface obtain the page content directly and update the Session information, so page data can be obtained continuously through the interface. The login and home page interfaces are shown in fig. 7c.
With the login interface and the home page content interface in place, monitoring of the Web system page can begin. A monitoring task is registered in the system using these two interfaces; the system then starts monitoring and locates the key content once a change is detected. The system log in fig. 7d shows that the system detected a page change on the 17th execution of the monitoring task and reported CHANGED, as expected. The system then performed content block positioning and key content mapping in the changed page. For this example, the title recognition algorithm provided in the embodiment of the present invention identifies the title of the content block as "web-aware dynamic", so only content blocks containing that title are examined in the new version page. The final content block matching result and key content mapping computed by the system are shown in fig. 7e. For this example, the method of the embodiment of the present invention accurately locates the key content in the page.
Second, case study: the intellectual property office of a province.
This example takes the system of the intellectual property office of a province (the example shown in fig. 3) and mainly verifies the correctness of the title recognition method provided by the embodiment of the present invention on a special page structure, and its effectiveness in the change detection process.
In this example, the two versions of the page shown in fig. 3 and the specified XPath of the key content are used as input. A node can be located in both versions of the page using the specified XPath, so the task is to determine whether the node located is the wrong one. The method first obtains the XPath of the content block, //*[@id="newsrt1_1"], by computing the nearest common ancestor node; in the old version of the page this path corresponds to the "city and county dynamic" content. In the new version of the home page, two tab pages, "national file" and "province file", have been added, and the XPath above now locates the content of the "province file" tab page. The method finds that the titles of the two content blocks are "city and county dynamic" and "province file" respectively. Because the two titles differ, it correctly reports CHANGED and thus finds a more subtle page change, which demonstrates the correctness of the title recognition method provided by the embodiment of the present invention and its effectiveness as an auxiliary means of change detection.
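The nearest common ancestor computation mentioned above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name nearest_common_ancestor, the page snippet, and the id value "block" are hypothetical, and xml.etree.ElementTree is used in place of a real HTML DOM.

```python
import xml.etree.ElementTree as ET


def nearest_common_ancestor(root, nodes):
    """Return the deepest element that contains every element in nodes."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def path_from_root(node):
        path = [node]
        while path[-1] in parent_of:
            path.append(parent_of[path[-1]])
        return path[::-1]  # root first, node last

    ancestor = None
    # Walk the root-to-node paths in lockstep until they diverge.
    for level in zip(*(path_from_root(n) for n in nodes)):
        if all(n is level[0] for n in level):
            ancestor = level[0]
        else:
            break
    return ancestor


# Invented page: two key content links inside one list.
PAGE = ("<html><body><div id='block'><ul>"
        "<li><a>item 1</a></li><li><a>item 2</a></li>"
        "</ul></div><div id='other'/></body></html>")
root = ET.fromstring(PAGE)
key_nodes = root.findall(".//a")
content_block = nearest_common_ancestor(root, key_nodes)
print(content_block.tag)
```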
In the accuracy verification stage, 18 Web systems corresponding to actual projects on platform Y were selected. Historical page versions of these Web systems from 2014 to 2018 were crawled from the Web Archive website, yielding 2836 Web pages in total.
First, the change detection process was verified experimentally. From the 2836 Web pages, 79 page pairs were selected to form 79 test cases for change detection; each test case contains Web page data of the same Web application at different times and the XPath of the content block to be monitored. Of the 79 test cases, 56 involve no relevant change and the remaining 23 involve a relevant change. Table 2 gives the evaluation indexes defined in this example for the change detection process; this example mainly considers the precision (accuracy rate) and recall of detection.
TABLE 2 Change detection evaluation index
[Table supplied as images in the original publication: Figure BDA0002304950020000161, Figure BDA0002304950020000171]
TABLE 3 Change detection results
 | Relevant change actually present | No relevant change actually present
Relevant change detected | 23 | 2
No relevant change detected | 0 | 54
Table 3 shows the change detection results on the 79 test cases. The precision (accuracy rate) of the Web page change detection method proposed herein is P = 23/(23 + 2) = 92%, and the recall is R = 23/(23 + 0) = 100%.
Therefore, the detection method of the embodiment of the invention detects Web page changes with high accuracy.
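The two figures follow from the counts in Table 3 by the standard definitions of precision and recall, as this short check shows. The function and parameter names are illustrative.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Standard precision and recall from confusion-matrix counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Counts from Table 3: 23 relevant changes detected correctly,
# 2 reported where no relevant change existed, 0 missed.
p, r = precision_recall(true_positives=23, false_positives=2, false_negatives=0)
print(f"P = {p:.0%}, R = {r:.0%}")
```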
Next, the content block positioning process was verified. The positioning result given in this example is a recommendation list of content blocks, so the two evaluation indexes shown in table 4 were chosen for verifying this process. The experiment again uses the pages of the 18 Web systems: for each Web system, a series of old version pages and one new version page were selected to form a test case, giving 18 test cases in total. These test cases were used as input to the content block positioning process, and the positioning results it output were tallied; the results are shown in table 5.
Table 4 Content block positioning evaluation indexes
Index name | Meaning of index
Recommendation accuracy rate (P1) | Proportion of test cases in which the target content block is among the top five of the recommendation list
Best recommendation accuracy rate (P2) | Proportion of test cases in which the target content block is first in the recommendation list
Table 5 Content block positioning results
Number of recommendations | 16
Optimal number of recommendations | 14
According to the experimental results, the recommendation accuracy rate of the content block positioning process in this example is P1 = 16/18 = 88.9%, and the best recommendation accuracy rate is P2 = 14/18 = 77.8%. The embodiment of the invention therefore positions content blocks with good accuracy.
Finally, the key content mapping process was verified. From the 18 Web systems, 18 pairs of new and old version pages were selected, the positions of the content blocks in both versions were specified manually, and this information was used as input to the key content mapping module. The old version content blocks contain 349 key content items to be mapped in total, and mapping accuracy is used as the evaluation index of this process. The key content mapping method of this example mapped 319 key content items correctly, a mapping accuracy of 319/349 = 91.4%, which shows that the method of the embodiment of the present invention maps key content with high accuracy.
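Both indexes are hit rates over the recommendation lists: P1 counts the target within the top five and P2 counts it in first place. The sketch below shows the calculation. The per-case ranks are NOT experimental data; they are a made-up distribution chosen only so that the counts match Table 5 (14 first-place hits, 16 top-five hits, 18 cases).

```python
from typing import List, Optional


def hit_rate(ranks: List[Optional[int]], k: int) -> float:
    """Share of test cases whose target block appears within the top k.

    ranks holds the 1-based position of the target block in each
    recommendation list, or None when the target is not in the list.
    """
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)


# Hypothetical ranks, constructed only to reproduce the counts in Table 5.
ranks = [1] * 14 + [3, 5] + [8, None]

p1 = hit_rate(ranks, k=5)  # recommendation accuracy rate
p2 = hit_rate(ranks, k=1)  # best recommendation accuracy rate
print(f"P1 = {p1:.1%}, P2 = {p2:.1%}")
```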
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The key content positioning method provided by the invention is described in detail, and the principle and the implementation mode of the invention are explained by applying specific examples, and the description of the examples is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method for locating key content, comprising:
periodically acquiring a Web page to be monitored;
positioning a corresponding node in a Document Object Model (DOM) tree of the Web page according to key content in preset monitoring configuration information;
when the corresponding node cannot be positioned, determining that the Web page changes;
when the corresponding node is positioned, obtaining a current content block containing key content in the monitoring configuration information from the Web page, comparing the title of the current content block with the title of an initial content block obtained from the Web page in an initialization monitoring task, and determining whether the Web page is changed;
storing the changed Web page as historical version page data;
acquiring all historical content blocks of the historical version page data, and extracting structural features and text features of each historical content block;
integrating the structural features and text features of all historical content blocks, sequentially positioning the historical target content blocks and the key contents in the historical target content blocks, and determining the first features of the historical target content blocks;
aiming at a newly monitored changed new version Web page, acquiring all subtrees of a DOM tree of the new version Web page, taking each subtree as a content block to be matched, and respectively extracting the characteristics of each content block to be matched;
traversing all the content blocks to be matched, performing similarity calculation on the characteristics of each content block to be matched and the first characteristics of the historical target content blocks, and positioning the current target content block;
extracting second characteristics of key contents in the historical target content block and characteristics of the key contents in the current target content block;
and establishing a mapping relation of the key content between the new version pages according to the second characteristics of the key content in the historical target content block and the edit distance of the DOM subtrees corresponding to the characteristics of the key content in the current target content block, and positioning the final key content in the current target content block.
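The subtree traversal and similarity ranking recited in claim 1 can be illustrated with the sketch below. It is a simplification under stated assumptions: the three numeric features, the distance-based similarity, and the function names are hypothetical choices made for brevity, the pages are invented, and xml.etree.ElementTree stands in for an HTML DOM.

```python
import math
import xml.etree.ElementTree as ET


def block_features(page_root, block):
    """Three illustrative features of a subtree, each normalised to [0, 1]."""
    nodes = list(block.iter())
    page_node_count = sum(1 for _ in page_root.iter())
    page_text_length = len("".join(page_root.itertext()))
    block_text_length = len("".join(block.itertext()))
    return [
        len(nodes) / page_node_count,                        # share of page nodes
        sum(1 for e in nodes if e.tag == "a") / len(nodes),  # share of hyperlink nodes
        block_text_length / max(page_text_length, 1),        # share of page text
    ]


def similarity(u, v):
    """Assumed similarity: inverse of Euclidean distance, in (0, 1]."""
    return 1.0 / (1.0 + math.dist(u, v))


def recommend(page_root, target_features, top_n=5):
    """Rank every subtree of the new page against the historical target block."""
    scored = [(similarity(block_features(page_root, block), target_features), block)
              for block in page_root.iter()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [block for _, block in scored[:top_n]]


# Invented old and new page versions; the target block moved and lost its id.
OLD = ("<html><body><div id='news'><a>aa</a><a>bb</a></div>"
       "<p>some other text here</p></body></html>")
NEW = ("<html><body><p>some other text here</p>"
       "<section><div class='n'><a>aa</a><a>bb</a></div></section></body></html>")

old_root = ET.fromstring(OLD)
target = block_features(old_root, old_root.find(".//div[@id='news']"))
candidates = recommend(ET.fromstring(NEW), target)
print(candidates[0].tag, candidates[0].attrib)
```

Returning a ranked list rather than a single block mirrors the recommendation list used in the verification of the positioning process.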
2. The method of claim 1, wherein prior to periodically obtaining the Web pages to be monitored, the method comprises:
receiving the monitoring configuration information of the Web page to be monitored, which is input by a user;
generating a monitoring task for monitoring the page change of the Web page to be monitored according to the monitoring configuration information;
the initialization monitoring task comprises the following steps:
acquiring initial page information of the Web page;
and according to the key content in the monitoring configuration information, obtaining an initial content block containing the key content in the monitoring configuration information in the initial page information.
3. The method of claim 1 or 2, wherein the monitoring configuration information comprises login information of the user and Cookie information for verifying the login information;
the step of receiving the monitoring configuration information of the Web page to be monitored, which is input by the user, includes:
receiving login operation of a user aiming at the Web page, and acquiring login information of the user;
sending the login information to a server corresponding to the Web page;
receiving Cookie information which is returned by the server aiming at the login information and is used for verifying the login information;
the step of periodically acquiring the Web page to be monitored comprises the following steps:
periodically sending the Cookie information and a request for the Web page to a server corresponding to the Web page in an HTTP request header mode;
and receiving the Web page returned by the server aiming at the request.
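Carrying the saved Cookie information in an HTTP request header can be sketched with the Python standard library as follows. The URL, the cookie name and value, and the helper name build_page_request are hypothetical. The request is only constructed here; passing it to urllib.request.urlopen would actually send it.

```python
import urllib.request
from typing import Dict


def build_page_request(url: str, cookies: Dict[str, str]) -> urllib.request.Request:
    """Build a GET request that carries the saved cookies in the Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(url, headers={"Cookie": cookie_header})


# Hypothetical values standing in for the Cookie information saved at login.
request = build_page_request("https://example.com/home", {"SESSIONID": "abc123"})
print(request.get_method(), request.full_url, request.get_header("Cookie"))
```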
4. The method of claim 1, wherein the monitoring configuration information comprises hypertext markup language (HTML) code of the Web page and an XML path language (XPath) corresponding to the current content block;
the method for obtaining the title of the current content block comprises the following steps:
analyzing the HTML codes into corresponding DOM trees;
extracting a current content block CB from the DOM tree according to the XPath corresponding to the current content block;
querying a list CBList of sibling nodes similar to the CB;
obtaining subscript i of the CB in the CBList;
assigning a current content block CB to a loop variable curNode, and starting loop until a title of the current content block is found; wherein the loop method of the loop variable comprises the following steps:
in each circulation, firstly, taking out the leftmost text node TextNode of the curNode as a candidate title node candidate, and acquiring the text content in the text node TextNode;
judging whether the text content meets the condition of being the title of the current content block or not according to the title preset characteristics;
if yes, searching a sibling node list candidates similar to candidate, and returning text content of candidates [ i ] as the title of the current content block;
and if not, taking the father node of the current node in the DOM tree as the candidate, expanding the searching range and continuing to circulate.
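The title search loop of claim 4 can be sketched as below. Several points are assumptions made for illustration: "similar" siblings are taken to be siblings with the same tag and class, the "leftmost text node" is approximated by the first element in document order that carries non-blank text (ElementTree has no separate text nodes), the title test is a simple length rule, and the page is invented.

```python
import xml.etree.ElementTree as ET

MAX_TITLE_LENGTH = 20  # assumed title feature: titles are short


def is_title_like(text):
    text = text.strip()
    return 0 < len(text) <= MAX_TITLE_LENGTH and not text.endswith((".", "!", "?"))


def similar_siblings(node, parent_of):
    """Siblings (including node) sharing its tag and class attribute."""
    parent = parent_of.get(node)
    if parent is None:
        return [node]
    return [s for s in parent
            if s.tag == node.tag and s.get("class") == node.get("class")]


def leftmost_text_element(node):
    """First element in document order whose own text is non-blank."""
    for element in node.iter():
        if element.text and element.text.strip():
            return element
    return None


def find_title(root, content_block):
    parent_of = {child: parent for parent in root.iter() for child in parent}
    block_list = similar_siblings(content_block, parent_of)
    index = next(k for k, s in enumerate(block_list) if s is content_block)
    cur_node = content_block
    while cur_node is not None:
        candidate = leftmost_text_element(cur_node)
        if candidate is not None and is_title_like(candidate.text):
            candidates = similar_siblings(candidate, parent_of)
            chosen = candidates[index] if index < len(candidates) else candidate
            return (chosen.text or "").strip()
        cur_node = parent_of.get(cur_node)  # widen the search to the parent
    return None


# Invented tabbed block: two tab titles followed by two panels.
PAGE = """
<div id="box">
  <ul class="tabs"><li>County news</li><li>Province files</li></ul>
  <div class="panel" id="p1"><a>Notice on the county-level patent funding programme</a></div>
  <div class="panel" id="p2"><a>Provincial measures for intellectual property protection</a></div>
</div>
""".strip()
root = ET.fromstring(PAGE)
print(find_title(root, root.find("./div[@id='p2']")))
```

Using the index of the block among its similar siblings is what lets the second panel pick up the second tab title rather than the first.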
5. The method of claim 1, wherein comparing the title of the current content block to the title of an initial content block obtained from the Web page in an initialization monitoring task, and wherein determining whether the Web page has changed comprises:
comparing the title of the current content block with the title of an initial content block obtained from the Web page in an initialization monitoring task;
if the titles are not empty and equal, determining that no relevant change occurs in the Web page;
if the titles are not empty and are not equal, determining that a wrong node has been located, and outputting a result of detecting the relevant change;
and if the titles are all empty, calculating the semantic similarity between the current content block and the initial content block and the structural similarity of the DOM subtree, comparing the semantic similarity with a preset semantic threshold, comparing the structural similarity with a preset structural threshold, and determining whether the Web page is changed or not according to the comparison result.
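The three branches of claim 5 amount to a small decision function, sketched below. The threshold values are hypothetical placeholders. Two points are assumptions not fixed by the claim: a missing title on only one side is handled like the both-empty case, and in the fallback the page is treated as changed when either similarity falls below its threshold.

```python
def detect_change(initial_title, current_title,
                  semantic_similarity, structural_similarity,
                  semantic_threshold=0.6, structural_threshold=0.6):
    """Return "CHANGED" or "NO_DIFFERENCE" for one monitoring check."""
    if initial_title and current_title:
        # Both titles are available: they decide the outcome on their own.
        return "NO_DIFFERENCE" if initial_title == current_title else "CHANGED"
    # Fallback (titles unavailable): compare similarities with the thresholds.
    if (semantic_similarity < semantic_threshold
            or structural_similarity < structural_threshold):
        return "CHANGED"
    return "NO_DIFFERENCE"


print(detect_change("city and county dynamic", "province file", 0.9, 0.9))
print(detect_change("", "", 0.9, 0.9))
```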
6. The method of claim 5, wherein the step of calculating the semantic similarity between the current content block and the initial content block, comparing the semantic similarity with a preset semantic threshold, and determining whether the Web page is changed according to the comparison result comprises:
respectively extracting all text information in the current content block and the initial content block;
calculating the similarity between all text information of the two content blocks;
comparing the semantic similarity with a preset semantic threshold;
and when the semantic similarity is lower than a preset semantic threshold, determining that the Web page changes, and outputting a result of detecting the relevant change.
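The text comparison of claim 6 can be sketched as follows. The claim does not specify the similarity measure; difflib.SequenceMatcher from the Python standard library is swapped in here purely as a stand-in, and it measures character-level overlap rather than meaning. The threshold and the snippets are hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET

SEMANTIC_THRESHOLD = 0.5  # hypothetical preset threshold


def block_text(block) -> str:
    """All text inside the block, with runs of whitespace collapsed."""
    return " ".join("".join(block.itertext()).split())


def text_similarity(block_a, block_b) -> float:
    """Stand-in similarity in [0, 1] based on matching character runs."""
    return difflib.SequenceMatcher(None, block_text(block_a), block_text(block_b)).ratio()


initial = ET.fromstring("<div><p>alpha</p> <p>beta</p></div>")
unchanged = ET.fromstring("<div><p>alpha</p> <p>beta</p></div>")
replaced = ET.fromstring("<div><p>123</p> <p>456 789</p></div>")

for current in (unchanged, replaced):
    score = text_similarity(initial, current)
    print("CHANGED" if score < SEMANTIC_THRESHOLD else "NO_DIFFERENCE")
```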
7. The method according to claim 5, wherein the step of calculating the structural similarity of the DOM sub-tree of the current content block and the initial content block comprises:
arranging the tags in the respective DOM subtrees of the current content block and the initial content block into character strings in sequence;
and calculating the structural similarity according to the editing distance of the character strings between the two content blocks and the key content in the monitoring configuration information.
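The tag string comparison of claim 7 can be sketched with a standard Levenshtein edit distance over tag sequences. The normalisation of the distance into a similarity between 0 and 1 is an assumption, as are the function names and the example blocks; the role of the key content in the calculation is not modelled here.

```python
import xml.etree.ElementTree as ET


def tag_sequence(block):
    """Tags of the subtree in document (depth-first, pre-order) order."""
    return [element.tag for element in block.iter()]


def edit_distance(a, b):
    """Levenshtein distance between two sequences, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]


def structural_similarity(block_a, block_b):
    """Assumed normalisation: 1 minus distance over the longer sequence."""
    seq_a, seq_b = tag_sequence(block_a), tag_sequence(block_b)
    return 1.0 - edit_distance(seq_a, seq_b) / max(len(seq_a), len(seq_b))


old_block = ET.fromstring("<div><h3/><ul><li/><li/></ul></div>")
new_block = ET.fromstring("<div><h3/><ul><li/><li/><li/></ul></div>")
print(edit_distance(tag_sequence(old_block), tag_sequence(new_block)))
print(structural_similarity(old_block, new_block))
```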
8. The method of claim 7, wherein the step of sequencing tags in respective DOM subtrees of the current content block and the initial content block into strings further comprises:
arranging the tags in the respective DOM subtrees of the compressed current content block and the compressed initial content block into character strings in sequence;
wherein the step of compressing the content block comprises:
assigning a content block CB to a content block CCB subjected to list item compression, removing all child nodes in a child node list children of the CCB, and initializing a new child node list cchildren to be calculated into an empty list;
performing depth-first traversal on the content block, and performing list item compression operation on each child node of the content block in sequence recursively to obtain compressed child nodes cchild;
searching whether a child node similar to the cchild structure exists in the cchildren;
when a child node similar to the cchild structure is not found, adding the compressed cchild into the cchildren;
the cchildren is assigned as children for the CCB and the CCB is returned.
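The list item compression of claim 8 can be sketched as a recursive copy that keeps only the first of any group of structurally similar children. The assumption here is that "similar structure" means an identical tag structure after compression; the helper names and the example lists are hypothetical.

```python
import xml.etree.ElementTree as ET


def signature(node):
    """Tag structure of a subtree, ignoring text and attributes."""
    return (node.tag, tuple(signature(child) for child in node))


def compress(block):
    """Return a copy of block in which repeated similar children are dropped."""
    compressed = ET.Element(block.tag, dict(block.attrib))
    compressed.text = block.text
    seen = set()
    for child in block:
        cchild = compress(child)      # compress the child first (depth-first)
        child_signature = signature(cchild)
        if child_signature not in seen:
            seen.add(child_signature)
            compressed.append(cchild)
    return compressed


# Two invented lists that differ only in the number of entries.
short_list = ET.fromstring("<ul>" + "<li><a>x</a></li>" * 4 + "</ul>")
long_list = ET.fromstring("<ul>" + "<li><a>x</a></li>" * 8 + "</ul>")

print([e.tag for e in compress(short_list).iter()])
print([e.tag for e in compress(long_list).iter()])
```

After compression the two lists yield the same tag sequence, so a difference in the number of list entries alone no longer affects the structural comparison.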
9. The method of claim 1, wherein the structural features are obtained from the structure of the DOM subtree corresponding to each historical content block, and comprise:
the ratio of the height of the DOM subtree to the height of the DOM tree corresponding to the entire page;
the ratio of the number of nodes of the DOM sub-tree to the number of nodes of the DOM tree corresponding to the entire page;
the ratio of picture nodes in the DOM sub-tree;
the ratio of text nodes in the DOM sub-tree;
the ratio of hyperlink nodes in the DOM sub-tree;
the text features are obtained from HTML code of each historical content block, including:
a title of the content block;
text contained in the content block;
a ratio of a text length contained by the content block to a total text length of the page;
a common prefix of a hyperlink in a content block.
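The five structural features of claim 9 can be computed as in the sketch below. It is illustrative only: the feature names are hypothetical, xml.etree.ElementTree has no separate text nodes, so elements carrying non-blank text are counted as a proxy for text nodes, and the page is invented.

```python
import xml.etree.ElementTree as ET


def height(node) -> int:
    """Number of levels in the subtree rooted at node (a leaf has height 1)."""
    return 1 + max((height(child) for child in node), default=0)


def structural_features(page_root, block) -> dict:
    block_nodes = list(block.iter())
    page_node_count = sum(1 for _ in page_root.iter())

    def share(predicate) -> float:
        return sum(1 for e in block_nodes if predicate(e)) / len(block_nodes)

    return {
        "height_ratio": height(block) / height(page_root),
        "node_ratio": len(block_nodes) / page_node_count,
        "image_ratio": share(lambda e: e.tag == "img"),
        "text_ratio": share(lambda e: bool(e.text and e.text.strip())),  # proxy
        "link_ratio": share(lambda e: e.tag == "a"),
    }


PAGE = ("<html><body><div id='b'><img/><a>x</a><p>y</p></div>"
        "<div/></body></html>")
root = ET.fromstring(PAGE)
features = structural_features(root, root.find(".//div[@id='b']"))
print(features)
```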
10. The method of claim 1, wherein the second feature is obtained from a leaf node of a page DOM tree corresponding to key content within the historical target content block, and comprises:
a label of the node;
the length of the node text;
a data mode of the node text;
the id and class attributes of the node.
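The leaf node features of claim 10 can be collected as in the sketch below. The "data mode" of the node text is read here as a coarse data pattern; the pattern categories (date, number, text, empty) and the regular expressions are assumptions chosen for illustration, as are the function names.

```python
import re
import xml.etree.ElementTree as ET


def data_pattern(text: str) -> str:
    """Classify text into a coarse, assumed set of patterns."""
    text = text.strip()
    if not text:
        return "empty"
    if re.fullmatch(r"\d{4}-\d{1,2}-\d{1,2}", text):
        return "date"
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
        return "number"
    return "text"


def leaf_features(node) -> dict:
    text = (node.text or "").strip()
    return {
        "tag": node.tag,
        "text_length": len(text),
        "data_pattern": data_pattern(text),
        "id": node.get("id"),
        "class": node.get("class"),
    }


leaf = ET.fromstring("<span class='date'>2019-12-05</span>")
print(leaf_features(leaf))
```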
CN201911236209.7A 2019-12-05 2019-12-05 Key content positioning method Active CN111079043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911236209.7A CN111079043B (en) 2019-12-05 2019-12-05 Key content positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911236209.7A CN111079043B (en) 2019-12-05 2019-12-05 Key content positioning method

Publications (2)

Publication Number Publication Date
CN111079043A true CN111079043A (en) 2020-04-28
CN111079043B CN111079043B (en) 2023-05-12

Family

ID=70313188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911236209.7A Active CN111079043B (en) 2019-12-05 2019-12-05 Key content positioning method

Country Status (1)

Country Link
CN (1) CN111079043B (en)

Also Published As

Publication number Publication date
CN111079043B (en) 2023-05-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant