CN111079043B - Key content positioning method - Google Patents


Publication number
CN111079043B
Authority
CN
China
Prior art keywords
content
content block
page
web page
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911236209.7A
Other languages
Chinese (zh)
Other versions
CN111079043A (en
Inventor
易超
徐经纬
张舒汇
贺赞贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shulide Technology Co ltd
Original Assignee
Beijing Shulide Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shulide Technology Co ltd
Priority to CN201911236209.7A
Publication of CN111079043A
Application granted
Publication of CN111079043B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system

Abstract

The invention provides a key content positioning method comprising a Web page dynamic evolution monitoring stage and a key content positioning stage. In the monitoring stage, the Web page to be monitored is acquired periodically; the nodes corresponding to the key content in preset monitoring configuration information are located in the Document Object Model (DOM) tree of the Web page, and whether the Web page has changed is determined from those nodes. In the key content positioning stage, the Web page content acquired in the monitoring stage is stored as historical data of the old version page for subsequent positioning; the text, structure and other features of the key content in the old version page are extracted and fused, and the content block and then the key content are located in turn in the new version page. Finally, the mapping between the key content in the new and old version pages, and the change in how its elements are located, are presented visually to assist a developer in repairing the existing system integration scheme.

Description

Key content positioning method
Technical Field
The invention relates to the technical field of Web application, in particular to a key content positioning method.
Background
Web applications often undergo dynamic changes in page structure because of upgrades, vulnerability fixes, user experience optimizations, and the like. Such changes can affect the interfacing between associated systems. For example, when two systems are interfaced directly by embedding pages, a change in one system may make the other unavailable; similarly, when public opinion monitoring is accomplished through data collection, a change in page structure may make it impossible to keep obtaining the correct public opinion content.
Demand for docking between Web applications is growing, and integration at the presentation layer of Web applications is currently an efficient approach because it is low-cost and non-intrusive. However, Web applications often evolve dynamically, and the resulting changes to page structure may render existing integration schemes ineffective. Moreover, because such changes are unpredictable, there is no effective early warning for them, and they cannot be detected in time. Therefore, how to actively discover changes to Web application pages in time, relocate the key content after a change, and assist the developer in making repairs so that the system continues to operate correctly has become a problem that must be considered when integrating Web applications.
The above problem can naturally be considered in two steps. The first step is change monitoring, namely checking the Web pages to be monitored periodically to determine whether the page structure has changed. The second step is content positioning, namely locating the required key content in the new version page according to the features of the key content once a change has been detected. However, these two steps face the following challenges: 1) Web pages change frequently and in many ways, and some Web pages can be accessed only after a series of preliminary operations such as logging in and clicking, which makes detecting changes difficult; 2) the features of the key content are difficult to obtain directly from the HTML code of the Web page and require some inference and computation; 3) the key content is typically text data in the Web page, which usually corresponds to leaf nodes of the DOM tree, so relatively few features can be extracted from it, which makes locating the key content challenging.
Disclosure of Invention
The invention provides a key content positioning method for overcoming the technical problems.
In order to solve the above problems, the present invention discloses a key content locating method, comprising:
periodically acquiring a Web page to be monitored;
locating corresponding nodes in a Document Object Model (DOM) tree of the Web page according to key content in preset monitoring configuration information;
when the corresponding node cannot be located, determining that the Web page has changed;
when the corresponding node is located, obtaining from the Web page a current content block containing the key content in the monitoring configuration information, comparing the title of the current content block with the title of an initial content block obtained from the Web page when the monitoring task was initialized, and determining whether the Web page has changed;
storing the changed Web page as historical version page data;
acquiring all historical content blocks of the historical version page data, and extracting structural features and text features from each historical content block;
fusing the structural features and text features of all historical content blocks, locating in turn a historical target content block and the key content in the historical target content block, and determining a first feature of the historical target content block;
for a newly monitored new version Web page that has changed, acquiring all subtrees of the DOM tree of the new version Web page, taking each subtree as a content block to be matched, and extracting the features of each content block to be matched;
traversing all the content blocks to be matched, calculating the similarity between the features of each content block to be matched and the first feature of the historical target content block, and locating the current target content block;
extracting a second feature of the key content in the historical target content block and a feature of the key content in the current target content block;
and establishing a mapping relation between the key content and the new version page according to the edit distance between the DOM subtrees corresponding to the second feature of the key content in the historical target content block and to the feature of the key content in the current target content block, and locating the final key content in the current target content block.
Compared with the prior art, the invention has the following advantages:
The invention provides a key content positioning method comprising a Web page dynamic evolution monitoring stage and a key content positioning stage. For the monitoring stage, a Web page change monitoring method is provided that is based on the DOM tree structure and combines login state maintenance, title identification, and semantic and structural similarity; a developer or operation and maintenance person registers a page change monitoring task by configuring the URL of the Web page to be monitored and the key content to be acquired, and then starts monitoring of the page. For the key content positioning stage, a multi-modal feature fusion and stepwise progressive technique for locating key content in a page is provided: the text, structure and other features of the key content in the old version page are extracted and fused, the content block and then the key content are located in turn in the new version page, and finally the mapping between the key content in the new and old version pages and the change in how its elements are located are visualized, so as to assist a developer in repairing the existing system integration scheme.
Drawings
FIG. 1 is a flow chart of the steps of a key content location method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an example of the relationship between key content and a content block;
FIG. 3a is a schematic diagram of an old version page of a Web application;
FIG. 3b is a schematic diagram of a new version of a page after a change to a Web application;
FIG. 3c is a schematic diagram of a DOM tree for an old version page of a Web application;
FIG. 3d is a schematic diagram of a DOM tree for a page of a new version of a Web application;
FIG. 4 is an exemplary diagram of subtle differences between content block lists;
FIG. 5 is a schematic diagram of the overall architecture of a Web page change monitoring system;
FIG. 6a is a schematic diagram of an example p_1 of the CNKI (中国知网) home page;
FIG. 6b is a schematic diagram of an example p_7 of the CNKI home page;
FIG. 6c is a schematic diagram of an example p_17 of the CNKI home page;
FIG. 7a is a diagram comparing the structures of the CNKI home pages p_1 and p_7;
FIG. 7b is a diagram comparing the structures of the CNKI home pages p_1 and p_17;
FIG. 7c is a schematic diagram of the CNKI login and home page interfaces;
FIG. 7d is a schematic diagram of the CNKI system operation log;
FIG. 7e is a schematic diagram of a key content positioning result.
Detailed Description
In order that the above objects, features and advantages of the present invention may be more readily understood, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Owing to the complexity and diversity of Web applications, the invention faces three challenges in realizing dynamic evolution monitoring and key content positioning for Web pages:
(1) Complexity of the Web application itself
One aspect is the complexity of the application itself. A Web application typically contains multiple pages, some of which cannot be accessed directly via a URL and instead require a series of preliminary operations, such as logging in or clicking several times. For example, to reach a page of a government service network that needs to be monitored for changes, the user must first enter a social credit code on the first page and click log in, and then click a "new" button on the second page, before finally reaching the target page. This makes the dynamic evolution monitoring of the page difficult.
Another aspect is the complexity of Web page changes. Web pages change frequently and in many ways; among these changes, the present invention focuses on changes in the DOM tree structure or layout of the page. Structural changes can be further divided into two categories according to whether they cause docking between Web applications to fail: if a change in page structure causes the existing Web application docking to fail, it is called a relevant change; otherwise, it is an irrelevant change. The present invention is concerned only with relevant changes. Changes to the DOM tree structure of a Web page are often easily observed by users; however, it is difficult to distinguish relevant changes from irrelevant ones on that basis alone, since both alter the structure of the page. Therefore, how to accurately distinguish relevant changes from irrelevant ones becomes a problem to be solved in the page dynamic evolution monitoring process.
(2) Concealment of features
In order to locate the key content in the changed page, the invention needs to find features of the key content that remain consistent before and after the change. For a Web page, the most directly available resource is typically the HTML code of the page. However, HTML code relates only to how the page is displayed and may differ greatly between the new and old versions of the page, so information that can be obtained directly from the HTML code, such as tags and attributes, generally cannot meet this requirement. By contrast, the key content usually carries certain semantic information, such as titles or prompt text, and has a certain presentation form, for example a combination of several types of elements, including pictures, text and hyperlinks, arranged according to some rule. This information is more likely to be consistent between the new and old versions of the page and thus meets the requirement for features. However, such information is often not directly available and requires some computation, which makes the features difficult to acquire.
(3) Sparsity of features
The key content to be obtained is generally text information displayed in the page, such as the title or author of an article. Such text generally corresponds to leaf nodes in the DOM tree of the page, which have a relatively simple structure and contain relatively little information, so the prior art cannot extract enough features from them, and the key content is therefore difficult to locate in the changed page.
In view of the above challenges, FIG. 1 shows a flowchart of the steps of a key content positioning method according to an embodiment of the present invention, which may specifically include the following steps:
step S101, regularly acquiring a Web page to be monitored;
In order to periodically acquire a Web page to be monitored, the first step of the embodiment of the present invention is to establish a monitoring task for the Web page, which specifically includes the following steps:
step 1: receiving the monitoring configuration information, input by a user, for the Web page to be monitored;
step 2: generating, according to the monitoring configuration information, a monitoring task for monitoring page changes of the Web page to be monitored.
In various embodiments of the present invention, the monitoring configuration information may be configured by operation and maintenance personnel and includes the URL of the Web page to be monitored. If the monitored Web application requires login, a login operation is also performed at this point to record the login state information.
Within the system where a Web page resides, each data interface has a separate configuration page. Therefore, the configuration for monitoring page changes according to the embodiment of the invention can be added to that original configuration page as an extension; of course, a configuration page for page change monitoring may also be provided separately. In the extension approach, the configuration page offers a source page monitoring switch that turns change monitoring of the source page corresponding to the API on or off. When the function is turned on, the configuration items for monitoring page changes appear, including the URL of the source page, the XPath of the key content in the page, and the monitoring frequency (which determines the period referred to in the present invention). The configuration also includes a switch indicating whether the source system requires login. If the source Web system requires login, the switch is turned on and pre-login is configured: the pre-login configuration in the embodiment of the invention comprises the URL of the system login interface and a "pre-login" button, so the user only needs to configure the URL of the login interface and click "pre-login". After configuration is completed, clicking the save button at the upper right completes the monitoring configuration of the Web page.
Step 2 models the user's configuration information to generate a corresponding monitoring task object, using a monitor task class to model the monitoring task. After the monitoring task is generated, it must be stored so that it can be read later. MongoDB may be used as the storage scheme for monitoring tasks, with all tasks stored in one Collection. MongoDB is a NoSQL database without schema restrictions, so the data structure can be adjusted conveniently. After the task is stored, it must be scheduled and executed at the configured interval. The scheduled task framework provided by SpringBoot is used to perform the timed scheduling of the monitoring task.
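As an illustration only, the configuration items described above could be modeled as a simple task object before being stored. The sketch below is written in Python for brevity rather than in the SpringBoot stack the embodiment mentions; the class name MonitorTask, all field names, and the example URL are hypothetical, and to_document merely produces a plain dictionary of the kind a schemaless document store could hold.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class MonitorTask:
    """Hypothetical model of one page change monitoring task."""
    page_url: str                    # URL of the source page to monitor
    key_content_xpath: str           # XPath of the key content in the page
    interval_seconds: int            # monitoring frequency
    login_required: bool = False     # the "source system needs login" switch
    login_url: Optional[str] = None  # URL of the login interface, if any
    cookie: Optional[str] = None     # Cookie recorded by the pre-login step

    def __post_init__(self):
        # Reject configurations that could never be scheduled or logged in.
        if self.interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        if self.login_required and not self.login_url:
            raise ValueError("login_url is required when login_required is set")

    def to_document(self) -> dict:
        """Return a plain dict suitable for a schemaless document store."""
        return asdict(self)


task = MonitorTask(
    page_url="https://example.com/news",  # placeholder URL
    key_content_xpath="//*[@id='newsrt1_1']/ul/li/a",
    interval_seconds=3600,
)
print(task.to_document())
```

Validating in __post_init__ keeps an incomplete pre-login configuration from ever reaching the scheduler, which is one possible design choice rather than something the embodiment prescribes.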
In a specific implementation, the monitoring task is first executed once to obtain the initial Web page content, which serves as the basis for later judging whether the page has changed. Thereafter, the monitoring method of the task is scheduled and executed periodically. Specifically, initializing the monitoring task includes:
step 3: acquiring initial page information of the Web page;
step 4: obtaining, from the initial page information, an initial content block containing the key content in the preset monitoring configuration information. Based on observation and analysis of Web application pages, the inventors found that a Web page generally displays different content in blocks. Thus, in the embodiments of the present invention, each such block is referred to as a content block. The content in one content block generally shares the same semantic context and expresses similar meaning, and the key content to be acquired from a page is mostly concentrated in one or a few content blocks. Introducing the content block makes it possible, when monitoring the dynamic evolution of a Web page, to combine the structure of the content block with its semantic information (such as its title) and thus judge page changes more accurately. FIG. 2 shows an example of the relationship between key content and a content block.
The page acquisition in step 3 obtains the content of the target Web page mainly through HTTP requests; sending the HTTP requests can be implemented with the httpfile library provided by SpringBoot and the HTTP protocol library provided by Apache.
In actual monitoring, some Web applications restrict user access, and certain pages can be accessed only after logging in. For pages hidden behind a login operation, if login is skipped and the content is requested directly through the URL of the page, the access control mechanism of the system usually redirects to the login page, so the required page content cannot be obtained and it is impossible to check whether the page has changed. How to handle such login-required scenarios is therefore a problem the method must address. To this end, in a preferred embodiment of the present invention, the monitoring configuration information includes login information of the user and Cookie information for verifying the login information.
Step 1 further comprises the following steps:
receiving a login operation of the user for the Web page, and obtaining the login information of the user;
sending the login information to the server corresponding to the Web page;
receiving the Cookie information returned by the server for the login information and used for verifying the login information.
Thus, for a Web page that can only be accessed after logging in, step S101 may further include: periodically sending the Cookie information, in the form of an HTTP request header, to the server corresponding to the Web page together with the request for acquiring the Web page; and receiving the Web page returned by the server for the request. In this way, the login state can be maintained using a heartbeat-based periodic Session refresh technique, so that the Web page can be acquired to the greatest possible extent and the subsequent flow for the Web page can proceed.
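A minimal sketch of carrying the stored Cookie in the request header is given below, using Python's standard urllib.request rather than the libraries named in the embodiment. The function names, the URL and the Cookie value are placeholders introduced for illustration, and the periodic scheduling itself is left out.

```python
import urllib.request


def build_page_request(page_url: str, cookie: str) -> urllib.request.Request:
    """Attach the stored login Cookie to the page request as an HTTP header."""
    return urllib.request.Request(page_url, headers={"Cookie": cookie})


def fetch_page(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the page body (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder URL and Cookie value; nothing is sent until fetch_page is called.
request = build_page_request("https://example.com/protected", "JSESSIONID=abc123")
print(request.get_header("Cookie"))
```

Calling fetch_page(request) at the configured interval would both retrieve the page and, on servers that extend the session on each request, act as the heartbeat that keeps the login state alive.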
Step S102, positioning corresponding nodes in a Document Object Model (DOM) tree of the Web page according to key contents in preset monitoring configuration information;
First, all subtrees of the DOM tree of the new version page obtained by the current monitoring task are acquired, each subtree is taken as a content block to be matched, and the features of each are extracted. Then, similarity matching is performed against the set of content blocks in the new version page according to the key content in the preset monitoring configuration information, so as to locate the content block in the page that contains the key content. This content block corresponds to the corresponding node in step S102. It should be noted that step S102 is implemented using existing tree node positioning, that is, the process of locating required nodes in a given tree structure, which is generally either based on node attributes or based on node paths. XPath is based on the path of the node in the tree: it arranges in order the tags of all nodes on the path from the root node to the target node, adds to each node its position among sibling nodes of the same tag at the same level, and finally joins the segments with "/" to obtain the XPath of the node.
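The path-based construction just described can be sketched as follows. This is an illustrative Python version using the standard xml.etree.ElementTree parser on a small well-formed snippet; the function name build_xpath and the sample markup are assumptions, and real HTML would need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET


def build_xpath(root, target):
    """Build a positional XPath from the root node down to the target node."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    segments = []
    node = target
    while node is not None:
        parent = parents.get(node)
        if parent is None:
            index = 1  # the root has no siblings
        else:
            # Position among siblings that share the same tag (1-based).
            same_tag = [child for child in parent if child.tag == node.tag]
            index = same_tag.index(node) + 1
        segments.append(f"{node.tag}[{index}]")
        node = parent
    return "/" + "/".join(reversed(segments))


PAGE = "<html><body><div/><div><ul><li/><li><a/></li></ul></div></body></html>"
root = ET.fromstring(PAGE)
target = root.find("./body/div[2]/ul/li[2]/a")
print(build_xpath(root, target))
```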
Step S103, when the corresponding node cannot be located, determining that the Web page changes;
Since the key content is set based on the initial page information of the Web page, when the system cannot find the corresponding node, it can be concluded that the page has changed.
However, judging whether the page has undergone a relevant change solely by whether the designated key content node can be located in the page leaves open the case where a wrong node is located. To solve this problem, the present invention starts from the title of the content block. The problem is described in detail below with an example.
Take the new and old version pages of a Web application shown in FIG. 3 as an example. FIG. 3a is the old version page, and the key content is assumed to be the "city-county dynamics" list of entries. FIG. 3c is the DOM tree of the page at this point, from which it can be seen that the XPath corresponding to the city-county dynamics entry list is //*[@id='newsrt1_1']/ul/li/a. FIG. 3b is the new version page after the change, in which two tabs, "national documents" and "province documents", have been added before the original three tabs "comprehensive headlines", "city-county dynamics" and "domestic news". FIG. 3d is the DOM tree of the changed page; if the key content is still located by the XPath given previously, what is actually located is the list of entries for "province documents". This is obviously a relevant change, but it cannot be found if the only criterion is whether the specified key content can be located in the page.
To solve this problem, the inventors observed from the figures that the title information corresponding to the two content blocks differs significantly: the content block in the old version page is titled "city-county dynamics", whereas in the new version page it has become "province documents". Therefore, if the title corresponding to a content block can be determined, then for the above change, comparing the titles of the content blocks found in the two pages reveals that "city-county dynamics" and "province documents" are not identical, and it can be determined that the page has undergone a relevant change. However, such title information is usually not specifically marked in the HTML code of the page, and the embodiments of the present invention therefore propose the following method:
when the corresponding node is located, judging whether a wrong node has been located, which comprises the following step:
step S104, obtaining from the Web page a current content block containing the key content, comparing the title of the current content block with the title of an initial content block obtained from the Web page when the monitoring task was initialized, and determining whether the Web page has changed;
specifically, the monitoring configuration information comprises the hypertext markup language (HTML) code of the Web page and the XML path language (XPath) expression corresponding to the current content block;
The method for obtaining the title of the current content block comprises the following steps:
step 5: parsing the HTML code into the corresponding DOM tree;
step 6: extracting the current content block CB from the DOM tree according to the XPath corresponding to the current content block;
step 7: finding the list CBList of sibling nodes similar to CB;
step 8: obtaining the index i of CB in CBList;
step 9: assigning the current content block CB to a loop variable curNode, and looping until the title of the current content block is found; each iteration of the loop proceeds as follows:
first taking the leftmost text node TextNode of curNode as the candidate title node candidate, and obtaining its text content;
judging, according to preset title characteristics, whether the text content qualifies as the title of the current content block;
if yes, finding the list candidates of sibling nodes similar to candidate, and returning the text content of candidates[i] as the title of the current content block;
if not, taking the parent node of curNode in the DOM tree as the new curNode, thereby expanding the search range, and continuing the loop. Further, the loop exits when the search range reaches a preset stop condition; if the title of the current content block has not been found in the loop, an empty string is returned. The preset stop condition here is that the search range has been expanded to the entire Web page.
Steps 5 to 9 are one possible method for obtaining the title of the current content block, based on the following findings of the inventors: the layout of a Web page is typically "title + content", so the title node of a content block is usually the first child of its parent node or the leftmost descendant of one of its ancestor nodes, and in most cases its tag is one of h1 to h6 (first-level through sixth-level headings). Nodes whose content is a word such as "more" sometimes appear near the title node. In addition, the title text normally does not exceed 10 Chinese characters and does not contain punctuation or digits.
Based on these findings, the list of sibling nodes similar to CB is looked up because the "title + content" presentation of a Web page falls into two cases. In the first case, a single title is immediately followed by the actual content. In the second case, a list of titles comes first, followed by a list of the specific contents corresponding to each title. In the latter case, it is necessary to determine which item in the title list is the title corresponding to the specified content block; therefore, in the embodiment of the present invention, looking up the sibling nodes similar to the current content block allows both page layouts to be handled in the same way.
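The title search of steps 5 to 9 can be sketched as below. This is an illustrative Python rendering under several assumptions that are not stated in the source: "similar sibling" is approximated as "sibling with the same tag", only an element's own leading text is considered (tail text is ignored), the candidate itself is used when index i is out of range, and the page is a small well-formed snippet parsed with xml.etree.ElementTree. All function names and the sample markup are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET


def looks_like_title(text, max_len=10):
    """Title heuristic: short, with no digits and no punctuation."""
    text = text.strip()
    if not text or len(text) > max_len:
        return False
    return not any(ch.isdigit() or unicodedata.category(ch).startswith("P")
                   for ch in text)


def similar_siblings(node, parents):
    """Assumption: siblings are 'similar' when they share the node's tag."""
    parent = parents.get(node)
    if parent is None:
        return [node]
    return [child for child in parent if child.tag == node.tag]


def leftmost_text_element(node):
    """First element in document order inside node that carries its own text."""
    for element in node.iter():
        if element.text and element.text.strip():
            return element
    return None


def find_title(root, content_block):
    parents = {child: parent for parent in root.iter() for child in parent}
    i = similar_siblings(content_block, parents).index(content_block)
    cur_node = content_block
    while cur_node is not None:  # stops once the whole page has been searched
        candidate = leftmost_text_element(cur_node)
        if candidate is not None and looks_like_title(candidate.text):
            candidates = similar_siblings(candidate, parents)
            chosen = candidates[i] if i < len(candidates) else candidate
            return (chosen.text or "").strip()
        cur_node = parents.get(cur_node)  # widen the search to the parent
    return ""


PAGE = """
<html><body><div>
  <ul><li>Headlines</li><li>City news</li><li>Domestic</li></ul>
  <div id="c0"><a>2019-12-01: Budget approved.</a></div>
  <div id="c1"><a>2019-12-02: Road repaired.</a></div>
  <div id="c2"><a>2019-12-03: Rates held.</a></div>
</div></body></html>
""".strip()
root = ET.fromstring(PAGE)
block = root.find(".//div[@id='c1']")
print(find_title(root, block))
```

The sample page uses the second layout (a list of titles followed by a list of content blocks), so the index of the block among its siblings is what selects the matching tab title.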
In the embodiment of the present invention, determining whether the Web page has changed according to the comparison result may include the following cases:
A: if the titles are not empty and are equal, determining that the Web page has no relevant change;
B: if the titles are not empty and are not equal, judging that a wrong node has been located, and outputting a result indicating that a relevant change has been detected;
C: if the titles are empty, calculating the semantic similarity between the current content block and the initial content block and the structural similarity of their DOM subtrees, comparing the semantic similarity with a preset semantic threshold and the structural similarity with a preset structural threshold, and determining whether the page has changed according to the comparison results.
In actual monitoring, when a content block without a title is encountered, the title can no longer be used to judge whether a wrong node has been located, and hence whether the Web page has changed. Therefore, the embodiment of the invention provides the judging method of case C, which combines the semantic similarity of the text of the content block with its structural similarity.
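The three cases can be summarized in a short decision function. The sketch below is illustrative only: the function name, the default thresholds of 0.5, the choice to report a change when either similarity falls below its threshold, and the treatment of a single empty title as case C are all assumptions, since the text does not fix them.

```python
def detect_relevant_change(initial_title, current_title,
                           semantic_similarity, structural_similarity,
                           semantic_threshold=0.5, structural_threshold=0.5):
    """Return True when a relevant change is detected."""
    if initial_title and current_title:
        # Case A (equal titles: no relevant change) and
        # case B (different titles: a wrong node was located).
        return initial_title != current_title
    # Case C: no usable title, so fall back on the two similarity scores.
    return (semantic_similarity < semantic_threshold
            or structural_similarity < structural_threshold)


print(detect_relevant_change("City news", "Province docs", 1.0, 1.0))
print(detect_relevant_change("", "", 0.9, 0.9))
```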
In specific implementation, the step of calculating the semantic similarity between the current content block and the initial content block, comparing the semantic similarity with a preset semantic threshold, and determining whether the Web page changes according to the comparison result may include:
C1: extracting all text information in the current content block and the initial content block respectively;
c2: calculating the similarity between all text information of the two content blocks;
and C3: comparing the semantic similarity with a preset semantic threshold;
and C4: and when the semantic similarity is lower than a preset semantic threshold, determining that the Web page changes, and outputting a result of detecting the related change.
There are many well-established methods for calculating the semantic similarity of the text in a content block, for example by means of trained corpus models, which are not described in detail here. It should be noted that the determination rule depends on how the preset semantic threshold is defined, and C4 is only one example of the present invention. That is, in another possible implementation, C4 may also be: when the semantic similarity is higher than a preset semantic threshold, determining that the Web page has changed.
For case C, in embodiments of the present invention, the structural similarity of the content blocks may be computed from the edit distance or the alignment distance between the DOM subtrees. The structural similarity is then compared with a preset structural threshold, and whether the page has changed is determined from the comparison result. As with C4, the determination rule depends on how the preset structural threshold is defined.
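The decision logic of cases A to C can be sketched as follows. This is only an illustration: the function name `page_changed`, the default threshold values, the treatment of a single missing title as case C, and the rule that either similarity falling below its threshold counts as a change are all assumptions, since the text leaves the combination rule open.

```python
from typing import Optional


def page_changed(initial_title: Optional[str],
                 current_title: Optional[str],
                 semantic_sim: float,
                 structural_sim: float,
                 semantic_threshold: float = 0.8,     # assumed value
                 structural_threshold: float = 0.8):  # assumed value
    """Return True if a relevant change is reported for the content block."""
    if initial_title and current_title:
        # Case A (equal titles: no change) and case B (different titles:
        # a wrong node was located, so a change is reported).
        return initial_title != current_title
    # Case C: no usable title, fall back to the two similarity scores.
    # Assumption: a change is reported if either score is below its threshold.
    return (semantic_sim < semantic_threshold
            or structural_sim < structural_threshold)


print(page_changed("News", "News", 0.2, 0.2))      # case A
print(page_changed("News", "Notices", 1.0, 1.0))   # case B
print(page_changed(None, None, 0.5, 0.9))          # case C
```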
The edit distance between trees extends the edit distance between sequences: one tree is transformed into another through operations such as insertion, deletion, and relabeling. The smaller the edit distance, the higher the similarity between the trees.
Formally, for a rooted tree T, if each node of T is assigned a symbol from a finite alphabet Σ, the tree is called a labeled tree. Furthermore, if a left-to-right order is given for each set of sibling nodes in T, the tree is called an ordered tree. The operations on such an ordered tree are defined as follows:
relabel: change the label of a node in T;
delete: delete a non-root node and make the children of the deleted node children of its parent;
insert: the inverse of delete, that is, insert a node as a child of an existing node and make it the parent of a consecutive subsequence of that node's children.
According to the constraints placed on the operations, the tree edit distance problem can be subdivided into different sub-problems: the first is the general tree edit distance, with no restrictions on the operations; the second is the tree alignment distance, in which all insert operations must precede the delete operations.
Let T be a rooted tree, T(v) the subtree of T rooted at node v, and θ the empty tree. A set of trees is a forest, denoted F; it is an ordered forest if the order of the trees in F is given. F(v) denotes the forest formed by the subtrees rooted at the children of node v. The labels of the nodes in T come from a finite alphabet $\Sigma$, $\lambda \notin \Sigma$ is a special blank symbol, and $\Sigma_\lambda = \Sigma \cup \{\lambda\}$. $\gamma: (\Sigma_\lambda \times \Sigma_\lambda) \setminus \{(\lambda, \lambda)\} \to \mathbb{R}$ is a distance function on pairs of labels that satisfies the triangle inequality.
If each of the above operations is assigned a cost, the algorithm seeks a sequence of operations that transforms one tree into the other at minimum cost. The embodiment of the present invention describes only the most basic form of the algorithm; improved variants follow similar ideas.
Formally, let $(l_1 \to l_2)$ denote an edit operation on a tree, where $(l_1, l_2) \in (\Sigma_\lambda \times \Sigma_\lambda) \setminus \{(\lambda, \lambda)\}$. Then $l_2 = \lambda$ denotes the deletion of a node, $l_1 = \lambda$ denotes the insertion of a node, and otherwise the operation is a relabeling. The cost of each edit operation is $\gamma(l_1 \to l_2) = \gamma(l_1, l_2)$, and the cost of an entire edit sequence $S = s_1, \ldots, s_k$ is

$$\gamma(S) = \sum_{i=1}^{k} \gamma(s_i).$$

The edit distance $\delta(T_1, T_2)$ can then be defined as:

$$\delta(T_1, T_2) = \min\{\gamma(S) \mid S \text{ is a sequence of edit operations transforming } T_1 \text{ into } T_2\}.$$

The definition extends easily to forests: $\delta(F_1, F_2)$ denotes the edit distance between forest $F_1$ and forest $F_2$. In this setting, the root node of a tree may be deleted, and several trees may be merged by adding a new root node. Let $F - v$ denote the forest obtained by deleting node $v$ from forest $F$, and $F - T(v)$ the forest obtained by deleting the subtree rooted at $v$ from $F$. Following the idea of dynamic programming, and taking $v$ and $w$ to be the rightmost roots of $F_1$ and $F_2$ respectively, the following recurrence can be derived:

$$\delta(\theta, \theta) = 0$$
$$\delta(F_1, \theta) = \delta(F_1 - v, \theta) + \gamma(v \to \lambda)$$
$$\delta(\theta, F_2) = \delta(\theta, F_2 - w) + \gamma(\lambda \to w)$$
$$\delta(F_1, F_2) = \min \begin{cases} \delta(F_1 - v, F_2) + \gamma(v \to \lambda) \\ \delta(F_1, F_2 - w) + \gamma(\lambda \to w) \\ \delta(F_1(v), F_2(w)) + \delta(F_1 - T(v), F_2 - T(w)) + \gamma(v \to w) \end{cases}$$

From this, $\delta(F_1, F_2)$ can be computed, and thereby $\delta(T_1, T_2)$. The above algorithm is merely an example of the idea behind this class of algorithms, with a complexity of $O(|F_1|^2 |F_2|^2)$. The lowest complexity of the presently known algorithms for this class of problems reaches $O(|T_1||T_2|)$. The foregoing ideas all belong to the prior art; they are cited here only to facilitate understanding by those skilled in the art and are not described further.
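The recurrence above can be written down almost literally as a memoized recursion. The following is a minimal sketch rather than an efficient implementation. It assumes unit costs for γ (insert and delete cost 1, relabel costs 1 when the labels differ), represents a tree as a `(label, children)` tuple and a forest as a tuple of trees, and always processes the rightmost root of each forest. The names `cost`, `forest_distance` and `tree_distance` are illustrative.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. None stands for the blank symbol lambda.


def cost(a, b):
    """Assumed unit-cost gamma: 0 for equal labels, otherwise 1."""
    return 0 if a == b else 1


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    if not f1 and not f2:
        return 0
    if not f2:
        # Delete the rightmost root v; its children take its place (F1 - v).
        v_label, v_children = f1[-1]
        return forest_distance(f1[:-1] + v_children, ()) + cost(v_label, None)
    if not f1:
        # Insert the rightmost root w of F2.
        w_label, w_children = f2[-1]
        return forest_distance((), f2[:-1] + w_children) + cost(None, w_label)
    v_label, v_children = f1[-1]
    w_label, w_children = f2[-1]
    return min(
        forest_distance(f1[:-1] + v_children, f2) + cost(v_label, None),
        forest_distance(f1, f2[:-1] + w_children) + cost(None, w_label),
        # Map v to w: match their child forests and the remaining forests.
        forest_distance(v_children, w_children)
        + forest_distance(f1[:-1], f2[:-1])
        + cost(v_label, w_label),
    )


def tree_distance(t1, t2):
    return forest_distance((t1,), (t2,))


old = ("div", (("p", ()), ("a", ())))
new = ("div", (("p", ()),))
print(tree_distance(old, new))
```

In the example, the two trees differ by one deleted `a` node, so the distance under unit costs is 1.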
Preferably, the tree edit distance is adopted as the calculation method. The step of calculating the structural similarity of the DOM subtrees of the current content block and the initial content block may comprise: arranging the labels in the DOM subtree of each of the two content blocks, in order, into a character string; and calculating the structural similarity from the edit distance between the two character strings in combination with the key content.
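A minimal sketch of the string-based calculation is given below. It assumes a DOM subtree represented as a `(tag, children)` tuple, a preorder serialization of the tags, the Levenshtein distance over the resulting tag sequences, and a normalization to the range [0, 1] by the longer sequence length. The normalization and the function names are assumptions, and the way the key content enters the score is not modelled here.

```python
def preorder_tags(node):
    """Serialize a (tag, children) tree into a list of tags in preorder."""
    tag, children = node
    tags = [tag]
    for child in children:
        tags.extend(preorder_tags(child))
    return tags


def levenshtein(a, b):
    """Classic edit distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]


def structural_similarity(block_a, block_b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    s1, s2 = preorder_tags(block_a), preorder_tags(block_b)
    return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2))


item = ("li", (("a", ()),))
initial = ("ul", (item, item))
current = ("ul", (item,))
print(preorder_tags(initial), structural_similarity(initial, current))
```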
In actual monitoring, list elements contained in a content block pose a particular problem for calculating the structural similarity of the content block, mainly because the number of list elements is not fixed. A change in the number of list elements adds or removes nodes in the content block and thereby affects its structure. However, the inventors believe that such variations should not be taken into account when computing structural similarity, because a developer typically treats the list as a whole rather than focusing on individual list items. The addition and deletion of list items therefore generally does not break the processing of the list. To address this problem, calculating the structural similarity from the edit distance of the character strings between two content blocks and the key content may be performed as follows:
arranging the labels in the DOM subtrees of the compressed current content block and the compressed initial content block, in order, into character strings;
wherein the step of compressing a content block comprises:
copying the content block CB into a list-item-compressed content block CCB, removing all child nodes from the child node list children of the CCB, and initializing the new child node list cchildren to an empty list;
performing a depth-first traversal of the content block, recursively applying the list item compression to each child node in turn to obtain a compressed child node cchild;
searching cchildren for a child node whose structure is similar to that of cchild;
adding the compressed cchild to cchildren when no child node with a structure similar to cchild is found;
assigning cchildren as the children of the CCB and returning the CCB.
The embodiment of the invention thus provides a compression algorithm for the list elements in a content block. It compresses similar list items in a bottom-up manner, so that in the end only one item is retained as a description of the list item structure. The DOM subtree of the compressed content block is output, and subsequent structure checks are based on the compressed DOM subtree, which eliminates the effects of variations in the number of list items.
It should be noted that the embodiment of the present invention looks for similar nodes rather than exactly identical nodes, for the following reason: many Web pages give certain list items a special structure in order to emphasize important content. As shown in fig. 4, for the latest content the page adds an extra sup tag inside the a tag of the list item, whereas relatively older list items have no such tag. Such hint nodes are typically independent of the actual content and are not part of the key content, so such subtle structural differences do not affect the integration between systems. If only list item nodes with identical structure were compressed, the structure of the content block might be wrongly considered to have changed simply because the newly acquired page contains no new content. To handle this situation, the algorithm looks for similar nodes rather than exactly identical nodes. As for measuring node similarity, since the differences between similar nodes are small, the algorithm arranges the labels in the DOM subtree of each node into a character string in preorder traversal order, and then judges the similarity of the nodes from the edit distance between the character strings in combination with the specified key content.
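The list item compression and the "similar rather than identical" check can be sketched as follows. The sketch assumes `(tag, children)` tuples for DOM nodes and judges two nodes similar when the normalized edit distance of their preorder tag sequences gives a similarity of at least 0.6. That threshold, and the names `compress` and `similar`, are illustrative assumptions; the role of the key content in the similarity judgment is omitted.

```python
def preorder_tags(node):
    tag, children = node
    return [tag] + [t for child in children for t in preorder_tags(child)]


def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def similar(n1, n2, threshold=0.6):  # threshold is an assumed value
    s1, s2 = preorder_tags(n1), preorder_tags(n2)
    return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2)) >= threshold


def compress(block):
    """Bottom-up list item compression: keep one child per group of similar children."""
    tag, children = block
    cchildren = []
    for child in children:
        cchild = compress(child)  # compress the subtree first
        if not any(similar(cchild, kept) for kept in cchildren):
            cchildren.append(cchild)
    return (tag, tuple(cchildren))


new_item = ("li", (("a", (("sup", ()),)),))  # latest item with an extra sup tag
old_item = ("li", (("a", ()),))
block = ("div", (("h2", ()), ("ul", (new_item, old_item, old_item))))
print(compress(block))
```

In the example, the three list items collapse into the first one, while the `h2` title and the `ul` list are kept as separate children because their tag sequences are not similar.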
In summary, steps S101 to S104 help the developer actively and promptly discover relevant changes in the page structure of the target Web application, so as to determine whether the existing integration between systems has become invalid. They also address the problem that a page in an existing Web application is not necessarily directly accessible through a URL and may require a series of preliminary operations, such as logging in or multiple clicks, which makes the content of such pages difficult to acquire.
The next problem to be solved by the present invention is how to locate the key content in the changed page and thereby assist the developer in repairing the integration between Web applications. With continued reference to fig. 1, the method may further include the following steps:
Step S105, storing the Web page after the change as historical version page data;
To facilitate subsequent development, once page change detection has been performed and it is determined that the Web page has no relevant change, the embodiment of the invention stores the Web page content acquired by the monitoring task as historical version page data of the Web page. The stored historical version page data serves as the basis for the subsequent positioning of the key content.
Step S106, acquiring all historical content blocks of the historical version page data, and extracting structural features and text features from each historical content block;
Step S107, integrating the structural features and text features of all the historical content blocks, sequentially positioning the historical target content block and the key content in the historical target content block, and determining first features of the historical target content block;
The positioning of the key content must rely on features of the key content that do not change. However, the key content is usually text information in a page; its structure is simple and it carries relatively little feature information. On the other hand, the key content is typically concentrated in one area of the Web page and contained in a relatively larger content block, which usually has a richer structure and can therefore be characterized by more features. The invention therefore divides the positioning of the key content into two steps: positioning the content block, and positioning the key content within the content block. This step extracts the features of the content block, including structural features, text features, and so on.
In an embodiment of the present invention, the structural features may be obtained from the structure of the DOM subtree corresponding to each historical content block, and include:
the ratio of the height of the DOM subtree to the height of the DOM tree of the entire page, where the DOM subtree height is the length of the longest path from the root node of the DOM subtree corresponding to the historical content block to one of its leaf nodes;
the ratio of the number of nodes of the DOM subtree to the number of nodes of the DOM tree of the entire page, where only tag nodes are counted and text nodes are not;
the ratio of picture nodes in the DOM subtree, i.e., the ratio of the number of nodes in the DOM subtree labeled img to the total number of nodes in the DOM subtree;
the ratio of text nodes in the DOM subtree, i.e., the ratio of the number of nodes in the DOM subtree labeled p to the total number of nodes in the DOM subtree;
the ratio of hyperlink nodes in the DOM subtree, i.e., the ratio of the number of nodes in the DOM subtree labeled a to the total number of nodes in the DOM subtree.
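The five structural features can be computed as in the following sketch. It assumes that a DOM tree containing only tag nodes is given as nested `(tag, children)` tuples and measures height in edges; the function and key names are illustrative.

```python
def height(node):
    """Length (in edges) of the longest path from node down to a leaf."""
    _, children = node
    return 1 + max(map(height, children)) if children else 0


def iter_nodes(node):
    yield node
    for child in node[1]:
        yield from iter_nodes(child)


def structural_features(block, page):
    block_nodes = list(iter_nodes(block))
    total = len(block_nodes)
    page_height = height(page)

    def tag_ratio(tag):
        return sum(1 for t, _ in block_nodes if t == tag) / total

    return {
        "height_ratio": height(block) / page_height if page_height else 0.0,
        "node_ratio": total / sum(1 for _ in iter_nodes(page)),
        "img_ratio": tag_ratio("img"),
        "p_ratio": tag_ratio("p"),
        "a_ratio": tag_ratio("a"),
    }


block = ("div", (("p", ()), ("img", ()), ("a", ()), ("a", ())))
page = ("html", (("body", (block, ("div", (("p", ()),)))),))
print(structural_features(block, page))
```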
The text features are obtained from the HTML code of each historical content block, including:
a title of the content block;
text contained in the content block;
a ratio of text length contained by the content block to total text length of the page;
common prefixes of hyperlinks in content blocks.
The above features all concern a single content block. By the time content block features are extracted, the embodiment of the present invention has usually stored multiple historical pages and therefore has multiple historical content blocks. The above features are extracted separately for each historical content block and then integrated for the final positioning of the content block (i.e., the historical target content block).
In a specific implementation, the features are integrated as follows. Numerical features, including the various ratios, are integrated by averaging the feature values. The title of the content block requires no further processing: since no significant change occurred among the historical content blocks, their titles are necessarily consistent. For the text contained in the content blocks, the embodiment of the invention obtains the integrated feature by concatenating the text of all the content blocks. Finally, for the common prefix of the hyperlinks in a content block, the feature value is used directly if it is equal in all content blocks; otherwise the feature is an empty string. In addition to the integrated features computed directly from the features of each content block, the embodiment of the present invention may further obtain the following feature by considering all content blocks together: the common text content of the content blocks, i.e., text content that appears at the same location in each content block and occurs more than once.
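The integration rules can be sketched as below. The dictionary layout (`ratios`, `title`, `text`, `link_prefix`) and the function names are assumptions made for illustration; `os.path.commonprefix` is used only as a convenient character-wise common-prefix helper for the hyperlinks of one block.

```python
from os.path import commonprefix
from statistics import mean


def link_prefix(hrefs):
    """Common prefix of all hyperlinks in one content block."""
    return commonprefix(list(hrefs))


def integrate(blocks):
    """Combine the features of several historical content blocks."""
    ratio_names = blocks[0]["ratios"].keys()
    prefixes = {b["link_prefix"] for b in blocks}
    return {
        # Numerical features: average over all historical blocks.
        "ratios": {n: mean(b["ratios"][n] for b in blocks) for n in ratio_names},
        # Titles are consistent across historical blocks, so take the first.
        "title": blocks[0]["title"],
        # Text: concatenate the text of all blocks.
        "text": " ".join(b["text"] for b in blocks),
        # Hyperlink prefix: keep it only if every block agrees.
        "link_prefix": prefixes.pop() if len(prefixes) == 1 else "",
    }


b1 = {"title": "News", "text": "first story", "ratios": {"a_ratio": 0.4},
      "link_prefix": link_prefix(["/news/2019/a.html", "/news/2019/b.html"])}
b2 = {"title": "News", "text": "second story", "ratios": {"a_ratio": 0.6},
      "link_prefix": link_prefix(["/news/2019/c.html", "/news/2019/d.html"])}
print(integrate([b1, b2]))
```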
The selection of some of these features is briefly explained here. The structural features are chosen on the assumption that the basic structural characteristics of a content block generally do not change greatly before and after a page change and retain a certain similarity. Among the text features, the title of a content block is directly related to the semantics the content block expresses and can remain consistent before and after the change. Hyperlinks are typically tied to the file organization of the Web application, which usually does not change frequently, so the hyperlinks retain some consistency before and after the page changes. Common text is typically text that appears in the page template and reveals the semantics of the data items, such as the header of a table, which indicates the meaning of each column; it is therefore stable across changes.
The extraction of the common text feature is now described further. It can be combined with the change detection process and updated with each detection. As mentioned for change detection in case C, the similarity between content blocks can be calculated from the tree edit distance. During that calculation, a mapping between the nodes of the two content blocks is obtained. For each pair of corresponding nodes, if the nodes are leaf nodes containing text, the text of each node is extracted, yielding a pair of text strings. The two text strings are then segmented into words, stop words are removed, and the longest common subsequence of the remaining word sequences is calculated, which gives the common text portion. Each time a new Web page is obtained, the common text portion with respect to the initial Web page is calculated according to this process and marked; the text that ends up marked every time is the common text feature of the content block.
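The common-text step for one pair of mapped leaf nodes can be sketched as follows. Whitespace splitting stands in for real word segmentation, and the small stop-word set is a placeholder; both are assumptions, as is the function naming.

```python
STOP_WORDS = {"the", "of", "a", "an", "and"}  # placeholder stop-word list


def tokens(text):
    """Assumed tokenizer: whitespace split, then stop-word removal."""
    return [w for w in text.split() if w.lower() not in STOP_WORDS]


def longest_common_subsequence(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def common_text(text_a, text_b):
    return longest_common_subsequence(tokens(text_a), tokens(text_b))


print(common_text("Date of release 2019-12-05", "Date of release 2020-01-10"))
```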
Step S108, for a newly monitored new version Web page that has changed, acquiring all subtrees of the DOM tree of the new version Web page, taking each subtree as a content block to be matched, and extracting the features of each content block to be matched;
In this step, the extraction of the features of the content blocks of the new version Web page may refer to the descriptions of the extraction of the features of the historical content blocks in steps S106 to S107, which are not repeated here.
Step S109, traversing all the content blocks to be matched, carrying out similarity calculation on the characteristics of each content block to be matched and the first characteristics of the historical target content blocks, and positioning the current target content block;
After the above features have been extracted for each content block to be matched in the new version Web page, the content blocks of the new and old version Web pages are matched; that is, each content block to be matched in the new version Web page is matched against the historical target content block of the old version Web page, and the similarity is calculated from the integrated features and the features of the content block to be checked. Specifically, the embodiment of the invention models content block matching as a search: the new version page is regarded as a set of content blocks to be searched (to be matched), among which the content block most similar to the historical target content block must be found. The matching algorithm first extracts the set of content blocks to be searched from the changed page, then traverses the content blocks in the set and calculates the similarity of each to the target block. If the content block currently being traversed is better than the best content block found so far, the algorithm updates the best result. After all content blocks to be searched have been traversed, the algorithm returns the best content block obtained.
In the above algorithm, the rules for extracting the set of content blocks to be searched from the page are as follows: 1) If the title in the integrated features of the target content block is not empty, the title node is first found in the new version page according to the title content and added to the set of content blocks to be searched. The parent of that node is then examined: if the node is the first child of its parent, the parent is added to the set, and the same check is repeated for the parent and its own parent; otherwise the search stops, and the set of content blocks to be searched is complete. 2) If the title in the integrated features of the target content block is empty, all subtrees of the DOM tree of the new version page are added directly to the set of content blocks to be searched.
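The two extraction rules can be sketched as follows. The `Node` class is a hypothetical minimal DOM node with parent links. Falling back to rule 2 when the title cannot be found in the new page is an assumption, because the text does not cover that situation.

```python
class Node:
    """Hypothetical minimal DOM node with a parent link."""

    def __init__(self, tag, text="", children=()):
        self.tag, self.text = tag, text
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def subtrees(root):
    yield root
    for child in root.children:
        yield from subtrees(child)


def candidate_blocks(root, title):
    if title:
        node = next((n for n in subtrees(root) if n.text.strip() == title), None)
        if node is not None:
            # Rule 1: climb while the current node is the first child of its parent.
            candidates = [node]
            while node.parent is not None and node.parent.children[0] is node:
                node = node.parent
                candidates.append(node)
            return candidates
    # Rule 2 (also used here, by assumption, when the title is not found).
    return list(subtrees(root))


h2 = Node("h2", "News")
section = Node("section", children=[Node("div", children=[h2]), Node("ul")])
body = Node("body", children=[Node("nav"), section])
print([n.tag for n in candidate_blocks(body, "News")])
```

In the example, the climb stops at `section` because it is not the first child of `body`.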
The calculation of the similarity depends on the features of the content block. However, the feature values are of several types, including numerical features and text features, so the similarity is difficult to calculate directly and the features need some preprocessing. A common approach is to convert the various features into numerical values to form feature vectors, and then to calculate the cosine similarity of the feature vectors. For example, if the converted feature vector of the target content block is $\langle a_1, a_2, \ldots, a_n \rangle$ and the feature vector of the content block to be checked is $\langle b_1, b_2, \ldots, b_n \rangle$, the similarity between the two content blocks is defined as:

$$\mathrm{sim} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
The preprocessing in the embodiment of the invention is as follows. Numerical features are used directly without processing. For the text features of the content blocks, the text of each content block to be searched and of the target content block is treated as a separate document, all the documents form a corpus, and each document is converted into a vector through TF-IDF. For the hyperlink feature, a content block to be searched whose common hyperlink prefix equals that of the target content block should receive a higher similarity, so the feature is encoded as 1 if the prefix equals the target prefix and 0 otherwise. The common text feature is processed in a similar way: if the text is contained in the content block to be searched, the corresponding position of the vector is set to 1, and otherwise to 0. Once all features have been converted in this way, a feature vector usable for similarity calculation is obtained.
The current target content block is then located in the new version Web page according to the feature vectors available for similarity calculation.
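The search over candidate content blocks can be sketched as below. The feature vectors are assumed to be already preprocessed as described (numerical ratios followed by 0/1 flags); the TF-IDF part of the vector is left out for brevity, and the candidate names and values are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(target, candidates):
    """Traverse all candidates and keep the one most similar to the target."""
    best_name, best_score = None, -1.0
    for name, vector in candidates.items():
        score = cosine(target, vector)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Illustrative vectors: [height_ratio, a_ratio, link_prefix_flag, common_text_flag]
target = [0.3, 0.5, 1.0, 1.0]
candidates = {
    "nav": [0.1, 0.9, 0.0, 0.0],
    "news": [0.3, 0.4, 1.0, 1.0],
}
print(best_match(target, candidates))
```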
Step S110, extracting second characteristics of key contents in the historical target content block and characteristics of key contents in the current target content block;
After the current target content block has been located, the key content contained in it must be further mapped and located. The feature extraction for the key content in the target content block may follow the description of the feature extraction for historical content blocks in steps S106 to S107. Specifically, a recursive-descent approach analogous to the positioning of the content block is adopted: if the matched content block is regarded as the whole Web page and the key content to be located is regarded as the content block that was located before, i.e., the content block is "zoomed in", the same matching process can be used to locate the key content within the content block. First, the features of the key content are extracted, including its tag, its attributes, its relative location in the content block, and so on.
After the content block containing the key content has been located, the key content must be relocated inside the content block, because the internal structure of the content block may also differ. As mentioned before, the key content essentially corresponds to leaf nodes of the page DOM tree, so its features are extracted mainly from the label of the node holding the data and from characteristics of its text content. That is, the second features of the key content in the historical target content block are obtained from the leaf node of the page DOM tree corresponding to that key content, and specifically include: the label of the node; the length of the node text; the data pattern of the node text; and the id and class attributes of the node.
Step S111, establishing a mapping relation of the key content between the new and old version pages according to the edit distance between the DOM subtrees corresponding to the second features of the key content in the historical target content block and to the features of the key content in the current target content block, and positioning the final key content in the current target content block.
Key content matching in the embodiment of the invention differs from content block matching: for key content, the tree edit distance is used to calculate the mapping relation between key content nodes. The label of the node, the length of the node text, the data pattern of the node text, and the id and class attributes of the node are used to calculate the matching cost of a node pair, i.e., the γ function described for case C. The features are first preprocessed as follows:
processing of the node label feature: 10 common HTML labels are selected, and all other labels form one further category, giving 11 categories; the feature is converted into an 11-dimensional vector by one-hot encoding;
processing of the node text length feature: the length value is used directly as one dimension of the final feature vector;
processing of the node text data pattern feature: common formats for dates, e-mail addresses, telephone numbers, temperatures and other information are predefined as regular expressions; whether the text content belongs to a given information category is judged from the regular expression match, and the feature is finally converted into a vector by one-hot encoding;
processing of the id and class features: the node is classified according to whether its id and class are equal to those of the target content.
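The four preprocessing rules can be sketched as follows. The list of 10 common tags and the regular expressions are hypothetical, since the text does not specify them, and the function name `node_vector` is illustrative.

```python
import re

# Hypothetical choice of 10 common tags; any other tag falls into an 11th category.
COMMON_TAGS = ["a", "span", "p", "div", "td", "li", "h1", "h2", "strong", "em"]

# Hypothetical patterns for the predefined data formats.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{1,2}-\d{1,2}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d -]{6,}\d"),
    "temperature": re.compile(r"-?\d+(\.\d+)?\s?°[CF]"),
}


def one_hot(index, size):
    vector = [0] * size
    vector[index] = 1
    return vector


def node_vector(tag, text, node_id, node_class, target_id, target_class):
    tag_index = COMMON_TAGS.index(tag) if tag in COMMON_TAGS else len(COMMON_TAGS)
    names = list(PATTERNS)
    # First matching pattern wins; the last slot means "no known pattern".
    pattern_index = next(
        (i for i, n in enumerate(names) if PATTERNS[n].fullmatch(text.strip())),
        len(names),
    )
    return (one_hot(tag_index, len(COMMON_TAGS) + 1)      # 11-dimensional label
            + [len(text)]                                 # text length
            + one_hot(pattern_index, len(names) + 1)      # data pattern
            + [int(node_id == target_id), int(node_class == target_class)])


print(node_vector("span", "2019-12-05", "pub", "date", "pub", "date"))
```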
After the second features of the key content in the historical target content block and the features of the key content in the current target content block have been preprocessed, the mapping relation of the key content between the new and old version pages is established from the edit distance between the corresponding DOM subtrees. The node mapping that yields the minimum edit distance is taken as the final mapping between the key content, so that the final key content is located in the current target content block and the positioning is complete.
In summary, steps S105 to S111 provide a page key content positioning technique based on multi-modal feature fusion and progressive, step-by-step positioning. By extracting and fusing features such as the text and structure of the key content in the old version page, the content block and then the key content are located in the new version page. Finally, the mapping relation of the key content between the new and old version pages, together with the change in how its elements are located, is presented visually, which helps the developer repair the existing system integration scheme.
To put the monitoring method into practical use, the embodiment of the invention encapsulates the monitoring flow as a service in a Web page monitoring system and provides friendly user interaction. The overall architecture of the system is shown in fig. 5 and is divided into a front end and a back end. The back end comprises monitoring task management, which is divided into a monitoring task storage module, a monitoring task scheduling module, a page acquisition module, a page storage module, a page change detection module and a key content positioning module. In addition, a change notification module and a system state self-checking module are added. The change notification module is responsible for notifying developers or system operation and maintenance personnel of a page change after the change has been detected and the key content positioning has been completed, so that they learn of the page change in time and can respond to it. The system state self-checking module checks the running state of the system itself; because the system may be integrated with other application systems as a micro-service, checking its own running state is crucial. The back-end interfaces include functional interfaces and externally provided interfaces for accessing system functions; the back-end interfaces in the embodiment of the present invention are shown in table 1.
TABLE 1
Interface name | Description
Monitoring task registration interface | Receives the user's monitoring configuration and generates a page monitoring task
Monitoring list acquisition interface | Obtains all currently registered page monitoring tasks
Monitoring task modification interface | Modifies the configuration of an existing page monitoring task
Monitoring task deletion interface | Deletes a page monitoring task
Monitoring task start/stop interface | Pauses or restarts a page monitoring task
Monitoring result acquisition interface | Obtains the result of a given page monitoring task
System running state acquisition interface | Obtains the current running state of the whole system
The front end of the system mainly comprises a monitoring task configuration interface, a monitoring task management interface, a monitoring result display interface and a system running state management interface. The monitoring task configuration interface lets the user configure page monitoring, so it contains input controls for the URL of the target page, the XPath of the key content, and the monitoring frequency. In addition, for a target page that can only be accessed after logging in, the interface provides an entry through which the user logs in to the target system in advance, and the initial login state information is stored in cooperation with the back end. After the configuration is complete, the user submits the configuration information through the submit control of the interface, and the page monitoring configuration is sent to the back end to register a new page monitoring task.
The monitoring task management interface displays the list of registered page monitoring tasks and provides controls for managing the life cycle of a monitoring task, such as editing, starting/stopping and deleting. The interface also includes an entry that opens the monitoring task configuration interface for configuring a new monitoring task. The state of each monitoring task is shown briefly in the interface, for example whether a page change has been detected, so that the user knows the current monitoring result. The interface lets the user search for monitoring tasks by monitoring state, target page and so on, in order to find a given monitoring task quickly. The interface also contains an entry to the monitoring result display interface, so that the user can obtain more detailed change detection and key content positioning results.
The monitoring result display interface is used for displaying the monitoring result of the target page in detail, including detection of page change and positioning of key content, visually displaying the corresponding relation between the new and old page content blocks and the key content to the user, and assisting the user in processing the page change later.
The system running state management interface is used for displaying the self-checking state of the system and helping a user to know the running state of the current system.
For the monitoring task configuration interface, the monitoring task management interface and the monitoring result display interface, a specific workflow may be as follows. After entering the system, the user first sees the monitoring task management interface, which lists all registered monitoring tasks. The "status" column of each monitoring task shows the current monitoring state of the task, and the "operations" column controls the task. An "add" button at the upper right of the interface opens the monitoring task configuration interface when clicked. The configuration interface contains an input box for the URL of the target Web page and XPath input boxes for the key content, to which entries can be added dynamically. The interface also includes a switch indicating whether the target system requires login. If the switch is turned on, a pre-login button can be clicked; the system then opens a new interface and jumps to the target Web application, where the user can log in, and the login information is recorded and stored in the configuration of the monitoring task. After the configuration is complete, the user clicks the "add" button to add the configured monitoring task, which then appears in the task list. After the monitoring task detects a change in the Web page and completes the key content positioning, the "status" column of the task changes to abnormal, and clicking the status opens the monitoring result display interface. The monitoring result display interface visually shows the mapping relation between the content blocks and the key content in the new and old version pages, as well as the XPath of the key content in the new version page.
Next, the effect of the key content positioning method according to an embodiment of the present invention is verified using specific examples.
First, the method of the embodiment was applied to several representative real-world Web application change cases to detect changes and locate key content. The results show that the method can detect changes in a Web page and locate the required key content in the changed page, which demonstrates its effectiveness. The accuracy of the change detection and content positioning processes was then verified on a larger page dataset covering 18 websites of common Web page types. The results show that the method achieves high accuracy in both Web page change detection and key content positioning.
1. Case study: a certain Chinese website X.
Taking the home page of a certain Chinese website X as an example, the change detection, content block positioning, and related processes are verified. This case involves the login handling, title identification, and feature extraction methods provided by the embodiment of the present invention.
Because all experimental Web page data come from the Web Archive website, and the pages captured by that website do not include pages accessible only after logging in to a Web system, this example simulates a Web system that requires login in order to verify the login handling proposed herein. Specifically, during experimental design, the login interface of website X and historical data of the website X home page were used to build a simulated "login version" of website X, in which the home page can only be accessed after logging in. The system contains 17 crawled snapshots of the website X home page, denoted p_1 to p_17, where p_1 to p_16 are pages of the same version captured at different times and p_17 is the page after modification. Successive accesses to the system return the content corresponding to p_1, p_2, …, p_17 in turn. The Web page change monitoring system implemented according to the embodiment of the invention is then used to perform change monitoring and key content positioning on this simulated website X system.
Fig. 6a, 6b, and 6c show three representative home pages of Chinese website X, corresponding to p_1, p_7, and p_17 respectively, where "network-aware dynamics" is the data to be acquired in this example, that is, the key content of the page. p_1 and p_7 are similar in structure but differ in the number of dynamic entries, so p_7 has 4 more li nodes than p_1, as shown in FIG. 7a; as described earlier, embodiments of the present invention do not treat this kind of structural change as a relevant change. p_1 and p_17 differ considerably in structure, and the relative position of the key content "network-aware dynamics" in the page also changes. As shown in fig. 7b, the XPath of "network-aware dynamics" in p_1 can no longer locate the specific content block in p_17, so the page has undergone a relevant change. For this example, the expected result is therefore that the system reports CHANGED after acquiring and checking p_17 and then completes the positioning of the content block and key content in p_17; before that, the result of each check of the Web page should be NO DIFFERENCE.
Consider first the handling of system login. This example uses an existing Y platform to expose the login operation of the system as a service and generate a corresponding login interface; similarly, the Y platform is used to generate a data interface for the home page after login, and this interface directly returns the complete page content. After one call to the login interface, the related Session information is managed by the Y platform system. Subsequent calls to the home page interface directly return the page content and update the Session information at the same time, which ensures that the corresponding page data can continue to be acquired through the interface. The login and home page interfaces are shown in fig. 7c.
With the above interfaces for logging in and acquiring the home page content, monitoring of the Web system page can begin. A monitoring task is registered in the system using both interfaces; the system then starts monitoring and locates the key content after a change is detected. The system log shown in fig. 7d shows that the system detected a page change at the 17th execution of the monitoring task and reported CHANGED, as expected. The system then locates the content block and maps the key content in the changed page. For this example, the title identification algorithm provided in the embodiment of the present invention identifies the title of the content block as "network-aware", so only content blocks containing this title need to be checked in the new version page. The content block matching result and the key content mapping finally computed by the system are shown in fig. 7e. For this example, the method of the embodiment of the present invention accurately locates the key content in the page.
2. Case study: the intellectual property office of a certain province
This example takes the intellectual property office system of a certain province (the example shown in fig. 3) and focuses on verifying the correctness of the title identification method of the embodiment on a special page structure, as well as its effectiveness in the change detection process.
In this example, the two page versions shown in fig. 3 and the XPath of a specified key content item are taken as input. In both versions the specified XPath locates a corresponding node, so the method proceeds to determine whether a wrong node has been located. The method first obtains the XPath of the content block, //*[@id="newsrt1_1"], by computing the nearest common ancestor node; in the old version page, this path corresponds to the "city county dynamics" content. In the new version of the home page, two tabs, "national file" and "province file", have been added, and the above XPath locates the content of the "province file" tab in the new version page. The method then attempts to identify the titles of the content blocks and finds that the titles of the two content blocks are "city county dynamics" and "province file" respectively. Because the two titles differ, the method correctly reports CHANGED and detects this rather unusual page change, which demonstrates the correctness of the title identification method provided by the embodiment of the present invention and its effectiveness as an auxiliary means of change detection.
In the accuracy verification stage, the Web systems corresponding to 18 actual Y platform projects were selected. Historical version page data of these Web systems from 2014 to 2018 were crawled through the Web Archive website, for a total of 2836 Web pages.
First, the change detection process was verified experimentally. From the 2836 Web pages above, 79 page pairs were selected to form 79 change detection test cases, each consisting of Web page data of the same Web application at different times and the XPath of the content block to be monitored. Of the 79 test cases, 56 involved no relevant change and the remaining 23 did. Table 2 gives the evaluation indexes defined in this example for the change detection process; this example mainly examines the precision and recall of detection.
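Table 2 Change detection evaluation indexes
[Table 2 appears only as images in the source publication (Figures BDA0002304950020000161 and BDA0002304950020000171); its contents are not reproduced here.]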
Table 3 Change detection results
                           | Related change actually present | No related change actually present
Related change detected    | 23                              | 2
No related change detected | 0                               | 54
Table 3 shows the results of change detection on the 79 test cases. The precision of the Web page change detection method proposed herein is P = 23/(23+2) = 92%, and the recall is R = 23/(23+0) = 100%.
Therefore, the detection method of the embodiment of the invention detects Web page changes with high precision and recall.
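The precision and recall figures can be reproduced directly from the counts in Table 3. The following is a minimal Python sketch; the function name `precision_recall` is illustrative and not part of the patent.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts read from Table 3: 23 related changes correctly detected,
# 2 false alarms, 0 missed changes.
p, r = precision_recall(tp=23, fp=2, fn=0)
print(f"precision={p:.0%}, recall={r:.0%}")
```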
Next, the content block positioning process was verified. The positioning result given in this example is a recommendation list of content blocks, so the two evaluation indexes shown in table 4 were selected to verify the effect of the content block positioning process. The experimental data again use the pages of the 18 Web systems: for each Web system, a series of old version pages and one new version page were selected to form a test case, giving 18 test cases in total. These test cases were passed as input to the content block positioning process, and the positioning results output by the process were tallied; the specific results are shown in table 5.
Table 4 Content block positioning evaluation indexes
Index name                           | Meaning of index
Recommendation accuracy (P1)         | Proportion of test cases in which the target content block is among the top five of the recommendation list
Optimal recommendation accuracy (P2) | Proportion of test cases in which the target content block is ranked first in the recommendation list
Table 5 Content block positioning results
Number of recommendations         | 16
Number of optimal recommendations | 14
According to the experimental results, the recommendation accuracy of the content block positioning process in this example is P1 = 16/18 = 88.9%, and the optimal recommendation accuracy is P2 = 14/18 = 77.8%. The content block positioning process of the embodiment of the invention therefore achieves good accuracy.
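P1 and P2 are both hit rates over the 18 recommendation lists, differing only in the rank cutoff. The sketch below assumes each test case is summarized by the 1-based rank of the target content block in its list (or None when the block is absent). The individual ranks are hypothetical placeholders chosen only so that the totals agree with Table 5; the patent reports the totals, not per-case ranks.

```python
def hit_rate_at_k(ranks, k):
    """Fraction of test cases whose target block is ranked within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical per-case ranks: 14 cases ranked first, 2 more within the
# top five, and 2 misses. Only the totals 16 and 14 come from Table 5.
ranks = [1] * 14 + [3, 5] + [None, 8]

p1 = hit_rate_at_k(ranks, k=5)  # recommendation accuracy
p2 = hit_rate_at_k(ranks, k=1)  # optimal recommendation accuracy
print(f"P1={p1:.1%}, P2={p2:.1%}")
```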
Finally, the key content mapping process was verified. From the 18 Web systems, 18 pairs of old and new version pages were selected, the positions of the content blocks in the old and new version pages were specified manually, and this information was passed as input to the key content mapping module. In total, the old version content blocks contain 349 key content items to be mapped, and mapping accuracy is used as the evaluation index for this process. The final results show that the key content mapping method of the embodiment maps 319 key content items correctly, a mapping accuracy of 319/349 = 91.4%, which indicates that the method of the embodiment of the invention maps key content with high accuracy.
In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be understood by reference to one another.
The key content positioning method provided by the present invention has been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the invention, and the description of the above examples is intended only to help in understanding the method and its core idea. Those skilled in the art may, in accordance with the ideas of the present invention, vary the specific embodiments and the scope of application; accordingly, the content of this description should not be construed as limiting the present invention.

Claims (10)

1. A key content location method, comprising:
periodically acquiring a Web page to be monitored;
locating a corresponding node in a Document Object Model (DOM) tree of the Web page according to key content in preset monitoring configuration information; wherein the monitoring configuration information is set in a configuration page, the configuration page is provided with a switch indicating whether login is required, and when the Web page requires login, the login switch is turned on to perform pre-login configuration;
when the corresponding node cannot be located, determining that the Web page has changed;
when the corresponding node is located, obtaining from the Web page a current content block containing the key content in the monitoring configuration information, comparing the title of the current content block with the title of an initial content block obtained from the Web page when the monitoring task was initialized, and determining whether the Web page has changed;
storing the Web page after the change as historical version page data;
acquiring all historical content blocks of the historical version page data, and extracting structural features and text features from each historical content block;
integrating structural features and text features of all historical content blocks, sequentially positioning a historical target content block and key contents in the historical target content block, and determining a first feature of the historical target content block;
for a newly monitored new version Web page that has changed, acquiring all subtrees of the DOM tree of the new version Web page, taking each subtree as a content block to be matched, and extracting the features of each content block to be matched;
traversing all the content blocks to be matched, calculating the similarity between the features of each content block to be matched and the first feature of the historical target content block, and locating the current target content block;
extracting second features of the key content in the historical target content block and features of the key content in the current target content block;
and establishing a mapping relation between the key content and the new version page according to the second feature of the key content in the historical target content block, the features of the key content in the current target content block, and the edit distance between the corresponding DOM subtrees, and locating the final key content in the current target content block.
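The candidate matching step (every subtree of the new page is scored against the historical target block and the best matches are kept) can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed feature set: the only feature used is the visible text of each subtree, the similarity is Python's `difflib.SequenceMatcher` ratio, and the input is assumed to be well-formed XHTML so that `xml.etree.ElementTree` can parse it. The names `block_text` and `recommend_blocks` are hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET


def block_text(node):
    """Concatenate the visible text of a subtree."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())


def recommend_blocks(old_block, new_root, top_n=5):
    """Score every subtree of the new page against the old block (text only)."""
    target = block_text(old_block)
    scored = []
    for cand in new_root.iter():  # each element roots one candidate subtree
        score = difflib.SequenceMatcher(None, target, block_text(cand)).ratio()
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


OLD_BLOCK = ET.fromstring(
    '<div id="news"><h3>News</h3><ul><li>Item one</li><li>Item two</li></ul></div>'
)
NEW_PAGE = ET.fromstring(
    '<body><div id="side"><h3>Links</h3><p>Partner sites</p></div>'
    '<section class="latest"><h3>News</h3>'
    "<ul><li>Item one</li><li>Item three</li></ul></section></body>"
)

best_score, best_block = recommend_blocks(OLD_BLOCK, NEW_PAGE)[0]
print(best_block.tag, round(best_score, 2))
```

Returning a ranked list rather than a single block mirrors the recommendation list evaluated in the description; a fuller implementation would combine the structural and text features of claim 9 instead of text alone.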
2. The method of claim 1, wherein prior to periodically acquiring the Web page to be monitored, the method comprises:
receiving the monitoring configuration information of the Web page to be monitored, which is input by a user;
generating a monitoring task for monitoring the page change of the Web page to be monitored according to the monitoring configuration information;
wherein initializing the monitoring task comprises the following steps:
acquiring initial page information of the Web page;
and according to the key content in the monitoring configuration information, obtaining an initial content block containing the key content in the monitoring configuration information in the initial page information.
3. The method according to claim 1 or 2, wherein the monitoring configuration information includes login information of a user and Cookie information for verifying the login information;
the step of receiving the monitoring configuration information of the Web page to be monitored, which is input by a user, comprises the following steps:
receiving login operation of a user for the Web page, and obtaining login information of the user;
sending the login information to a server corresponding to the Web page;
receiving Cookie information returned by the server aiming at the login information and used for verifying the login information;
the step of periodically acquiring the Web page to be monitored comprises the following steps:
periodically sending the Cookie information, in the form of an HTTP request header, together with a request for acquiring the Web page, to the server corresponding to the Web page;
and receiving the Web page returned by the server for the request.
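Sending the stored Cookie as an HTTP request header can be sketched with Python's standard `urllib.request` module. The URL and cookie value below are placeholders, and the helper names `build_page_request` and `fetch_page` are illustrative assumptions; a real monitor would call `fetch_page` on a timer.

```python
import urllib.request


def build_page_request(url: str, cookie: str) -> urllib.request.Request:
    """Attach the Cookie saved at pre-login time as an HTTP request header."""
    return urllib.request.Request(url, headers={"Cookie": cookie})


def fetch_page(url: str, cookie: str, timeout: float = 10.0) -> str:
    """Fetch the monitored page once; meant to be called periodically."""
    request = build_page_request(url, cookie)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder values; no network access happens when only building the request.
req = build_page_request("http://example.invalid/home", "SESSIONID=abc123")
print(req.get_header("Cookie"))
```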
4. The method of claim 1, wherein the monitoring configuration information includes hypertext markup language HTML code for the Web page and XML path language XPath corresponding to the current content block;
The title obtaining method of the current content block comprises the following steps:
analyzing the HTML code into a corresponding DOM tree;
extracting a current content block CB from the DOM tree according to the XPath corresponding to the current content block;
querying a list CBList of sibling nodes similar to the CB;
acquiring a subscript i of the CB in the CBList;
assigning the current content block CB to a loop variable curNode, and looping until the title of the current content block is found; wherein each iteration of the loop comprises:
first taking the leftmost text node TextNode of curNode as a candidate title node candidate, and obtaining the text content of the candidate title node candidate;
judging, according to preset title features, whether the text content qualifies as the title of the current content block;
if yes, searching for the list candidates of sibling nodes similar to candidate, and returning the text content of candidates[i] as the title of the current content block;
if not, taking the parent node of the current node in the DOM tree as the candidate, expanding the search range, and continuing the loop.
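The title search loop of claim 4 can be sketched in Python as below. Several points are assumptions made for illustration: the page is well-formed XHTML (real HTML would need an HTML parser), "similar" siblings are those with the same tag and class attribute, the preset title features are reduced to "short and without digits", text in element tails is ignored, and the final step is read as moving curNode to its parent. The page snippets and the id `blk` are hypothetical and only loosely modeled on the tabbed page of the second case study.

```python
import xml.etree.ElementTree as ET

MAX_TITLE_LEN = 30  # assumed title feature: titles are short


def looks_like_title(text):
    """Assumed heuristic: non-empty, short, and free of digits."""
    return 0 < len(text) <= MAX_TITLE_LEN and not any(c.isdigit() for c in text)


def similar_siblings(node, parents):
    """Siblings (including node) sharing the node's tag and class attribute."""
    parent = parents.get(node)
    if parent is None:
        return [node]
    return [s for s in parent
            if s.tag == node.tag and s.get("class") == node.get("class")]


def leftmost_text_element(node):
    """First element in document order that carries non-blank text."""
    for el in node.iter():
        if el.text and el.text.strip():
            return el
    return None


def find_title(xhtml, xpath):
    root = ET.fromstring(xhtml)
    parents = {child: parent for parent in root.iter() for child in parent}
    cb = root.find(xpath)
    if cb is None:
        return None
    i = similar_siblings(cb, parents).index(cb)  # subscript of CB in CBList
    cur = cb
    while cur is not None:
        cand = leftmost_text_element(cur)
        if cand is not None and looks_like_title(cand.text.strip()):
            cands = similar_siblings(cand, parents)
            chosen = cands[i] if i < len(cands) else cand
            return (chosen.text or "").strip()
        cur = parents.get(cur)  # widen the search to the parent node
    return None


OLD_PAGE = (
    '<div id="news"><ul class="tabs"><li class="tab">City and county news</li></ul>'
    '<div class="panel" id="blk"><ul><li><a>County A publishes its 2019 '
    "intellectual property work plan</a></li></ul></div></div>"
)
NEW_PAGE = (
    '<div id="news"><ul class="tabs"><li class="tab">City and county news</li>'
    '<li class="tab">Provincial documents</li></ul>'
    '<div class="panel"><ul><li><a>County A publishes its 2019 intellectual '
    "property work plan</a></li></ul></div>"
    '<div class="panel" id="blk"><ul><li><a>Notice on the 2019 provincial '
    "patent award application process</a></li></ul></div></div>"
)
XPATH = ".//*[@id='blk']"

print(find_title(OLD_PAGE, XPATH), "|", find_title(NEW_PAGE, XPATH))
```

Using the subscript i to pick among the similar title candidates is what lets the same XPath yield different titles when it lands on a different tab panel, which is the situation that triggers the CHANGED result in the second case study.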
5. The method of claim 1, wherein the step of comparing the title of the current content block with the title of the initial content block obtained from the Web page in the initialization monitoring task, and determining whether the Web page has changed comprises:
Comparing the title of the current content block with the title of the initial content block obtained from the Web page in an initialization monitoring task;
if the titles are not null and equal, determining that the Web page has no relevant change;
if the titles are not empty and are not equal, determining that a wrong node has been located, and outputting a result indicating that a relevant change is detected;
if the titles are empty, calculating the semantic similarity between the current content block and the initial content block and the structural similarity of the DOM subtree, comparing the semantic similarity with a preset semantic threshold, comparing the structural similarity with a preset structural threshold, and determining whether the Web page changes according to the comparison result.
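The three-way decision of claim 5 can be written as a small function. The threshold values, the treatment of the case where only one title is empty, and the rule that falling below either threshold counts as a change are assumptions of this sketch; the claim itself only states that both similarities are compared with preset thresholds.

```python
def detect_change(cur_title, init_title, semantic_sim, structural_sim,
                  semantic_threshold=0.5, structural_threshold=0.5):
    """Return "CHANGED" or "NO DIFFERENCE" for one monitoring check."""
    if cur_title and init_title:
        # Both titles found: equality of titles decides the result.
        return "NO DIFFERENCE" if cur_title == init_title else "CHANGED"
    # Titles missing (assumed to include the one-empty case): fall back to
    # the semantic and structural similarity of the two content blocks.
    if semantic_sim < semantic_threshold or structural_sim < structural_threshold:
        return "CHANGED"
    return "NO DIFFERENCE"


print(detect_change("City news", "Province file", 0.9, 0.9))
print(detect_change("", "", 0.8, 0.7))
```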
6. The method of claim 5, wherein calculating a semantic similarity between the current content block and the initial content block, comparing the semantic similarity to a preset semantic threshold, and determining whether the Web page has changed based on the comparison result comprises:
extracting all text information in the current content block and the initial content block respectively;
calculating the similarity between all text information of the two content blocks;
Comparing the semantic similarity with a preset semantic threshold;
and when the semantic similarity is lower than a preset semantic threshold, determining that the Web page changes, and outputting a result of detecting the related change.
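The claim does not fix a particular text similarity measure. As a stand-in, the sketch below extracts all text from two content blocks and compares it with `difflib.SequenceMatcher` from the Python standard library; this choice, the well-formed XHTML input, and the example threshold of 0.5 are all assumptions made for illustration.

```python
import difflib
import xml.etree.ElementTree as ET


def block_text(xhtml):
    """Extract all text information from a content block."""
    root = ET.fromstring(xhtml)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


def text_similarity(block_a, block_b):
    """Similarity in [0, 1] between the texts of two content blocks."""
    return difflib.SequenceMatcher(None, block_text(block_a),
                                   block_text(block_b)).ratio()


initial = "<ul><li>Patent fee notice</li><li>Annual report released</li></ul>"
current = "<ul><li>Recruitment announcement</li><li>Tender results</li></ul>"

SEMANTIC_THRESHOLD = 0.5  # hypothetical preset threshold
similarity = text_similarity(initial, current)
print("CHANGED" if similarity < SEMANTIC_THRESHOLD else "NO DIFFERENCE")
```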
7. The method of claim 5, wherein the step of calculating the structural similarity of the current content block to the DOM subtree of the initial content block comprises:
sequentially arranging labels in respective DOM subtrees of the current content block and the initial content block into character strings;
and calculating the structural similarity according to the edit distance between the character strings of the two content blocks and the key content in the monitoring configuration information.
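A tag-sequence edit distance of this kind can be sketched as follows. The sketch treats each tag as one symbol, uses the standard Levenshtein distance, and normalizes by the longer sequence to obtain a similarity in [0, 1]. The normalization is an assumption, and the role of the key content in the claimed calculation is not modeled here.

```python
import xml.etree.ElementTree as ET


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]


def tag_sequence(xhtml):
    """Tags of the DOM subtree in document order."""
    return [el.tag for el in ET.fromstring(xhtml).iter()]


def structural_similarity(block_a, block_b):
    seq_a, seq_b = tag_sequence(block_a), tag_sequence(block_b)
    return 1.0 - edit_distance(seq_a, seq_b) / max(len(seq_a), len(seq_b))


two_items = "<ul><li><a/></li><li><a/></li></ul>"
one_item = "<ul><li><a/></li></ul>"
print(structural_similarity(two_items, one_item))
```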
8. The method of claim 7, wherein the step of ordering tags in respective DOM subtrees of the current content block and the initial content block into strings further comprises:
sequentially arranging labels in respective DOM subtrees of the compressed current content block and the compressed initial content block into character strings;
wherein the step of compressing the content block comprises:
assigning a content block CB to a list-item-compressed content block CCB, removing all child nodes from the child node list children of the CCB, and initializing a new child node list cchildren, to be computed, as an empty list;
performing a depth-first traversal of the content block, and recursively applying the list item compression operation to each child node in turn to obtain a compressed child node cchild;
checking whether a child node structurally similar to cchild already exists in cchildren;
adding the compressed cchild to cchildren when no child node structurally similar to cchild is found;
assigning cchildren as the children of the CCB and returning the CCB.
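List item compression can be sketched recursively as below. The sketch assumes that two nodes are "structurally similar" when their tag trees are identical, which is one possible reading; the claim does not define the similarity test. The helper name `signature` is hypothetical.

```python
import xml.etree.ElementTree as ET


def signature(node):
    """Nested tuple describing only the tag structure of a subtree."""
    return (node.tag, tuple(signature(child) for child in node))


def compress(cb):
    """Return a copy of cb in which repeated, structurally equal children
    are collapsed to a single representative (applied recursively)."""
    ccb = ET.Element(cb.tag, dict(cb.attrib))
    ccb.text = cb.text
    cchildren = []
    for child in cb:
        cchild = compress(child)  # depth-first: compress the child first
        if not any(signature(cchild) == signature(kept) for kept in cchildren):
            cchildren.append(cchild)
    ccb.extend(cchildren)
    return ccb


three_items = ET.fromstring("<ul>" + "<li><a/></li>" * 3 + "</ul>")
seven_items = ET.fromstring("<ul>" + "<li><a/></li>" * 7 + "</ul>")
print([el.tag for el in compress(seven_items).iter()])
```

After compression, lists that differ only in the number of entries produce the same tag string, so a change such as the four extra li nodes between p_1 and p_7 no longer affects the structural similarity.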
9. The method of claim 1, wherein the structural features are obtained from the structure of the DOM subtree corresponding to each historical content chunk, comprising:
a ratio of DOM sub-tree height to DOM tree height corresponding to the entire page;
the ratio of the number of nodes of the DOM subtree to the number of nodes of the DOM tree corresponding to the entire page;
ratios of picture nodes in DOM subtrees;
ratios of text nodes in the DOM subtrees;
ratios of hyperlink nodes in the DOM subtrees;
the text features are obtained from the HTML code of each historical content block, including:
a title of the content block;
text contained in the content block;
a ratio of text length contained by the content block to total text length of the page;
common prefixes of hyperlinks in content blocks.
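The structural and text features listed in claim 9 can be computed along the following lines. The denominators are assumptions (node-type ratios are taken relative to the number of nodes in the subtree), "text nodes" are approximated by elements carrying non-blank text because `xml.etree.ElementTree` has no separate text nodes, title extraction is omitted, and the example page is hypothetical.

```python
import os.path
import xml.etree.ElementTree as ET


def height(node):
    """Height of a subtree, counting the node itself as level 1."""
    return 1 + max((height(child) for child in node), default=0)


def text_of(node):
    return " ".join(t.strip() for t in node.itertext() if t.strip())


def block_features(page_root, block):
    nodes = list(block.iter())
    n = len(nodes)
    page_text = text_of(page_root)
    hrefs = [a.get("href") for a in block.iter("a") if a.get("href")]
    return {
        "height_ratio": height(block) / height(page_root),
        "node_ratio": n / sum(1 for _ in page_root.iter()),
        "img_ratio": sum(el.tag == "img" for el in nodes) / n,
        "text_ratio": sum(bool(el.text and el.text.strip()) for el in nodes) / n,
        "link_ratio": sum(el.tag == "a" for el in nodes) / n,
        "text": text_of(block),
        "text_len_ratio": len(text_of(block)) / len(page_text) if page_text else 0.0,
        "link_prefix": os.path.commonprefix(hrefs),  # character-wise prefix
    }


PAGE = ET.fromstring(
    '<html><body><div id="nav"><a href="/about">About</a></div>'
    '<div id="news"><ul><li><a href="/news/1">First item</a></li>'
    '<li><a href="/news/2">Second item</a></li></ul></div></body></html>'
)
features = block_features(PAGE, PAGE.find(".//*[@id='news']"))
print(features["link_prefix"], features["node_ratio"])
```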
10. The method of claim 1, wherein the second feature is obtained from a leaf node of a page DOM tree corresponding to key content within the historical target content block, comprising:
a label of the node;
the length of the node text;
a data pattern of the node text;
id and class attributes of a node.
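The leaf node features of claim 10 can be sketched as a small dictionary. The "data pattern" is not defined in the claim; this sketch assumes it is an abstraction of the text in which runs of letters become `A` and runs of digits become `9`, so that dates or counters with different values share one pattern.

```python
import re
import xml.etree.ElementTree as ET


def data_pattern(text):
    """Assumed pattern: letter runs -> 'A', digit runs -> '9', rest kept."""
    pattern = re.sub(r"[^\W\d_]+", "A", text)
    return re.sub(r"\d+", "9", pattern)


def leaf_features(node):
    text = (node.text or "").strip()
    return {
        "tag": node.tag,
        "text_length": len(text),
        "data_pattern": data_pattern(text),
        "id": node.get("id"),
        "class": node.get("class"),
    }


leaf = ET.fromstring('<span id="d1" class="date">2019-12-05</span>')
print(leaf_features(leaf))
```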
CN201911236209.7A 2019-12-05 2019-12-05 Key content positioning method Active CN111079043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911236209.7A CN111079043B (en) 2019-12-05 2019-12-05 Key content positioning method

Publications (2)

Publication Number Publication Date
CN111079043A CN111079043A (en) 2020-04-28
CN111079043B true CN111079043B (en) 2023-05-12

Family

ID=70313188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911236209.7A Active CN111079043B (en) 2019-12-05 2019-12-05 Key content positioning method

Country Status (1)

Country Link
CN (1) CN111079043B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158973B (en) * 2019-12-05 2021-06-18 北京大学 Web application dynamic evolution monitoring method
CN113626028A (en) * 2020-05-07 2021-11-09 腾讯科技(深圳)有限公司 Page element mapping method and device
CN111741257B (en) * 2020-05-21 2022-01-28 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
CN111966880A (en) * 2020-08-17 2020-11-20 江苏百达智慧网络科技有限公司 Visual website content acquisition method and system
CN112417351B (en) * 2020-10-21 2022-08-19 上海哔哩哔哩科技有限公司 Method and device for determining visual track of user, computer equipment and storage medium
CN112799955B (en) * 2021-02-08 2023-09-26 腾讯科技(深圳)有限公司 Method and device for detecting model change, storage medium and electronic equipment
CN113177168B (en) * 2021-04-29 2023-12-01 上海云扩信息科技有限公司 Positioning method based on Web element attribute characteristics
CN115658993B (en) * 2022-09-27 2023-06-06 观澜网络(杭州)有限公司 Intelligent extraction method and system for core content of webpage
CN116112434B (en) * 2023-04-12 2023-06-09 深圳市网联天下科技有限公司 Router data intelligent caching method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8655913B1 (en) * 2012-03-26 2014-02-18 Google Inc. Method for locating web elements comprising of fuzzy matching on attributes and relative location/position of element
CN103607342A (en) * 2013-11-07 2014-02-26 北京奇虎科技有限公司 Mail content loading method and apparatus
CN109344355A (en) * 2018-09-26 2019-02-15 北京因特睿软件有限公司 Automatic returning detection and Block- matching adaptive approach and device for Web evolution

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727486A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Web forum information extraction system
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
CN102890681B (en) * 2011-07-20 2016-03-09 阿里巴巴集团控股有限公司 A kind of method and system of generating web page stay in place form
CN103514203A (en) * 2012-06-27 2014-01-15 腾讯科技(深圳)有限公司 Method and system for browsing webpage in reading mode
GB2513168B (en) * 2013-04-18 2017-12-27 F Secure Corp Detecting unauthorised changes to website content
TWI570579B (en) * 2015-07-23 2017-02-11 葆光資訊有限公司 An information retrieving method utilizing webpage visual features and webpage language features and a system using thereof
CN109271477B (en) * 2018-09-05 2020-07-24 杭州数湾信息科技有限公司 Method and system for constructing classified corpus by means of Internet

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
D. C. Reis; P. B. Golgher; A. S. Silva; A. F. Laender. Automatic Web News Extraction Using Tree Edit Distance. Proceedings of the 13th International Conference on World Wide Web. 2004, full text. *
YueKui Yang; Yajun Du; Yufeng Hai; Zhaoqiong Gao. A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree. 2009 Asia-Pacific Conference on Information Processing. 2009, full text. *
李朝; 彭宏; 叶苏南; 张欢; 杨亲遥. 基于DOM树的可适应性Web信息抽取. 计算机科学. 2009, Vol. 36 (No. 7), full text. *
王琦; 唐世渭; 杨冬青; 王腾蛟. 基于DOM的网页主题信息自动提取. 计算机研究与发展. 2004, (No. 10), full text. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant