CN109086450B

CN109086450B - Web deep network query interface detection method

Info

Publication number: CN109086450B
Application number: CN201810971193.3A
Authority: CN
Inventors: 于富财; 涂轶文; 章俊; 费高雷
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2018-08-24
Filing date: 2018-08-24
Publication date: 2021-08-27
Anticipated expiration: 2038-08-24
Also published as: CN109086450A

Abstract

The invention discloses a method for detecting Deep Web query interfaces, comprising the following steps: S1, inputting a webpage URL link address; S2, rendering the webpage, converting the presentation of each visible HTML tag into a box model through a layout rendering engine; S3, performing layout blocking; S4, performing block pruning; S5, performing block reconstruction; and S6, outputting the interactive interface. The method mainly uses the layout and style features of the webpage to divide it into regional blocks, and locates the webpage interaction interface by applying corresponding processing rules. The invention also provides an improved method that combines the structural features and text features of an interface, which addresses the low classification accuracy or weak adaptability that results from classifying on structural features alone. In experimental tests, the webpage interaction interface positioning method achieved high positioning accuracy, and the improved interface classification feature set achieved good classification results.

Description

Web deep network query interface detection method

Technical Field

The invention relates to a method for detecting Deep Web query interfaces.

Background

With the rapid development of the Internet, it has become closely tied to people's daily lives. To meet people's growing needs, the Internet has gained strong momentum and presents more and more information. A 2003 report on the scale of the network showed that it held over 20 billion GB of data and was still growing. How to effectively mine and use Web information is an important topic in Internet data mining.

According to how difficult the data is to obtain, webpage information can be divided into two categories: the Surface Web and the Deep Web. The Surface Web is the portion of Web information that a conventional search engine can retrieve; it is typically static and is usually linked from other Web pages through hyperlinks. The Deep Web is the portion of webpage information that a conventional search engine cannot discover and retrieve. Generally speaking, Deep Web information falls into four categories. First, information stored in a site's Web Database (WDB); it is generated dynamically by the site back end after a query form on a site page is filled in and a specific query is submitted. Second, Web pages that no hyperlink points to; because nothing links to them, a traditional search engine cannot index them, and according to statistics such pages account for about 21.3% of the total. Third, access-restricted content; such pages may be inaccessible because of various policies, or may require the permissions of a registered user. Fourth, non-Web documents that cannot be reached on the Web, mainly image files, PDF files, Word documents, and the like.

Depending on how Deep Web information is acquired, Deep Web data mining takes two forms:

(1) Deep Web data integration. This approach mainly applies techniques from the field of data integration.

Deep Web data integration has three main phases: WDB query interface integration, WDB query submission, and WDB query result processing. WDB query interface integration phase: an improved traditional web crawler discovers and fetches WDB query interfaces; each query interface is then interpreted and its schema information extracted; the query interface schemas of different WDBs in the same domain are matched and merged into a unified integrated query interface for that domain. WDB query submission phase: after the user fills in the integrated query interface, the WDBs to be queried are selected in the background, the filled-in query conditions are converted into the conditions each WDB expects, and a targeted query is submitted to each selected WDB. WDB query result processing phase: the returned WDB result pages are parsed, the structured back-end data is extracted, the result information is semantically annotated, the results from all WDBs are deduplicated and merged, and the final results are returned to the user, completing the Deep Web query. Deep Web data integration is a domain-oriented extraction mode: throughout the process, the user only needs to fill in one integrated query interface to query multiple databases in the same domain.

(2) Deep Web surfacing. This way of exploring Deep Web information mainly uses the infrastructure of traditional search engines. Surfacing differs from Deep Web integration in two main ways: the query interface form is filled in automatically, with no manual participation; and queries are submitted in advance, with the results then indexed as static HTML pages, as in a conventional search engine.

Discovering Deep Web query interfaces is the first task of Deep Web information mining, and the accuracy and coverage of query interface discovery directly affect the effectiveness of subsequent processing. Research on Deep Web query interface schema extraction has advanced considerably: it began with studies based on interface elements and gradually turned to studies based on element grouping relations, and it has gradually shifted from rule-based methods to methods based on machine learning. In the Deep Web integration direction, schema extraction based on element grouping benefits subsequent schema matching and integration. In the Deep Web surfacing direction, it helps an automatic form-filling crawler understand the form; in particular, when interface elements constrain one another, grouping-based schema extraction helps the crawler narrow the target search range and improves the hit rate of effective form queries. Extraction based on machine learning helps improve the adaptability of the algorithm. How to make effective use of the design information presented by a Web page, and to provide a more adaptable and more stable Deep Web query interface schema extraction algorithm, is key to improving the accuracy and efficiency of schema extraction and is also a focus and difficulty of current research.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art by providing a method for detecting Deep Web query interfaces that uses the layout and style features of the webpage to divide it into regional blocks and locates the webpage interaction interface by applying corresponding processing rules, with high positioning accuracy.

The purpose of the invention is realized by the following technical scheme: a Web deep network query interface detection method comprises the following steps:

S1, inputting a webpage URL link address;

S2, rendering the webpage, and converting the display mode of the HTML visual label into a box model through a layout rendering engine;

S3, layout and blocking are carried out;

S4, carrying out block pruning;

S5, block reconstruction is carried out;

and S6, outputting an interactive interface.

Further, the rules for the layout blocking performed in step S3 are:

(1) if the HTML element tag corresponding to block EB is <form>, the current block is not divided;

(2) if the interface element density of block EB is lower than a preset threshold, the current block is divided; otherwise, it is further judged whether the region covered by the current block contains multiple interactive interface areas: if so, the current block is divided; otherwise, it is not divided;

(3) if block EB contains sub-blocks with different background colors, the current block is divided;

(4) if block EB contains a separator sub-block, the current block is divided with the separator as the boundary.
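The four rules above can be sketched as a single decision function. This is a minimal illustration, not the patented implementation. Several parts are assumptions: the `Block` structure and its fields, the density definition (the share of direct sub-blocks that are interface elements), the default threshold of 0.5, and the use of rules (3) and (4) as the test for "multiple interactive interface areas" in rule (2).

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Tags from the interface element set ES; radio, checkbox and text are <input> types.
INTERFACE_TAGS = {"input", "select", "textarea", "button"}


@dataclass
class Block:
    """Hypothetical stand-in for a rendered block (box model plus its sub-blocks)."""
    tag: str
    background: Optional[str] = None   # computed background colour, if known
    is_separator: bool = False         # how separators are recognised is an assumption
    children: List["Block"] = field(default_factory=list)


def interface_density(block: Block) -> float:
    """Assumed definition: fraction of direct sub-blocks that are interface elements."""
    if not block.children:
        return 0.0
    hits = sum(1 for child in block.children if child.tag in INTERFACE_TAGS)
    return hits / len(block.children)


def should_divide(block: Block, density_threshold: float = 0.5) -> bool:
    """Return True if the block should be divided further under rules (1)-(4)."""
    if block.tag == "form":                      # rule (1): a <form> is kept whole
        return False
    if not block.children:                       # a leaf has nothing to divide
        return False
    if interface_density(block) < density_threshold:   # rule (2): too much noise
        return True
    colours = {c.background for c in block.children if c.background is not None}
    if len(colours) > 1:                         # rule (3): differing backgrounds
        return True
    return any(c.is_separator for c in block.children)  # rule (4): separator present
```

A crawler would call `should_divide` on each block from the root downward and recurse into the sub-blocks of every block for which it returns `True`.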

Further, the block pruning in step S4 is implemented as follows:

S41, selecting a block from the layout blocking result set and judging whether it contains an interface element; if so, keeping the block and marking it as processed; otherwise, executing step S42;

S42, judging whether the distance between the block and the bottom of the Web page is smaller than a preset threshold; if so, deleting the block; otherwise, executing step S43;

S43, judging whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, executing step S44; otherwise, deleting the block;

S44, judging whether the graphic element density of the block is greater than a preset proportionality coefficient γ1; if so, deleting the block; otherwise, further judging whether the link density of the block is greater than a preset threshold γ2; if so, deleting the block; otherwise, keeping the block and marking it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;

S45, checking whether any unprocessed block remains in the layout blocking set; if so, returning to step S41; otherwise, outputting all blocks remaining in the set.
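Steps S41 to S45 amount to one pass over the block set with a chain of early exits. The sketch below assumes each block has already been reduced to the five measurements the steps need. The `PruneCandidate` type, its field names, and the default threshold values are illustrative assumptions, since the text only says the thresholds are preset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PruneCandidate:
    """Hypothetical per-block measurements taken from the rendered page."""
    has_interface_element: bool
    distance_to_bottom: float   # distance from the block to the page bottom, e.g. in px
    has_text_node: bool
    graphic_density: float      # in [0, 1]
    link_density: float         # in [0, 1]


def prune(blocks: List[PruneCandidate],
          bottom_threshold: float = 100.0,   # assumed value
          gamma1: float = 0.5,               # assumed value, 0 <= gamma1 <= 1
          gamma2: float = 0.5) -> List[PruneCandidate]:
    """Return the blocks that survive pruning, in their original order."""
    kept = []
    for block in blocks:                                  # S45: visit every block once
        if block.has_interface_element:                   # S41: keep interface blocks
            kept.append(block)
            continue
        if block.distance_to_bottom < bottom_threshold:   # S42: too close to page bottom
            continue
        if not block.has_text_node:                       # S43: no text, no semantics
            continue
        if block.graphic_density > gamma1:                # S44: image-heavy block
            continue
        if block.link_density > gamma2:                   # S44: navigation-like block
            continue
        kept.append(block)                                # text block worth keeping
    return kept
```

Iterating over the list replaces the explicit "mark as processed" bookkeeping of S41 and S45, because every block is examined exactly once.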

The beneficial effects of the invention are:

1. On the problem of locating Deep Web query interfaces, the invention proposes a method for locating webpage interaction interfaces based on the visual information of the webpage design. The positioning method mainly uses the layout and style features of the webpage to divide it into regional blocks, and locates the webpage interaction interface by applying corresponding processing rules. It avoids the limitation of prior methods, which rely on <form> tags to locate interfaces.

2. On the problem of identifying Deep Web query interfaces, an improved method combining the structural features and text features of the interface is provided, which mitigates the low classification accuracy or weak adaptability that results from classifying on structural features alone. In experimental tests, the webpage interaction interface positioning method achieved high positioning accuracy, and the improved interface classification feature set achieved good classification results.

Drawings

FIG. 1 is a flow chart of a Web deep Web query interface detection method of the present invention;

FIG. 2 is a schematic diagram of an element box model;

FIG. 3 is a flow chart of the present invention for performing block pruning.

Detailed Description

The invention addresses the problem of Deep Web query interface discovery, and mainly studies two sub-problems: webpage interaction interface positioning and Deep Web query interface identification. The invention provides a Deep Web query interface positioning method based on visual information, which avoids the dependence of conventional positioning methods on the <form> tag. Building on this positioning method, the Deep Web query interface identification problem is then studied further: the structural and text features of the webpage interaction interface are combined for query interface identification, which addresses the low identification accuracy caused by using interface structural features alone. The technical scheme of the invention is further explained below with reference to the drawings.

As shown in FIG. 1, a method for detecting a Web deep Web query interface includes the following steps:

S1, inputting a webpage URL link address;

(1) Element Box Model: the element box model is the way HTML elements styled with CSS are rendered. After rendering by a browser, each visible HTML element is presented as a box model, as shown in FIG. 2. Content is the actual content of the HTML element; Padding is the inner spacing, which directly surrounds the content and mainly shows the element's background; Border is the border around the padding; Margin is the outer spacing, which is transparent by default.

According to the W3C design specifications, after an HTML web page passes through the browser's layout rendering engine, every visible HTML tag is laid out on the page as a box model, finally producing the actual web page the user sees.
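To make the four layers of the box model concrete, the following sketch computes the rectangle sizes a layout engine would report for one element. The `Edges` and `BoxModel` names are illustrative. The arithmetic assumes the CSS default `box-sizing: content-box`, where width and height describe the content area only.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class Edges:
    """Thickness of one layer (padding, border or margin) on each side, in px."""
    top: float = 0.0
    right: float = 0.0
    bottom: float = 0.0
    left: float = 0.0


@dataclass
class BoxModel:
    content_width: float
    content_height: float
    padding: Edges = field(default_factory=Edges)
    border: Edges = field(default_factory=Edges)
    margin: Edges = field(default_factory=Edges)

    def border_box_size(self) -> Tuple[float, float]:
        """Size of the visible box: content plus padding plus border."""
        width = (self.content_width + self.padding.left + self.padding.right
                 + self.border.left + self.border.right)
        height = (self.content_height + self.padding.top + self.padding.bottom
                  + self.border.top + self.border.bottom)
        return width, height

    def margin_box_size(self) -> Tuple[float, float]:
        """Space the element occupies in the layout: border box plus margins."""
        width, height = self.border_box_size()
        return (width + self.margin.left + self.margin.right,
                height + self.margin.top + self.margin.bottom)
```

For a 100 by 50 content area with 10 px padding, a 1 px border and 5 px margins on every side, the border box is 122 by 72 and the margin box is 132 by 82.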

(2) Interaction Interface Element (Interface Element): an interactive interface element, as defined here, is an HTML element that characterizes interface information. Specifically, it belongs to the set ES = {input, radio, checkbox, text, select, textarea, button}, where radio, checkbox, and text are specific types of the <input> element, listed separately to highlight their importance. In general, a web page interactive interface contains at least one element of ES. The set ES is in fact only a subset of the elements that web page interaction interfaces can contain; choosing a group of elements common to most interfaces as the positioning reference reduces the complexity of the problem.

(3) Interaction Interface Area (Interface Area): the interactive interface area is the rectangular region of the smallest box model that contains all the interactive interface elements. In particular, for a form, it is the rectangular region occupied by the box model of the <form> tag element.
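Membership in the set ES from definition (2) can be checked directly on the markup. The sketch below uses the `html.parser` module from the Python standard library. The class and function names are illustrative, and reporting `<input>` elements together with their `type` attribute is a choice made here, not something the text prescribes.

```python
from html.parser import HTMLParser
from typing import List

# Tag names in ES; radio, checkbox and text appear as <input type="...">.
INTERFACE_TAGS = {"input", "select", "textarea", "button"}


class InterfaceElementCollector(HTMLParser):
    """Collects interface elements in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.found: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERFACE_TAGS:
            return
        if tag == "input":
            # An <input> without a type attribute behaves as a text field.
            input_type = (dict(attrs).get("type") or "text").lower()
            self.found.append("input:" + input_type)
        else:
            self.found.append(tag)


def interface_elements(html: str) -> List[str]:
    collector = InterfaceElementCollector()
    collector.feed(html)
    collector.close()
    return collector.found
```

A block for which `interface_elements` returns an empty list contains no member of ES and cannot be an interactive interface on its own.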

The problem that webpage interactive interface positioning must solve is finding the interactive interface area. This is possible because of a regularity of webpage design: nested HTML tags are generally also presented as nested box models in the page layout. Webpage rendering is the first step of the positioning method. First a webpage URL link address is input; then the layout rendering engine converts the presentation of each visible HTML tag into a box model, and all subsequent steps operate on the resulting box models.
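Because box models nest the same way the tags do, the interface area of definition (3) can be found by walking down the tree for as long as a single child still contains every interface element. The sketch below is one possible reading of that definition. The `Node` type and its `rect` field are hypothetical stand-ins for what a rendering engine would supply.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

INTERFACE_TAGS = {"input", "select", "textarea", "button"}


@dataclass
class Node:
    """Hypothetical rendered element: tag, rectangle (x, y, width, height), children."""
    tag: str
    rect: Tuple[float, float, float, float] = (0.0, 0.0, 0.0, 0.0)
    children: List["Node"] = field(default_factory=list)


def count_interface(node: Node) -> int:
    """Number of interface elements in the subtree rooted at node."""
    own = 1 if node.tag in INTERFACE_TAGS else 0
    return own + sum(count_interface(child) for child in node.children)


def interface_area(root: Node) -> Optional[Node]:
    """Smallest nested box that still contains every interface element under root."""
    total = count_interface(root)
    if total == 0:
        return None
    node = root
    while node.tag != "form":          # a <form> box is taken as the area directly
        child = next((c for c in node.children if count_interface(c) == total), None)
        if child is None:              # elements are spread over several children
            break
        node = child
    return node
```

The returned node's `rect` is the interface area. Recounting the subtree at every level keeps the sketch short at the cost of repeated work; a real implementation would cache the counts.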

S3, layout and blocking are carried out. The layout blocking method provided by the invention mainly uses the style and layout features of blocks. It decides whether a block needs to be divided according to the degree of association among the sub-blocks it contains and the features of the block itself. The heuristic rules applied in the division are explained below:

Rule (1) reflects the conventional <form>-tag-based positioning method. The purpose of blocking is to find the elements that belong to the same interactive interface; a region located by a <form> tag is, to a certain extent, already a webpage interactive interface, so it does not need to be divided further.

Rule (2) mainly uses the layout features of the webpage interactive interface: in an interactive interface block, the interface element sub-blocks it contains should usually occupy a considerable proportion. If the interactive interface element density of a block is low, this suggests that the block may contain a good deal of interfering (non-interface) information and needs to be divided, which reduces the influence of non-interface information on the subsequent identification and classification of Deep Web query interfaces and reduces experimental error. If the density is high, it must be further judged whether the region covered by the block contains multiple interactive interface areas.

Rule (3) is based on an observation about web page design: blocks with different background colors usually have different semantics. Web page designers generally distinguish the semantic blocks of a page in this way so that users can intuitively locate the content of interest. Rule (4) is similar to rule (3): a separator usually also indicates that the blocks on either side of it have different semantics. Rules (3) and (4) make it possible to judge visually whether the current block contains multiple semantic blocks, which helps to some extent with dividing multiple densely arranged interaction interface areas.
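Once rule (3) or rule (4) fires, the sub-blocks have to be regrouped. A minimal sketch of that regrouping follows, using `itertools.groupby` on the ordered list of sub-blocks. How a separator or a background color is read from a sub-block is left to the caller through the `is_separator` and `background_of` functions, which are assumptions, because the text does not say how either is detected.

```python
from itertools import groupby
from typing import Callable, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def split_at_separators(children: Sequence[T],
                        is_separator: Callable[[T], bool]) -> List[List[T]]:
    """Rule (4): cut the ordered sub-blocks at every separator and drop the separators."""
    groups = []
    for sep, run in groupby(children, key=is_separator):
        if not sep:
            groups.append(list(run))
    return groups


def split_by_background(children: Sequence[T],
                        background_of: Callable[[T], Hashable]) -> List[List[T]]:
    """Rule (3): group consecutive sub-blocks that share a background colour."""
    return [list(run) for _, run in groupby(children, key=background_of)]
```

For example, splitting the tag sequence `input, label, hr, select, hr, hr, button` with `<hr>` treated as the separator gives three groups: `[input, label]`, `[select]`, and `[button]`.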

S4, carrying out block pruning. Layout blocking gives a preliminary division of the webpage into multiple visible regions. The aim of the invention is to find, among these data regions, the potential Deep Web query interface regions, that is, the webpage interaction interface regions that need to be located and identified. For this method, all other irrelevant data regions are noise and must be deleted, which also improves processing efficiency.

Through extensive experimental observation and analysis, it has been found that the web page interaction interface region generally has the following characteristics:

(1) The web page interaction interface area is not located at the bottom of the web page.

(2) The web page interaction interface area does not have a large number of pictures and hyperlinks.

As shown in FIG. 3, the following pruning method is proposed on the basis of these observations.

If a block does not contain an interactive interface element, it intuitively cannot be an interactive interface area, but it may in fact be part of one: such a block can semantically annotate the related interface elements in the interactive interface area, providing semantic information for the interface. This information is important for the subsequent identification and classification of Deep Web query interfaces and needs to be kept. Extensive experimental observation shows that interface semantic information is usually presented as text nodes, so whether a block provides semantic information can be judged by whether it contains a text node. Furthermore, even blocks that contain text nodes are not all interactive interface semantic information; the possibility that such a block is the body text or navigation area of some kind of webpage can be ruled out by measuring its graphic element density and link density.

Specifically, the flow of the block pruning method is shown in FIG. 3. First, the layout blocking result set is checked for unprocessed blocks; if there are none, the remaining block set is output. If there are unprocessed blocks, an unprocessed block EB is selected from the result set and checked for interface elements. If it has one, it is kept and marked as processed. If not, the distance between EB and the bottom of the page is checked: if it is less than the threshold, EB is deleted directly; if it is greater than or equal to the threshold, EB is checked for text nodes. If EB contains no text node, it is deleted. If it does, EB is deleted when its graphic element density is greater than γ1 or its link density is greater than γ2; otherwise it is kept and marked as processed.
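The text does not define how graphic element density and link density are measured. The sketch below shows one plausible pair of definitions, both assumptions: graphic element density as the share of elements in the block that are `<img>` tags, and link density as the share of text characters that sit inside `<a>` tags. Both values fall in [0, 1], matching the stated range of γ1 and γ2.

```python
from html.parser import HTMLParser
from typing import Tuple


class DensityMeter(HTMLParser):
    """Counts elements, images, text characters and linked text characters."""

    def __init__(self) -> None:
        super().__init__()
        self.elements = 0
        self.images = 0
        self.text_chars = 0
        self.link_chars = 0
        self._link_depth = 0   # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        self.elements += 1
        if tag == "img":
            self.images += 1
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        self.text_chars += length
        if self._link_depth > 0:
            self.link_chars += length


def densities(block_html: str) -> Tuple[float, float]:
    """Return (graphic element density, link density) under the assumed definitions."""
    meter = DensityMeter()
    meter.feed(block_html)
    meter.close()
    graphic = meter.images / meter.elements if meter.elements else 0.0
    link = meter.link_chars / meter.text_chars if meter.text_chars else 0.0
    return graphic, link
```

Under these definitions a navigation bar scores a link density close to 1 and an image gallery scores a high graphic element density, so both would be deleted by the final check of the pruning flow.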

S5, block reconstruction is carried out;

When all blocks have been marked as processed, division and pruning are complete. Block reconstruction is then required, for two purposes:

First, to correct excessive division in the layout blocking stage. In that stage, heuristic rules decide whether a block needs to be divided further; however, because the problem is heterogeneous and unstructured, the rules cannot cover every situation, and further processing is needed to reduce errors. In particular, if the interface element density of the block corresponding to an interactive interface area does not meet the threshold, the area may be divided further, producing an erroneous division. To resolve the incomplete interface information caused by excessive division, closely related blocks need to be merged and reconstructed, finally locating the webpage interactive interface area.

Second, to further screen the blocks that contain no interactive interface elements, selecting those more likely to carry interactive interface area information.

The block reconstruction mainly utilizes the visual design characteristics among blocks, and when a webpage designer designs a webpage, the webpage data area is usually divided by visual characteristics, so that the visual design characteristics provide a good guiding function for Web data mining.
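The text does not state which visual features decide that two blocks are closely related. As a purely illustrative sketch, the code below merges vertically neighbouring blocks when the gap between them is small and their background colors match. The `Rect` type, the `max_gap` value, and the merge criterion itself are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rect:
    """Hypothetical block rectangle in page coordinates, with its background colour."""
    x: float
    y: float
    w: float
    h: float
    background: str = ""


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle covering both a and b; keeps a's background."""
    left, top = min(a.x, b.x), min(a.y, b.y)
    right = max(a.x + a.w, b.x + b.w)
    bottom = max(a.y + a.h, b.y + b.h)
    return Rect(left, top, right - left, bottom - top, a.background)


def merge_close_blocks(blocks: List[Rect], max_gap: float = 10.0) -> List[Rect]:
    """Merge blocks that follow each other vertically with a small gap and the same background."""
    merged: List[Rect] = []
    for block in sorted(blocks, key=lambda r: r.y):
        if merged:
            last = merged[-1]
            gap = block.y - (last.y + last.h)
            if gap <= max_gap and block.background == last.background:
                merged[-1] = union(last, block)
                continue
        merged.append(block)
    return merged
```

With this criterion, a label block that was split off from the input field directly beneath it is rejoined, while a block separated by a large gap or a change of background stays apart.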

S6, outputting an interactive interface.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims

1. A Web deep network query interface detection method is characterized by comprising the following steps:

S1, inputting a webpage URL link address;

S3, layout and blocking are carried out;

S4, carrying out block pruning; the specific implementation method comprises the following steps:

S45, checking whether unprocessed blocks exist in the layout block set, if so, returning to the step S41, otherwise, outputting all block sets in the layout block set;

S5, block reconstruction is carried out;

and S6, outputting an interactive interface.

2. The method for detecting the Web deep Web query interface as claimed in claim 1, wherein the rule of the step S3 for layout blocking is: