CN109086450B - Web deep network query interface detection method - Google Patents

Info

Publication number
CN109086450B
Authority
CN
China
Prior art keywords
block
interface
layout
webpage
web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810971193.3A
Other languages
Chinese (zh)
Other versions
CN109086450A (en)
Inventor
于富财
涂轶文
章俊
费高雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/72 Code refactoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for detecting deep Web query interfaces, comprising the following steps: S1, inputting a web page URL; S2, rendering the web page, with a layout rendering engine converting the presentation of each visible HTML tag into a box model; S3, performing layout partitioning; S4, performing block pruning; S5, performing block reconstruction; and S6, outputting the interactive interface. The method mainly uses the layout and style features of web page data to partition the page into regions, and locates the web page interactive interface by applying corresponding processing rules. The invention also provides an improved method that combines the structural features and text features of an interface, which addresses the low classification accuracy or poor adaptability that results from classifying on structural features alone. In experimental tests, the web page interactive interface positioning method achieved high positioning accuracy, and the improved interface classification feature set achieved good classification results.

Description

Web deep network query interface detection method
Technical Field
The invention relates to a method for detecting deep Web query interfaces.
Background
With the rapid development of the Internet, it has become closely tied to people's daily lives. To meet people's growing needs, the Internet has itself developed vigorously and now presents ever more information. A 2003 study of network scale reported that the network held more than 20 billion GB of data and that the volume was still growing. How to mine and use Web information effectively is an important topic in Internet data mining.
Web page information can be divided into two categories according to how difficult it is to obtain: the Surface Web and the Deep Web. The Surface Web is the portion of Web information that a conventional search engine can retrieve; it is typically static and is usually linked from other Web pages through hyperlinks. The Deep Web is the portion that a conventional search engine cannot discover and retrieve. Generally speaking, deep Web information covers four kinds of content. First, information stored in a site's Web database (WDB), which the site back end generates dynamically after a query form on a site page is filled in and a specific query is submitted. Second, Web pages that no hyperlink points to; without inbound hyperlinks a conventional search engine cannot index them, and such pages are reported to account for about 21.3% of the total. Third, access-restricted content: pages that cannot be accessed because of policy restrictions, or that can be accessed only by registered users. Fourth, non-Web documents that cannot be accessed on the Web, mainly image files, PDF files, Word documents and the like.
Depending on how deep Web information is acquired, deep Web data mining takes two forms:
(1) Deep Web data integration. This approach mainly applies data integration techniques.
Deep Web data integration is divided into three main stages: the WDB query interface integration stage, the WDB query submission stage and the WDB query result processing stage. WDB query interface integration stage: an improved conventional web crawler discovers and obtains WDB query interfaces; the query interfaces are then interpreted and their schema information is extracted; the schemas of different WDB query interfaces from the same domain are matched; and a unified integrated query interface for that domain is produced. WDB query submission stage: by filling in the integrated query interface, the user selects the back-end WDBs to be queried; the query conditions entered are translated into the corresponding query conditions of each WDB, and a targeted query is submitted to each selected WDB. WDB query result processing stage: the returned WDB query result pages are analyzed, the structured back-end data are extracted, the query result information is semantically annotated, the results from all WDBs are deduplicated and integrated, and the final results are returned to the user, completing the deep Web query. Deep Web data integration is a domain-oriented extraction approach: throughout the process, the user queries multiple databases in the same domain simply by filling in one integrated query interface.
(2) Deep Web surfacing. This approach to exploring deep Web information mainly relies on the infrastructure of conventional search engines. It differs from deep Web data integration in two main ways: surfacing fills in query interface forms automatically, without manual participation; and surfacing submits queries in advance, obtains the query results, and then indexes them as static HTML pages in the same way as a conventional search engine.
Discovering deep Web query interfaces is the primary task of deep Web information mining, and the accuracy and coverage of query interface discovery directly affect the effectiveness of subsequent processing. Research on schema extraction for deep Web query interfaces has progressed considerably: it began with interface elements and gradually turned to the grouping relations among elements, and it has shifted gradually from rule-based techniques to machine learning techniques. For deep Web integration, schema extraction based on element grouping benefits subsequent schema matching and integration. For deep Web surfacing, it helps automatic form-filling crawlers understand forms; in particular, when constraints exist among interface elements, it helps the crawler narrow the target search range and raises the hit rate of valid form queries. Extraction based on machine learning improves the adaptability of the algorithms. How to make effective use of the design information presented by a Web page and provide a more adaptable and more stable schema extraction algorithm for deep Web query interfaces is key to improving the accuracy and efficiency of schema extraction, and is both a focus and a difficulty of current research.
Disclosure of Invention
The purpose of the invention is to overcome the shortcomings of the prior art by providing a deep Web query interface detection method that uses the layout and style features of web page data to partition the page into regions and locates the web page interactive interface by applying corresponding processing rules, with high positioning accuracy.
The purpose of the invention is achieved by the following technical solution: a deep Web query interface detection method comprising the following steps:
S1, inputting a web page URL;
S2, rendering the web page, with a layout rendering engine converting the presentation of each visible HTML tag into a box model;
S3, performing layout partitioning;
S4, performing block pruning;
S5, performing block reconstruction;
S6, outputting the interactive interface.
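The six steps can be read as a simple pipeline. The following is a minimal sketch of that flow only, not the patented implementation: the function name `detect_interfaces` and the four stage callables are hypothetical, and the toy stand-ins at the bottom replace a real rendering engine so that the sketch runs on its own.

```python
from typing import Callable, List, TypeVar

Block = TypeVar("Block")


def detect_interfaces(
    url: str,
    render: Callable[[str], Block],
    partition: Callable[[Block], List[Block]],
    prune: Callable[[List[Block]], List[Block]],
    reconstruct: Callable[[List[Block]], List[Block]],
) -> List[Block]:
    """Chain steps S1 to S6; every stage is supplied by the caller."""
    root = render(url)         # S1 + S2: URL in, box-model tree out
    blocks = partition(root)   # S3: layout partitioning
    kept = prune(blocks)       # S4: block pruning
    return reconstruct(kept)   # S5 + S6: reconstructed blocks are the output


# Toy stand-ins (hypothetical) so the sketch runs without a browser.
result = detect_interfaces(
    "http://example.com/search",
    render=lambda url: ["header", "search form", "footer"],
    partition=lambda root: list(root),
    prune=lambda blocks: [b for b in blocks if "form" in b],
    reconstruct=lambda blocks: blocks,
)
print(result)  # ['search form']
```

Passing the stages in as callables keeps the control flow of S1 to S6 separate from any particular rendering engine or rule set.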
Further, the layout partitioning in step S3 follows these rules:
(1) if the HTML element tag corresponding to block EB is <form>, the current block is not divided;
(2) if the interface element density of block EB is below a preset threshold, the current block is divided; otherwise, it is further determined whether the current block contains multiple interactive interface areas, and the block is divided if it does and left undivided if it does not;
(3) if block EB contains sub-blocks with different background colors, the current block is divided;
(4) if block EB contains a separator sub-block, the current block is divided with the separator as the boundary.
Further, the block pruning in step S4 is implemented as follows:
S41, selecting one block from the layout partitioning result set and determining whether it contains an interface element; if so, retaining the block and marking it as processed; otherwise, executing step S42;
S42, determining whether the distance between the block and the bottom of the Web page is smaller than a preset threshold; if so, deleting the block; otherwise, executing step S43;
S43, determining whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, executing step S44; otherwise, deleting the block;
S44, determining whether the graphic element density of the block is greater than a preset proportionality coefficient γ1; if so, deleting the block; otherwise, further determining whether the link density of the block is greater than a preset threshold γ2; if so, deleting the block; otherwise, retaining the block and marking it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;
S45, checking whether any unprocessed block remains in the layout partitioning result set; if so, returning to step S41; otherwise, outputting all blocks remaining in the set.
The invention has the following beneficial effects:
1. For the problem of locating deep Web query interfaces, the invention proposes a novel method that locates web page interactive interfaces based on the visual information of the web page design. The method mainly uses the layout and style features of web page data to partition the page into regions, and locates the web page interactive interface by applying corresponding processing rules. It avoids the limitation of existing methods, which rely on <form> tags to locate interfaces.
2. For the problem of identifying deep Web query interfaces, the invention proposes an improved method that combines the structural features and text features of an interface, mitigating the low classification accuracy or poor adaptability caused by classifying on structural features alone. In experimental tests, the web page interactive interface positioning method achieved high positioning accuracy, and the improved interface classification feature set achieved good classification results.
Drawings
FIG. 1 is a flow chart of the deep Web query interface detection method of the present invention;
FIG. 2 is a schematic diagram of the element box model;
FIG. 3 is a flow chart of block pruning according to the present invention.
Detailed Description
The invention addresses the problem of discovering deep Web query interfaces and studies two sub-problems: locating web page interactive interfaces and identifying deep Web query interfaces. It proposes a positioning method based on visual information, which avoids the dependence of conventional positioning methods on the <form> tag. Building on that positioning method, it extends the study to query interface identification, combining the structural and text features of the web page interactive interface so as to overcome the low identification accuracy that results from using structural features alone. The technical solution of the invention is further explained below with reference to the drawings.
As shown in FIG. 1, a deep Web query interface detection method comprises the following steps:
S1, inputting a web page URL;
S2, rendering the web page, with a layout rendering engine converting the presentation of each visible HTML tag into a box model;
(1) Element box model (Element Box Model): the element box model is the way in which HTML elements styled by CSS are rendered. After rendering by a browser, the presentation of each visible HTML element is described by a box model, as shown in FIG. 2, in which Content is the actual content of the HTML element; Padding is the inner spacing that directly surrounds the content and mainly shows the element's background; Border is the border that surrounds the padding; and Margin is the outer spacing, which is transparent by default.
According to the W3C specifications, after an HTML web page passes through the browser's layout rendering engine, each visible HTML tag is laid out on the page as a box model, which yields the web page the user actually sees.
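As an illustration of how the four layers of the box model nest, the sketch below computes the size of the border box and the margin box from the content size. The class name `ElementBox` and the use of a single uniform width for each layer are simplifying assumptions made for this example; in CSS each side can have its own value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementBox:
    """Box model of one rendered element (uniform edge widths assumed)."""
    content_width: float
    content_height: float
    padding: float = 0.0
    border: float = 0.0
    margin: float = 0.0

    def border_box_size(self):
        # Content plus padding and border on both sides.
        extra = 2 * (self.padding + self.border)
        return (self.content_width + extra, self.content_height + extra)

    def margin_box_size(self):
        # Border box plus the transparent margin on both sides.
        width, height = self.border_box_size()
        return (width + 2 * self.margin, height + 2 * self.margin)


box = ElementBox(content_width=200, content_height=30, padding=4, border=1, margin=8)
print(box.border_box_size())  # (210, 40)
print(box.margin_box_size())  # (226, 56)
```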
(2) Interactive interface element (Interface Element): as defined here, an interactive interface element is an HTML element of a type that characterizes interface information, namely a member of the set ES = {input, radio, checkbox, text, select, textarea, button}, where radio, checkbox and text are specific types of the <input> element and are listed separately to emphasize their importance. A web page interactive interface generally contains at least one element of ES. ES is in fact only a subset of all the elements that can appear in a web page interactive interface; choosing a group of elements common to most interfaces as the positioning reference reduces the complexity of the problem.
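Membership in ES can be checked directly on the HTML source. The following sketch counts ES elements in a fragment with Python's standard `html.parser` module. Because radio, checkbox and text are types of `<input>`, matching the `input` tag already covers them. Skipping `type="hidden"` inputs is an assumption made here on the grounds that they produce no visible box; the patent text does not mention them.

```python
from html.parser import HTMLParser

# Tags covering the ES set; radio, checkbox and text are <input> types.
INTERFACE_TAGS = {"input", "select", "textarea", "button"}


class InterfaceElementCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag not in INTERFACE_TAGS:
            return
        input_type = (dict(attrs).get("type") or "text").lower()
        # Assumption: hidden inputs are not rendered, so they are skipped.
        if tag == "input" and input_type == "hidden":
            return
        self.count += 1


def count_interface_elements(html: str) -> int:
    parser = InterfaceElementCounter()
    parser.feed(html)
    parser.close()
    return parser.count


fragment = (
    '<div><input type="text" name="q"><input type="hidden" name="t">'
    '<select><option>a</option></select><button>Go</button>'
    '<a href="#">help</a></div>'
)
print(count_interface_elements(fragment))  # 3
```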
(3) Interactive interface area (Interface Area): the interactive interface area is the rectangular region of the smallest box model that contains all the interactive interface elements. For a form in particular, it is the rectangular region occupied by the box model of the <form> element.
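One way to read this definition is as a descent through the nested boxes: starting at the root, keep moving into the child that still contains every interface element, and stop when no single child does or when a `<form>` is reached. The sketch below follows that reading on a simplified tree. `BoxNode` and `interface_area` are hypothetical names, and geometry is omitted because nesting alone decides the result.

```python
from dataclasses import dataclass, field
from typing import List, Optional

INTERFACE_TAGS = {"input", "select", "textarea", "button"}


@dataclass
class BoxNode:
    tag: str
    children: List["BoxNode"] = field(default_factory=list)


def count_interface_elements(node: BoxNode) -> int:
    own = 1 if node.tag in INTERFACE_TAGS else 0
    return own + sum(count_interface_elements(c) for c in node.children)


def interface_area(root: BoxNode) -> Optional[BoxNode]:
    """Return the smallest box that still contains every interface element."""
    total = count_interface_elements(root)
    if total == 0:
        return None
    node = root
    while node.tag != "form":  # a <form> box is taken as the area itself
        inner = next(
            (c for c in node.children if count_interface_elements(c) == total),
            None,
        )
        if inner is None:  # elements are spread over several children
            break
        node = inner
    return node


form = BoxNode("form", [BoxNode("input"), BoxNode("select"), BoxNode("button")])
page = BoxNode("body", [BoxNode("div"), BoxNode("div", [form])])
print(interface_area(page).tag)  # form
```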
The task of web page interactive interface positioning is to find the interactive interface area. This is possible because of a web page design convention: nested HTML tags are generally also presented as nested box models in the page layout. Web page rendering is the first step of the positioning method: a web page URL is input, a layout rendering engine converts the presentation of each visible HTML tag into a box model, and all subsequent steps operate on the resulting box models.
S3, performing layout partitioning. The layout partitioning method of the invention mainly uses the style and layout features of blocks. Whether a block needs to be divided is decided from the degree of association among the sub-blocks it contains and from the features of those sub-blocks. The heuristic rules applied are as follows:
(1) if the HTML element tag corresponding to block EB is <form>, the current block is not divided;
Rule (1) reflects the conventional <form>-tag-based positioning method. The purpose of partitioning is to find the elements that belong to the same interactive interface, and a <form> element located by the conventional method is, to a certain extent, already a web page interactive interface, so it need not be divided further.
(2) if the interface element density of block EB is below a preset threshold, the current block is divided; otherwise, it is further determined whether the current block contains multiple interactive interface areas, and the block is divided if it does and left undivided if it does not;
Rule (2) mainly uses a layout feature of web page interactive interfaces: in an interactive interface block, the interface element sub-blocks should normally occupy a considerable proportion. If the density of interactive interface elements in a block is low, this suggests that the block may contain a good deal of interfering (non-interface) information, and it should be divided so as to reduce the influence of that information on the subsequent identification and classification of deep Web query interfaces and thus reduce experimental error. If the density is high, it must be further determined whether the block contains multiple interactive interface areas.
(3) if block EB contains sub-blocks with different background colors, the current block is divided;
(4) if block EB contains a separator sub-block, the current block is divided with the separator as the boundary.
Rule (3) is based on an observation about web page design: blocks with different background colors usually carry different semantics. Web page designers generally distinguish the semantic blocks of a page in this way so that users can intuitively locate the content they are interested in. Rule (4) is similar: a separator usually indicates that the blocks on either side of it have different semantics. Rules (3) and (4) make it possible to judge visually whether the current block contains multiple semantic blocks, which goes some way toward solving the problem of dividing several densely arranged interactive interface areas.
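The four rules can be sketched as a single decision function. Several details below are assumptions made for illustration, because the text does not fix them: interface element density is taken as the share of the block's area covered by interface elements, a separator is taken to be an `<hr>` child, the threshold value 0.3 is arbitrary, and rules (3) and (4) are used as the test for "multiple interactive interface areas" in the second half of rule (2).

```python
from dataclasses import dataclass, field
from typing import List

SEPARATOR_TAGS = {"hr"}  # assumption: what counts as a separator sub-block


@dataclass
class LayoutBlock:
    tag: str
    area: float                 # area of the block's box
    interface_area: float       # area covered by interface elements inside it
    child_backgrounds: List[str] = field(default_factory=list)
    child_tags: List[str] = field(default_factory=list)

    @property
    def interface_density(self) -> float:
        # Assumed definition: covered area divided by block area.
        return self.interface_area / self.area if self.area > 0 else 0.0


def should_divide(block: LayoutBlock, density_threshold: float = 0.3) -> bool:
    if block.tag == "form":                           # rule (1)
        return False
    if block.interface_density < density_threshold:   # rule (2), first half
        return True
    # Rule (2), second half, judged visually through rules (3) and (4).
    mixed_backgrounds = len(set(block.child_backgrounds)) > 1
    has_separator = any(t in SEPARATOR_TAGS for t in block.child_tags)
    return mixed_backgrounds or has_separator


dense = LayoutBlock("div", 100, 60, ["#fff", "#fff"], ["input", "hr", "input"])
print(should_divide(dense))  # True
```

A block for which `should_divide` returns True would be replaced by its sub-blocks and the test repeated on each of them until every remaining block is left undivided.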
S4, performing block pruning. Layout partitioning gives a preliminary division of the web page into a number of visible regions. The aim of the invention is to find, among these data regions, the potential regions of deep Web query interfaces, that is, the web page interactive interface areas that need to be located and identified. For this method, all other unrelated data regions are noise and should be deleted, which also improves processing efficiency.
Extensive experimental observation and analysis show that web page interactive interface areas generally have the following characteristics:
(1) a web page interactive interface area is not located at the bottom of the Web page;
(2) a web page interactive interface area does not contain a large number of pictures or hyperlinks.
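Characteristic (2) can be quantified with two ratios between 0 and 1. The text does not give formulas for graphic element density or link density, so the definitions in this sketch are assumptions: graphic density is the share of tags in the block that are `<img>`, and link density is the share of visible text characters that sit inside `<a>` elements. The names `BlockStats` and `densities` are hypothetical.

```python
from html.parser import HTMLParser


class BlockStats(HTMLParser):
    """Collects tag, image and text counts for one block's HTML."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.images = 0
        self.text_chars = 0
        self.link_chars = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "img":
            self.images += 1
        elif tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_chars += n
        if self._anchor_depth > 0:
            self.link_chars += n


def densities(html: str):
    """Return (graphic_density, link_density) under the assumed definitions."""
    stats = BlockStats()
    stats.feed(html)
    stats.close()
    graphic = stats.images / stats.tags if stats.tags else 0.0
    link = stats.link_chars / stats.text_chars if stats.text_chars else 0.0
    return graphic, link


nav = '<ul><li><a href="/a">Home</a></li><li><a href="/b">News</a></li></ul>'
print(densities(nav))  # (0.0, 1.0)
```

A navigation bar scores a link density near 1 and an image gallery a high graphic density, while a text label next to a form field scores low on both.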
Based on these observations, the following pruning method is proposed, as shown in FIG. 3:
S41, selecting one block from the layout partitioning result set and determining whether it contains an interface element; if so, retaining the block and marking it as processed; otherwise, executing step S42;
S42, determining whether the distance between the block and the bottom of the Web page is smaller than a preset threshold; if so, deleting the block; otherwise, executing step S43;
S43, determining whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, executing step S44; otherwise, deleting the block;
S44, determining whether the graphic element density of the block is greater than a preset proportionality coefficient γ1; if so, deleting the block; otherwise, further determining whether the link density of the block is greater than a preset threshold γ2; if so, deleting the block; otherwise, retaining the block and marking it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;
Intuitively, a block that contains no interactive interface element cannot be an interactive interface area, but it may in fact be part of one. Such a block semantically annotates the related interface elements in the interactive interface area and so provides semantic information for the interface; this information is important for the subsequent identification and classification of deep Web query interfaces and must be retained. Extensive experimental observation shows that interface semantic information is usually presented as text nodes, so whether a block provides semantic information can be judged by whether it contains a text node. Furthermore, not every block that contains text nodes carries interface semantic information; measuring the block's graphic element density and link density rules out blocks that are likely to be the body-text or navigation areas of a web page.
Specifically, the flow of the block pruning method is shown in FIG. 3. First, the layout partitioning result set is checked for unprocessed blocks; if there are none, the set of remaining blocks is output. If there are, an unprocessed block EB is selected from the result set and checked for interface elements. If it contains one, it is retained and marked as processed. If not, the distance between EB and the bottom of the page is checked; if it is smaller than the threshold, EB is deleted directly. If the distance is greater than or equal to the threshold, EB is checked for text nodes; if it contains none, it is deleted. If it does contain a text node, EB is deleted when its graphic element density is greater than γ1 or its link density is greater than γ2, and is otherwise retained and marked as processed.
S45, checking whether any unprocessed block remains in the layout partitioning result set; if so, returning to step S41; otherwise, outputting all blocks remaining in the set.
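Steps S41 to S45 amount to a filter over the partitioning result set. The sketch below follows the order of the checks as stated; the record type `PruneBlock`, its field names and the default threshold values are hypothetical, and the per-block measurements are assumed to have been computed beforehand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PruneBlock:
    name: str
    has_interface_element: bool
    distance_to_bottom: float   # distance to the bottom of the page
    has_text_node: bool
    graphic_density: float      # in [0, 1]
    link_density: float         # in [0, 1]


def keep_block(b: PruneBlock, bottom_threshold: float = 100.0,
               gamma1: float = 0.5, gamma2: float = 0.5) -> bool:
    if b.has_interface_element:                  # S41: keep immediately
        return True
    if b.distance_to_bottom < bottom_threshold:  # S42: too close to page bottom
        return False
    if not b.has_text_node:                      # S43: no semantic text
        return False
    if b.graphic_density > gamma1:               # S44: mostly pictures
        return False
    return b.link_density <= gamma2              # S44: mostly links -> delete


def prune(blocks: List[PruneBlock], **thresholds) -> List[PruneBlock]:
    # S45: visit every block once and output the ones that remain.
    return [b for b in blocks if keep_block(b, **thresholds)]


sample = [
    PruneBlock("search", True, 600, True, 0.0, 0.0),
    PruneBlock("footer", False, 20, True, 0.0, 0.9),
    PruneBlock("spacer", False, 500, False, 0.0, 0.0),
    PruneBlock("gallery", False, 500, True, 0.8, 0.1),
    PruneBlock("nav", False, 500, True, 0.0, 0.9),
    PruneBlock("label", False, 500, True, 0.1, 0.0),
]
print([b.name for b in prune(sample)])  # ['search', 'label']
```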
S5, performing block reconstruction.
Once all blocks are marked as processed, partitioning and pruning are complete. Block reconstruction follows and serves two purposes.
First, it corrects over-partitioning in the layout partitioning stage. In that stage, heuristic rules decide whether a block should be divided further; however, because the problem is heterogeneous and unstructured, the rules cannot cover every case, and further processing is needed to reduce errors. In particular, if the interface element density of the block corresponding to an interactive interface area does not meet the threshold, that area may be divided further, resulting in erroneous division. To remedy the incomplete interface information caused by over-partitioning, closely related blocks are merged and reconstructed, which finally locates the web page interactive interface area.
Second, it further screens the blocks that contain no interactive interface elements, selecting those more likely to carry interactive interface area information.
Block reconstruction mainly uses the visual design features among blocks. When designing a web page, designers usually divide the page's data regions by visual features, so these features provide good guidance for Web data mining.
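The text does not state which visual features decide that two blocks are closely related, so the following is only one possible sketch. It assumes that vertically adjacent blocks with the same background color and a small gap belong together, merges them, and then drops any merged group that still contains no interface element. `Region`, `reconstruct` and `max_gap` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    top: float
    bottom: float
    background: str
    has_interface_element: bool


def reconstruct(regions: List[Region], max_gap: float = 10.0) -> List[Region]:
    merged: List[Region] = []
    for r in sorted(regions, key=lambda reg: reg.top):
        prev = merged[-1] if merged else None
        # Assumed closeness test: same background and a small vertical gap.
        if (prev is not None and r.background == prev.background
                and r.top - prev.bottom <= max_gap):
            merged[-1] = Region(
                prev.top,
                max(prev.bottom, r.bottom),
                prev.background,
                prev.has_interface_element or r.has_interface_element,
            )
        else:
            merged.append(r)
    # Second purpose: keep only groups that ended up with interface elements.
    return [m for m in merged if m.has_interface_element]


pieces = [
    Region(100, 120, "#eee", False),   # text label split off from its inputs
    Region(122, 160, "#eee", True),    # the input fields
    Region(400, 420, "#fff", False),   # unrelated text block
]
print(reconstruct(pieces))
```

Here the label and the input fields are merged into one region spanning 100 to 160, and the unrelated text block is discarded.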
S6, outputting the interactive interface.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Those skilled in the art can, based on the teachings disclosed here, make various other specific modifications and combinations without departing from the spirit of the invention, and such modifications and combinations remain within the scope of protection of the invention.

Claims (2)

1. A deep Web query interface detection method, characterized by comprising the following steps:
S1, inputting a web page URL;
S2, rendering the web page, with a layout rendering engine converting the presentation of each visible HTML tag into a box model;
S3, performing layout partitioning;
S4, performing block pruning, which is implemented as follows:
S41, selecting one block from the layout partitioning result set and determining whether it contains an interface element; if so, retaining the block and marking it as processed; otherwise, executing step S42;
S42, determining whether the distance between the block and the bottom of the Web page is smaller than a preset threshold; if so, deleting the block; otherwise, executing step S43;
S43, determining whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, executing step S44; otherwise, deleting the block;
S44, determining whether the graphic element density of the block is greater than a preset proportionality coefficient γ1; if so, deleting the block; otherwise, further determining whether the link density of the block is greater than a preset threshold γ2; if so, deleting the block; otherwise, retaining the block and marking it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;
S45, checking whether any unprocessed block remains in the layout partitioning result set; if so, returning to step S41; otherwise, outputting all blocks remaining in the set;
S5, performing block reconstruction;
S6, outputting the interactive interface.
2. The deep Web query interface detection method according to claim 1, characterized in that the layout partitioning in step S3 follows these rules:
(1) if the HTML element tag corresponding to block EB is <form>, the current block is not divided;
(2) if the interface element density of block EB is below a preset threshold, the current block is divided; otherwise, it is further determined whether the current block contains multiple interactive interface areas, and the block is divided if it does and left undivided if it does not;
(3) if block EB contains sub-blocks with different background colors, the current block is divided;
(4) if block EB contains a separator sub-block, the current block is divided with the separator as the boundary.
CN201810971193.3A 2018-08-24 2018-08-24 Web deep network query interface detection method Active CN109086450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810971193.3A CN109086450B (en) 2018-08-24 2018-08-24 Web deep network query interface detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810971193.3A CN109086450B (en) 2018-08-24 2018-08-24 Web deep network query interface detection method

Publications (2)

Publication Number Publication Date
CN109086450A (en) 2018-12-25
CN109086450B (en) 2021-08-27

Family

ID=64794531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810971193.3A Active CN109086450B (en) 2018-08-24 2018-08-24 Web deep network query interface detection method

Country Status (1)

Country Link
CN (1) CN109086450B (en)

Also Published As

Publication number Publication date
CN109086450A (en) 2018-12-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant