CN109086450A - Web deep network query interface detection method - Google Patents

Web deep network query interface detection method

Info

Publication number
CN109086450A
Authority
CN
China
Prior art keywords
block
interface
webpage
web
interactive interface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810971193.3A
Other languages
Chinese (zh)
Other versions
CN109086450B (en)
Inventor
于富财
涂轶文
章俊
费高雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201810971193.3A (patent CN109086450B)
Publication of CN109086450A
Application granted
Publication of CN109086450B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management
    • G06F8/72: Code refactoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for detecting deep Web query interfaces, comprising the following steps: S1, input the URL of a web page; S2, render the web page, using a layout rendering engine to convert the display of the visible HTML tags into box models; S3, divide the layout into blocks; S4, prune the blocks; S5, reconstruct the blocks; S6, output the interactive interfaces. The invention mainly uses the layout and style features of web page data to divide the page into regions and, by applying corresponding processing rules, locates the interactive interfaces of the page. The invention also proposes an improved method that combines interface structure features with text features, addressing the low classification accuracy and poor adaptability that result from classifying on structure features alone. In experimental tests, the interactive interface localization method achieved very high localization accuracy, and the improved interface classification feature set achieved better classification results.

Description

Web deep network query interface detection method
Technical field
The present invention relates to a method for detecting deep Web query interfaces.
Background
With the rapid development of the Internet, the Internet has become closely tied to daily life. To meet people's growing everyday needs, the Internet has in turn gained strong momentum, and ever more information is published on the network. According to a 2003 research report on network size, the network held more than 2,000,000,000 GB of data and was growing rapidly and steadily. How to mine and use this Web information effectively is an important topic in Internet data mining.
According to how difficult Internet data is to obtain, web information can be divided into two broad classes: the surface Web and the deep Web (also called the invisible Web). The surface Web is the part of Web information that traditional search engines can retrieve; it is usually static and is often embedded in other Web pages as hyperlinks. Deep Web information is mainly the web information that traditional search engines cannot find and retrieve. In general, it covers four kinds of content. First, information stored in a site's back-end Web database (WDB); this information is generated dynamically by the site's back end only after a query form is filled in and a specific query is submitted to the site. Second, Web pages that no hyperlink points to; because nothing links to them, traditional search engines cannot index them, and according to statistics such pages account for about 21.3% of the total. Third, access-restricted content; these pages may be inaccessible because of policy rules, or may require user registration and permission. Fourth, non-web-page files on the Web that cannot be accessed, mainly image files, PDF documents, and Word documents.
According to how deep Web information is obtained, deep Web data mining takes two forms:
(1) Deep Web data integration. This approach mainly applies data integration techniques.
Deep Web data integration has three main stages: WDB query interface integration, WDB query submission, and WDB query result processing. WDB query interface integration: an improved traditional web crawler discovers and obtains WDB query interfaces; each interface is then interpreted and its schema extracted; the schemas of query interfaces from different WDBs in the same domain are matched and merged into a unified integrated query interface for that domain. WDB query submission: the user fills in the integrated query interface; the WDBs to query are selected in the background, the user's query conditions are converted into the query conditions of each selected WDB, and a targeted query is submitted to each one. WDB query result processing: the returned WDB result pages are analyzed, the structured back-end data is extracted, and the query results are semantically annotated; finally, the results from all WDBs are deduplicated and merged, and the final result is returned to the user, completing the deep Web query. Deep Web data integration is a domain-oriented extraction approach: throughout the process, the user only has to fill in one integrated query interface to query multiple databases in the same domain.
(2) Surfacing deep Web information. This approach to mining deep Web information mainly builds on the infrastructure of traditional search engines. It differs from deep Web integration in two main ways: in surfacing, query interface forms are filled in automatically, without human involvement; and queries are submitted in advance to obtain the results, which are then indexed as static HTML pages, much as in a traditional search engine.
Discovering deep Web query interfaces is the first task of deep Web information mining, and the accuracy and coverage of interface discovery directly affect the effectiveness of subsequent processing. Research on schema extraction for deep Web query interfaces has made considerable progress: work based on individual interface elements is gradually giving way to work based on the relationships between element groups, and rule-based work is gradually giving way to work based on machine learning. For deep Web integration, schema extraction based on element groups benefits subsequent schema matching and integration. For surfacing deep Web information, it helps an automatic form-filling crawler understand the form; in particular, when constraints exist between interface elements, grouped schema extraction lets the crawler narrow the target search space and raises the hit rate of valid form queries. Extraction based on machine learning helps improve the adaptability of the algorithm. How to make effective use of the presentation and design information of Web pages to devise a more adaptable and more stable schema extraction algorithm for deep Web query interfaces is the key to improving extraction accuracy and efficiency, and is also the focus and difficulty of current research.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and provide a method for detecting deep Web query interfaces that uses the layout and style features of web page data to divide the page into regions, applies corresponding processing rules to locate the interactive interfaces of the page, and achieves very high localization accuracy.
The object of the invention is achieved by the following technical solution: a method for detecting deep Web query interfaces, comprising the following steps:
S1, input the URL of a web page;
S2, render the web page, using a layout rendering engine to convert the display of the visible HTML tags into box models;
S3, divide the layout into blocks;
S4, prune the blocks;
S5, reconstruct the blocks;
S6, output the interactive interfaces.
Further, the rules for dividing the layout into blocks in step S3 are:
(1) if the HTML element tag corresponding to block EB is <form>, the current block is not divided;
(2) if the interface element density of block EB is below a preset threshold, the current block is divided; otherwise, it is further determined whether the region covered by the current block contains multiple interactive interface regions: if so, the current block is divided, and if not, it is not divided;
(3) if block EB has sub-blocks with different background colors, the current block is divided;
(4) if block EB has a separator sub-block, the current block is divided at the separator.
Further, block pruning in step S4 is implemented as follows:
S41, select a block from the result set of layout blocking and determine whether it contains an interface element; if so, retain the block and mark it as processed; otherwise, perform step S42;
S42, determine whether the distance between the block and the bottom of the Web page is less than a preset threshold; if so, delete the block; otherwise, perform step S43;
S43, determine whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, perform step S44; otherwise, delete the block;
S44, determine whether the image element density of the block is greater than a preset proportionality coefficient γ1; if so, delete the block; otherwise, further determine whether the link density of the block is greater than a preset threshold γ2; if so, delete the block; otherwise, retain the block and mark it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;
S45, check whether any unprocessed block remains in the layout block set; if so, return to step S41; otherwise, output all blocks remaining in the layout block set.
The beneficial effects of the invention are:
1. For the problem of locating deep Web query interfaces, the invention proposes a novel method that locates web page interactive interfaces from the visual information of the page design. The method mainly uses the layout and style features of web page data to divide the page into regions and, by applying corresponding processing rules, locates the interactive interfaces. It avoids the limitation of earlier methods, which relied on the <form> tag to locate interfaces.
2. For the problem of identifying deep Web query interfaces, the invention proposes an improved method that combines interface structure features with text features, addressing the low classification accuracy and poor adaptability that result from classifying on structure features alone. In experimental tests, the interactive interface localization method achieved very high localization accuracy, and the improved interface classification feature set achieved better classification results.
Brief description of the drawings
Fig. 1 is a flow chart of the deep Web query interface detection method of the invention;
Fig. 2 is a schematic diagram of the element box model;
Fig. 3 is a flow chart of the block pruning of the invention.
Detailed description
The invention addresses the problem of deep Web query interface discovery, focusing on two sub-problems: locating web page interactive interfaces and identifying deep Web query interfaces. The invention proposes a method for locating deep Web query interfaces based on visual information, which effectively avoids the dependence of earlier localization methods on the <form> tag. Building on this localization method, the invention then extends the study of deep Web query interface identification, combining the structure features and text features of web page interactive interfaces to identify query interfaces, which mitigates the low classification accuracy caused by using interface structure features alone. The technical solution of the invention is further described below with reference to the drawings.
As shown in Fig. 1, a method for detecting deep Web query interfaces comprises the following steps:
S1, input the URL of a web page;
S2, render the web page, using a layout rendering engine to convert the display of the visible HTML tags into box models;
(1) Element box model: the element box model is the way an HTML element styled by CSS is presented. After the browser applies styles, the concrete appearance of every visible HTML element is characterized by a box model as shown in Fig. 2. Content is the actual content of the HTML element; Padding is the inner spacing immediately around the content, which mainly shows the element's background; Border is the frame around the padding; Margin is the outer spacing, which is transparent by default.
According to the W3C design specification, after the browser's layout rendering engine processes an HTML page, every visible HTML tag is laid out on the page as a box model, producing the actual page that the user sees.
(2) Interactive interface element: an interactive interface element, as defined here, is an HTML element that carries interface information. Specifically, it is a member of the element set ES = {input, radio, checkbox, text, select, textarea, button}, where radio, checkbox, and text are specific types of <input> element, listed separately to highlight their importance. In general, a web page interactive interface contains at least one kind of element from ES. The set ES is in fact a subset of all web page interactive interface elements; choosing a group of elements common to most interfaces as the localization reference reduces the complexity of the problem.
(3) Interactive interface region: an interactive interface region is the rectangular region of the smallest box model that contains all the interactive interface elements. In particular, for a form, it is the rectangular region of the box model of the <form> tag element.
The problem that web page interactive interface localization solves is finding the interactive interface regions. This follows from a rule of web page design: nested HTML tags are generally presented in the page layout as nested box models. Web page rendering is the first step of the localization method. The URL of the page is input first; the layout rendering engine then converts the display of the visible HTML tags into box models, and all subsequent steps operate on the resulting box models.
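The three definitions above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names Rect, ElementBox, is_interface_element and interface_area are hypothetical, padding and border are assumed to be uniform on all four sides, and the rendered boxes are assumed to be supplied by some layout engine that is not shown.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Tag names drawn from the set ES in the description. "radio", "checkbox" and
# "text" are <input> types, so the "input" tag already covers them.
ES_TAGS = frozenset({"input", "select", "textarea", "button"})


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)

    def contains(self, other: "Rect") -> bool:
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass(frozen=True)
class ElementBox:
    tag: str
    content: Rect          # content rectangle, assumed to come from the renderer
    padding: float = 0.0   # uniform widths on all sides (a simplification)
    border: float = 0.0
    margin: float = 0.0

    def border_rect(self) -> Rect:
        """Content grown by padding and border; the transparent margin is left out."""
        d = self.padding + self.border
        c = self.content
        return Rect(c.left - d, c.top - d, c.right + d, c.bottom + d)


def is_interface_element(box: ElementBox) -> bool:
    return box.tag.lower() in ES_TAGS


def interface_area(boxes: Sequence[ElementBox]) -> Optional[Rect]:
    """Rectangle of the smallest box that encloses every interface element."""
    targets = [b.border_rect() for b in boxes if is_interface_element(b)]
    if not targets:
        return None
    enclosing = [r for r in (b.border_rect() for b in boxes)
                 if all(r.contains(t) for t in targets)]
    return min(enclosing, key=Rect.area, default=None)
```

Choosing the smallest enclosing box, rather than the plain bounding rectangle of the interface elements, follows the wording of definition (3); when the elements sit inside a <form>, the form's box is normally the one selected.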
S3, divide the layout into blocks. The division method proposed by the invention mainly uses the style and layout features of blocks. Whether a block needs to be divided is decided from the degree of association between the sub-blocks it contains and from the features of the block itself. The heuristic rules applied to division are described below:
(1) if the HTML element tag corresponding to block EB is <form>, the current block is not divided;
Rule (1) reflects the traditional localization method based on the <form> tag. The purpose of division is to find the elements that belong to the same interactive interface; a region located by the traditional <form>-based method is, to a certain extent, already a web page interactive interface, so it need not be divided further.
(2) if the interface element density of block EB is below a preset threshold, the current block is divided; otherwise, it is further determined whether the region covered by the current block contains multiple interactive interface regions: if so, the current block is divided, and if not, it is not divided;
Rule (2) mainly uses the layout features of web page interactive interfaces: in a block that is an interactive interface, the interface element sub-blocks should make up a considerable proportion of its content. If the interface element density of a block is low, the block probably contains a large amount of interfering (non-interface) information and needs to be divided, to reduce the influence of that information on the subsequent identification and classification of deep Web query interfaces and to reduce experimental error. If the interface element density is high, it must still be determined whether the region covered by the block contains multiple interactive interface regions.
(3) if block EB has sub-blocks with different background colors, the current block is divided;
(4) if block EB has a separator sub-block, the current block is divided at the separator.
Rule (3) is based on an observation about web page design: parts of a page with different background colors generally have different semantics. When designing a page, designers usually distinguish semantic blocks in this way so that users can find the content they are interested in at a glance. Rule (4) is similar to rule (3): a separator usually indicates that the blocks on either side of it have different semantics. Rules (3) and (4) judge from a visual perspective whether the current block contains multiple semantic blocks, which to some extent solves the problem of separating densely arranged interactive interface regions.
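The four heuristic rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the method as implemented by the inventors: the Block type and all function names are hypothetical; the interface element density is assumed to be the share of leaf sub-blocks that are interface elements, since the text does not define it; the 0.3 threshold is arbitrary; and rules (3) and (4) are assumed to serve as the visual test for whether a dense block contains multiple interactive interface regions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    tag: str
    background: Optional[str] = None   # rendered background colour, if known
    is_interface: bool = False         # True for a leaf that is an interface element
    is_separator: bool = False         # True for a visual separator such as <hr>
    children: List["Block"] = field(default_factory=list)


def leaves(block: Block) -> List[Block]:
    if not block.children:
        return [block]
    return [leaf for child in block.children for leaf in leaves(child)]


def interface_density(block: Block) -> float:
    """Assumed definition: share of leaf sub-blocks that are interface elements."""
    found = leaves(block)
    return sum(leaf.is_interface for leaf in found) / len(found)


def should_divide(block: Block, density_threshold: float = 0.3) -> bool:
    if block.tag.lower() == "form":                        # rule (1)
        return False
    if not block.children:                                 # a leaf cannot be divided
        return False
    if interface_density(block) < density_threshold:       # rule (2), low density
        return True
    # Rule (2), high density: rules (3) and (4) act as the visual test.
    mixed_backgrounds = len({c.background for c in block.children
                             if not c.is_separator}) > 1          # rule (3)
    has_separator = any(c.is_separator for c in block.children)   # rule (4)
    return mixed_backgrounds or has_separator


def split_children(block: Block) -> List[Block]:
    """Split at separators when present (rule 4), otherwise into direct children."""
    if not any(c.is_separator for c in block.children):
        return list(block.children)
    groups: List[List[Block]] = [[]]
    for child in block.children:
        if child.is_separator:
            groups.append([])
        else:
            groups[-1].append(child)
    return [g[0] if len(g) == 1 else Block("group", children=g)
            for g in groups if g]


def layout_blocks(block: Block, density_threshold: float = 0.3) -> List[Block]:
    """Recursively divide a block until no rule asks for further division."""
    if not should_divide(block, density_threshold):
        return [block]
    return [part for piece in split_children(block)
            for part in layout_blocks(piece, density_threshold)]
```

In this sketch, a region with few interface elements is divided all the way down to its leaves, which is the kind of over-division that the reconstruction step S5 is meant to repair.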
S4, prune the blocks. Dividing the page layout into blocks gives a preliminary partition of the page into multiple visual regions. The aim of the invention is to find the potential regions of deep Web query interfaces, that is, to locate and identify the web page interactive interface regions among the many data regions. For this method, all other unrelated data regions are noise and are deleted, which improves processing efficiency.
Observation and analysis over a large number of experiments show that web page interactive interface regions usually have the following characteristics:
(1) a web page interactive interface region is not located at the bottom of the Web page;
(2) a web page interactive interface region does not contain a large number of images and hyperlinks.
Based on these observations, the following pruning method is proposed, as shown in Fig. 3:
S41, select a block from the result set of layout blocking and determine whether it contains an interface element; if so, retain the block and mark it as processed; otherwise, perform step S42;
S42, determine whether the distance between the block and the bottom of the Web page is less than a preset threshold; if so, delete the block; otherwise, perform step S43;
S43, determine whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, perform step S44; otherwise, delete the block;
S44, determine whether the image element density of the block is greater than a preset proportionality coefficient γ1; if so, delete the block; otherwise, further determine whether the link density of the block is greater than a preset threshold γ2; if so, delete the block; otherwise, retain the block and mark it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;
A block that contains no interactive interface element intuitively cannot be an interactive interface region, but in practice it may be part of one: its role within the region is to label the related interface elements semantically and so give the interface its semantic information. This information is important for the subsequent identification and classification of deep Web query interfaces, so such blocks need to be retained. A large number of experiments show that this interface semantic information is usually presented as text nodes, so whether a block provides semantic information can be judged by whether it has a text node. Furthermore, even if a block has text nodes, they are not necessarily interactive interface semantic information; measuring the block's image element density and link density rules out the possibility that the block is the body text or navigation area of some kind of page.
Specifically, the block pruning process is shown in Fig. 3. First, check whether the result set of layout blocking contains an unprocessed block; if not, output the remaining block set. If it does, choose an unprocessed block EB from the result set and check whether it has an interface element. If it does, retain the block and mark it as processed. If it does not, check whether its distance from the bottom of the page is less than the threshold, and delete the block if so. If the distance is greater than or equal to the threshold, check whether the block contains a text node, and delete the block if it does not. If it does contain a text node, delete the block if its image element density is greater than γ1 or its link density is greater than γ2; otherwise retain it and mark it as processed.
S45, check whether any unprocessed block remains in the layout block set; if so, return to step S41; otherwise, output all blocks remaining in the layout block set.
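Steps S41 to S45 can be expressed as a single pass over the block set, as in the Python sketch below. The LayoutBlock fields and the function name prune_blocks are hypothetical. The text does not say how the image element density and link density are computed, so they are assumed here to be precomputed ratios in [0, 1], and iterating over each block exactly once takes the place of the "processed" mark.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class LayoutBlock:
    name: str
    has_interface_element: bool
    distance_to_bottom: float   # distance between the block and the page bottom
    has_text_node: bool
    image_density: float        # assumed precomputed, in [0, 1]
    link_density: float         # assumed precomputed, in [0, 1]


def prune_blocks(blocks: Iterable[LayoutBlock],
                 bottom_threshold: float,
                 gamma1: float,
                 gamma2: float) -> List[LayoutBlock]:
    """Return the blocks that survive steps S41 to S44."""
    if not (0.0 <= gamma1 <= 1.0 and 0.0 <= gamma2 <= 1.0):
        raise ValueError("gamma1 and gamma2 must lie in [0, 1]")
    kept: List[LayoutBlock] = []
    for block in blocks:                                   # S41 / S45: one visit per block
        if block.has_interface_element:                    # S41: keep interface blocks
            kept.append(block)
        elif block.distance_to_bottom < bottom_threshold:  # S42: too close to page bottom
            continue
        elif not block.has_text_node:                      # S43: no semantic text
            continue
        elif block.image_density > gamma1 or block.link_density > gamma2:  # S44
            continue
        else:
            kept.append(block)                             # likely a label for an interface
    return kept
```

A block that contains an interface element is kept even when it lies near the page bottom, because S41 is checked before S42 in the order given above.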
S5, reconstruct the blocks.
Once all blocks have been marked as processed, block division and pruning are complete. Block reconstruction follows, and it has two purposes:
First, to correct over-division in the layout blocking stage. The layout blocking stage uses heuristic rules to decide whether a block needs further division, but because the problem itself is heterogeneous and unstructured, the rules cannot cover every case, so further processing is needed to reduce error. In particular, when the interface element density of the block corresponding to an interactive interface region does not meet the requirement, the block may be divided further, causing an erroneous division. To solve the incomplete interactive interface information that such over-division causes, closely related blocks are merged and reconstructed, finally locating the web page interactive interface regions.
Second, to further screen the blocks that contain no interactive interface element and select those more likely to carry interactive interface region information.
Block reconstruction mainly uses the visual design features between blocks. When designing a page, designers usually partition the page's data regions by intuitive visual features, and these visual design features provide good guidance for web data mining.
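The text does not specify how closely related blocks are identified, so the following Python sketch substitutes one simple visual feature, spatial proximity, purely for illustration. The Region type, the gap measure and the max_gap parameter are assumptions of this sketch and are not taken from the patent. Blocks are merged while at least one of a pair holds an interface element and the two lie within max_gap of each other; blocks without interface elements that are never merged are then screened out.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Region:
    left: float
    top: float
    right: float
    bottom: float
    has_interface_element: bool


def gap(a: Region, b: Region) -> float:
    """Shortest distance between two rectangles (0 when they touch or overlap)."""
    dx = max(a.left - b.right, b.left - a.right, 0.0)
    dy = max(a.top - b.bottom, b.top - a.bottom, 0.0)
    return math.hypot(dx, dy)


def union(a: Region, b: Region) -> Region:
    return Region(min(a.left, b.left), min(a.top, b.top),
                  max(a.right, b.right), max(a.bottom, b.bottom),
                  a.has_interface_element or b.has_interface_element)


def reconstruct(regions: Sequence[Region], max_gap: float) -> List[Region]:
    """Merge nearby blocks around interface elements, then drop unmerged text blocks."""
    merged = list(regions)
    changed = True
    while changed:                       # each merge removes one region, so this ends
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if ((a.has_interface_element or b.has_interface_element)
                        and gap(a, b) <= max_gap):
                    merged[i] = union(a, b)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return [r for r in merged if r.has_interface_element]
```

Proximity alone can also join two distinct interfaces that happen to sit close together; a fuller treatment would presumably combine it with the background color and separator cues used in step S3.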
S6, output the interactive interfaces.
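The order of steps S1 to S6 can be summarized as a small pipeline skeleton in Python. The function detect_query_interfaces and its four stage parameters are hypothetical; each stage is passed in as a callable, so the sketch makes no assumption about which rendering engine or which rule implementations are used.

```python
from typing import Callable, List, TypeVar

Boxes = TypeVar("Boxes")
Blocks = TypeVar("Blocks")
Interface = TypeVar("Interface")


def detect_query_interfaces(
    url: str,                                          # S1: page URL
    render: Callable[[str], Boxes],                    # S2: URL -> box models
    divide: Callable[[Boxes], Blocks],                 # S3: box models -> layout blocks
    prune: Callable[[Blocks], Blocks],                 # S4: drop noise blocks
    reconstruct: Callable[[Blocks], List[Interface]],  # S5: merge and screen blocks
) -> List[Interface]:
    """Run the stages in the order S1..S6 and return the located interfaces (S6)."""
    boxes = render(url)
    blocks = divide(boxes)
    kept = prune(blocks)
    return reconstruct(kept)
```

Passing the stages as parameters keeps each step replaceable, for example to try different thresholds in S3 and S4 without touching the rest of the flow.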
Those of ordinary skill in the art will understand that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Based on the technical teachings disclosed by the invention, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.

Claims (3)

1. A method for detecting deep Web query interfaces, characterized by comprising the following steps:
S1, input the URL of a web page;
S2, render the web page, using a layout rendering engine to convert the display of the visible HTML tags into box models;
S3, divide the layout into blocks;
S4, prune the blocks;
S5, reconstruct the blocks;
S6, output the interactive interfaces.
2. The method for detecting deep Web query interfaces according to claim 1, characterized in that the rules for dividing the layout into blocks in step S3 are:
(1) if the HTML element tag corresponding to block EB is <form>, the current block is not divided;
(2) if the interface element density of block EB is below a preset threshold, the current block is divided; otherwise, it is further determined whether the region covered by the current block contains multiple interactive interface regions: if so, the current block is divided, and if not, it is not divided;
(3) if block EB has sub-blocks with different background colors, the current block is divided;
(4) if block EB has a separator sub-block, the current block is divided at the separator.
3. The method for detecting deep Web query interfaces according to claim 1, characterized in that block pruning in step S4 is implemented as follows:
S41, select a block from the result set of layout blocking and determine whether it contains an interface element; if so, retain the block and mark it as processed; otherwise, perform step S42;
S42, determine whether the distance between the block and the bottom of the Web page is less than a preset threshold; if so, delete the block; otherwise, perform step S43;
S43, determine whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, perform step S44; otherwise, delete the block;
S44, determine whether the image element density of the block is greater than a preset proportionality coefficient γ1; if so, delete the block; otherwise, further determine whether the link density of the block is greater than a preset threshold γ2; if so, delete the block; otherwise, retain the block and mark it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;
S45, check whether any unprocessed block remains in the layout block set; if so, return to step S41; otherwise, output all blocks remaining in the layout block set.
CN201810971193.3A 2018-08-24 2018-08-24 Web deep network query interface detection method Active CN109086450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810971193.3A CN109086450B (en) 2018-08-24 2018-08-24 Web deep network query interface detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810971193.3A CN109086450B (en) 2018-08-24 2018-08-24 Web deep network query interface detection method

Publications (2)

Publication Number Publication Date
CN109086450A true CN109086450A (en) 2018-12-25
CN109086450B CN109086450B (en) 2021-08-27

Family

ID=64794531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810971193.3A Active CN109086450B (en) 2018-08-24 2018-08-24 Web deep network query interface detection method

Country Status (1)

Country Link
CN (1) CN109086450B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11914598B2 (en) * 2022-05-27 2024-02-27 Sap Se Extended synopsis pruning in database management systems

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060031202A1 (en) * 2004-08-06 2006-02-09 Chang Kevin C Method and system for extracting web query interfaces
US20090077180A1 (en) * 2007-09-14 2009-03-19 Flowers John S Novel systems and methods for transmitting syntactically accurate messages over a network
CN101419625A (en) * 2008-12-02 2009-04-29 西安交通大学 Deep web self-adapting crawling method based on minimum searchable mode
CN101667201A (en) * 2009-09-18 2010-03-10 浙江大学 Integration method of Deep Web query interface based on tree merging
CN102135976A (en) * 2010-09-27 2011-07-27 华为技术有限公司 Hypertext markup language page structured data extraction method and device
CN103092913A (en) * 2012-11-29 2013-05-08 江苏瑞中数据股份有限公司 Method for judging deep web query interface by adopting iterative naive Bayesian classifier
CN103678490A (en) * 2013-11-14 2014-03-26 桂林电子科技大学 Deep Web query interface clustering method based on Hadoop platform
US20150106355A1 (en) * 2010-06-18 2015-04-16 Deep Web Technologies, Inc. Browser based multilingual federated search

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
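Zhou Erhu et al.: "Application of query result processing technology based on Deep Web retrieval", Computer Engineering and Design *
Tan Tao et al.: "A deep-web-based personalized information crawling method", Computer Knowledge and Technology *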

Also Published As

Publication number Publication date
CN109086450B (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN101894134B (en) Spatial layout-based phishing webpage detection and implementation method
CN103823824B (en) Method and system for automatically building a text classification corpus from the Internet
US20100169301A1 (en) System and method for aggregating and ranking data from a plurality of web sites
US7379932B2 (en) System and a method for focused re-crawling of Web sites
CN103853738B (en) Method for identifying the relevant region of web page information
US20120303645A1 (en) System and method for extraction of structured data from arbitrarily structured composite data
Rastan et al. TEXUS: A unified framework for extracting and understanding tables in PDF documents
CN107577783A (en) Automatic web page type identification method based on Web structural feature mining
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
CN103559234B (en) System and method for automated semantic annotation of RESTful Web services
EP2057557A2 (en) Joint optimization of wrapper generation and template detection
CN101751438A (en) Adaptive semantics-driven topic web page filtering system
CN102170446A (en) Phishing webpage detection method based on spatial layout and visual features
Ji et al. Tag tree template for Web information and schema extraction
CN106503211A (en) Method for automatically generating the mobile version of information publishing websites
CN110413784A (en) Public opinion association analysis method and system based on knowledge graph
CN108733813A (en) Information extraction method, system, and medium for BBS forum web page content
CN101350019B (en) Method for extracting web page information based on vector model between predefined slots
CN108009215A (en) Method, apparatus, and system for evaluating user behavior patterns on search result pages
CN109086450A (en) Web deep network query interface detection method
CN109213538A (en) Method and device for extracting list page information
CN102460440B (en) Searching methods and devices
Weninger et al. Unexpected results in automatic list extraction on the web
CN103942224B (en) Method and device for obtaining the marking rule of a web page release
Malerba et al. Machine learning for reading order detection in document image understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant