CN109086450A - Deep Web query interface detection method
- Publication number: CN109086450A
- Application number: CN201810971193.3A
- Authority: CN (China)
- Prior art keywords: block, interface, webpage, web, interactive interface
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/72—Code refactoring
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a Deep Web query interface detection method comprising the following steps: S1, input the URL of a web page; S2, render the web page: a layout rendering engine converts the presentation of the visible HTML tags into box models; S3, segment the layout into blocks; S4, prune the blocks; S5, reconstruct the blocks; S6, output the interactive interface. The invention mainly uses the layout and style features of web page data to divide the page into regions and, by formulating corresponding processing rules, locates the interactive interfaces of the page. The invention also proposes an improved method that combines interface structure features with text features, which mitigates the low classification accuracy and weak adaptability that result from relying on structure features alone. In experimental tests, the interactive interface localization method achieved very high localization accuracy, and the improved interface classification feature set achieved better classification results.
Description
Technical field
The present invention relates to a Deep Web query interface detection method.
Background
With its rapid development, the Internet has become closely bound up with daily life. To meet people's growing everyday needs, the Internet has in turn gained strong momentum, and more and more information is appearing online. According to a 2003 study of network size, the network held more than 2 billion GB of data and was growing rapidly and steadily. How to mine and use this Web information effectively is an important topic in Internet data mining.
According to how difficult the data is to obtain, web information can be divided into two broad classes: the Surface Web and the Deep Web (also called the Invisible Web). The Surface Web is the part of Web information that traditional search engines can retrieve; it is usually static and is typically reachable through hyperlinks embedded in other Web pages. The Deep Web is mainly the part of web information that traditional search engines cannot find or retrieve. Deep Web information generally falls into four categories. First, information stored in a site's back-end database (Web Database, WDB): it is generated dynamically by the site's back end only after a query form is filled in and a specific query is submitted to the site. Second, Web pages that no hyperlink points to: because nothing links to them, traditional search engines cannot index them; according to statistics, such pages account for about 21.3% of the total. Third, access-restricted content: such pages may be inaccessible because of various policies, or may require user registration and permission. Fourth, non-web-page files on the Web that cannot be accessed, mainly image files, PDF documents, Word documents, and the like.
According to how Deep Web information is obtained, Deep Web data mining takes two forms:
(1) Deep Web data integration. This approach mainly applies data integration techniques.
Deep Web data integration has three main stages: WDB query interface integration, WDB query submission, and WDB query result processing. In the query interface integration stage, an improved traditional web crawler finds and fetches WDB query interfaces; each interface is then interpreted and its schema extracted, the schema is matched against the query interface schemas of other WDBs in the same domain, and a unified integrated query interface for that domain is assembled. In the query submission stage, the user fills in the integrated query interface; in the background, the WDBs to query are selected, the conditions the user entered are translated into the query conditions of each WDB, and a directed query is submitted to each selected WDB. In the query result processing stage, the returned WDB result pages are analyzed, the structured back-end data is extracted, the result information is semantically annotated, and finally all WDB results are deduplicated and merged before the final result is returned to the user, completing the Deep Web query. Deep Web data integration is a domain-oriented extraction approach: throughout the process the user only needs to fill in one integrated query interface to query multiple databases in the same domain.
(2) Deep Web surfacing. This approach to mining Deep Web information mainly reuses the infrastructure of traditional search engines. It differs from Deep Web integration in two main ways: in surfacing, query interface forms are filled in automatically, with no human involvement; and surfacing submits queries in advance, obtains the results, and then, like a traditional search engine, indexes the results as static HTML pages.
Discovering Deep Web query interfaces is the first task of Deep Web information mining; the precision and coverage of query interface discovery directly affect the effectiveness of all subsequent processing. Research on schema extraction for Deep Web query interfaces has made considerable progress: it is gradually shifting from studying individual interface elements to studying relationships among element groups, and from rule-based approaches to approaches based on machine learning. For Deep Web integration, schema extraction based on element groups benefits subsequent schema matching and integration. For Deep Web surfacing, it helps automatic form-filling crawlers understand forms; in particular, when constraints exist between interface elements, group-based schema extraction lets the crawler narrow the target search range, which helps raise the hit rate of valid form queries. Extraction based on machine learning helps improve the adaptability of the algorithm. How to make effective use of the design information in Web page presentation to propose more adaptable and more stable schema extraction algorithms for Deep Web query interfaces is the key to improving extraction accuracy and efficiency, and is also the focus and difficulty of current research.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and to provide a Deep Web query interface detection method that uses the layout and style features of web page data to divide the page into regions and, by formulating corresponding processing rules, locates the interactive interfaces of the page with very high localization accuracy.
The object of the invention is achieved through the following technical solution: a Deep Web query interface detection method comprising the following steps:
S1. Input the URL of the web page;
S2. Render the web page: a layout rendering engine converts the presentation of the visible HTML tags into box models;
S3. Segment the layout into blocks;
S4. Prune the blocks;
S5. Reconstruct the blocks;
S6. Output the interactive interface.
Further, the rules by which step S3 segments the layout into blocks are:
(1) If the HTML element tag corresponding to block EB is <form>, the current block is not divided;
(2) If the interface element density of block EB is below a preset threshold, the current block is divided; otherwise, it is further determined whether the region covered by the current block contains multiple interactive interface areas: if so, the current block is divided, otherwise it is not;
(3) If block EB has sub-blocks with different background colors, the current block is divided;
(4) If block EB has a separator sub-block, the current block is divided along the separator.
Further, step S4 prunes the blocks as follows:
S41. Select a block from the layout segmentation result set and determine whether it contains an interface element; if so, keep the block and mark it as processed; otherwise go to step S42;
S42. Determine whether the distance between the block and the bottom of the Web page is less than a preset threshold; if so, delete the block; otherwise go to step S43;
S43. Determine whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, go to step S44; otherwise delete the block;
S44. Determine whether the image element density of the block is greater than a preset ratio coefficient γ1; if so, delete the block; otherwise further determine whether the link density of the block is greater than a preset threshold γ2; if so, delete the block; otherwise keep the block and mark it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;
S45. Check whether the layout block set still contains an unprocessed block; if so, return to step S41; otherwise output all blocks remaining in the layout block set.
The beneficial effects of the invention are:
1. For the problem of locating Deep Web query interfaces, the invention proposes a novel method that locates web page interactive interfaces from the visual information of web page design. The method mainly uses the layout and style features of web page data to divide the page into regions and, by formulating corresponding processing rules, locates the interactive interfaces. It avoids the limitation of earlier approaches, which relied on the <form> tag to locate interfaces.
2. For the problem of identifying Deep Web query interfaces, the invention proposes an improved method that combines interface structure features with text features, mitigating the low accuracy or weak adaptability of classification based on structure features alone. In experimental tests, the interactive interface localization method achieved very high localization accuracy, and the improved interface classification feature set achieved better classification results.
Brief description of the drawings
Fig. 1 is a flowchart of the Deep Web query interface detection method of the invention;
Fig. 2 is a schematic diagram of the element box model;
Fig. 3 is a flowchart of the block pruning of the invention.
Detailed description of the embodiments
The invention addresses the problem of discovering Deep Web query interfaces and focuses on two of its sub-problems: locating web page interactive interfaces and identifying Deep Web query interfaces. It proposes a Deep Web query interface localization method based on visual information, which effectively avoids the dependence of earlier localization methods on the <form> tag. Building on this localization method, the invention then extends the study to the identification of Deep Web query interfaces, combining the structure and text features of web page interactive interfaces to identify query interfaces, in order to remedy the low classification accuracy that results from using interface structure features alone. The technical solution of the invention is further described below with reference to the drawings.
As shown in Fig. 1, a Deep Web query interface detection method comprises the following steps:
S1. Input the URL of the web page;
S2. Render the web page: a layout rendering engine converts the presentation of the visible HTML tags into box models;
(1) Element box model (Element Box Model): the element box model is the way an HTML element is presented as defined by its CSS styles. After the browser applies styles and renders the page, each visible HTML element is represented by a box model as shown in Fig. 2. Content is the actual content of the HTML element; Padding is the inner spacing directly surrounding the content, which mainly shows the element's background; Border is the frame around the padding; Margin is the outer spacing, which is transparent by default.
According to the W3C design specifications, after the browser's layout rendering engine processes an HTML page, each visible HTML tag is laid out on the page as a box model, producing the actual page that the user sees.
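The four layers of the box model can be pictured as a small record type. The sketch below is only an illustration: the class and field names (`Edges`, `ElementBox`, `border_box_width`) are hypothetical and do not come from the patent, and it assumes all sizes are in CSS pixels.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edges:
    """Widths, in CSS pixels, of the four sides of one box layer."""
    top: float = 0.0
    right: float = 0.0
    bottom: float = 0.0
    left: float = 0.0


@dataclass(frozen=True)
class ElementBox:
    """Hypothetical record for the rendered box of one visible HTML element."""
    tag: str
    content_width: float
    content_height: float
    padding: Edges = field(default_factory=Edges)
    border: Edges = field(default_factory=Edges)
    margin: Edges = field(default_factory=Edges)

    @property
    def border_box_width(self) -> float:
        # Content + padding + border; the margin lies outside the border box.
        return (self.content_width
                + self.padding.left + self.padding.right
                + self.border.left + self.border.right)

    @property
    def border_box_height(self) -> float:
        return (self.content_height
                + self.padding.top + self.padding.bottom
                + self.border.top + self.border.bottom)


box = ElementBox("input", content_width=100, content_height=20,
                 padding=Edges(2, 4, 2, 4), border=Edges(1, 1, 1, 1))
print(box.border_box_width, box.border_box_height)
```

The margin is stored but deliberately left out of the border-box size, since it separates the element from its neighbours rather than belonging to the visible box.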
(2) Interactive interface element (Interface Element): here, an interactive interface element is an HTML element that carries interface information. Specifically, it is a member of the element set ES = {input, radio, checkbox, text, select, textarea, button}, where radio, checkbox, and text are specific types of <input> element that are listed separately to emphasize their importance. In general, a web page interactive interface contains at least one kind of element in ES. ES is in fact a subset of all web page interactive interface elements; choosing a group of elements common to most interfaces as the localization reference reduces the complexity of the problem.
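Membership in ES can be checked from an element's tag name and, for <input>, its type attribute. The following is a minimal sketch under two assumptions that the patent does not state: the function name `es_member` is hypothetical, and `type="hidden"` inputs are skipped because only visible tags are rendered into box models. An <input> without a type attribute is treated as text, which is the HTML default.

```python
from typing import Optional

ES_INPUT_TYPES = frozenset({"radio", "checkbox", "text"})
ES_OTHER_TAGS = frozenset({"select", "textarea", "button"})


def es_member(tag: str, input_type: Optional[str] = None) -> Optional[str]:
    """Return the ES member name for an element, or None if it is not in ES."""
    tag = tag.lower()
    if tag == "input":
        kind = (input_type or "text").lower()
        if kind == "hidden":
            # Assumption: hidden inputs have no visible box, so they are ignored.
            return None
        # radio, checkbox and text are listed separately in ES; any other
        # <input> type falls under the generic "input" member.
        return kind if kind in ES_INPUT_TYPES else "input"
    return tag if tag in ES_OTHER_TAGS else None


print(es_member("input", "radio"), es_member("select"), es_member("div"))
```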
(3) Interactive interface area (Interface Area): the rectangular area of the smallest box model that contains all the interactive interface elements. In particular, for a form, it is the rectangular area of the <form> element's box model.
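One way to read this definition is as a search over candidate boxes for the smallest one that encloses every interface element. The sketch below follows that reading; the names `Rect`, `contains`, and `interface_area` are hypothetical, rectangles use page coordinates with the origin at the top left, and "smallest" is assumed to mean smallest area.

```python
from typing import Iterable, NamedTuple, Optional


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def contains(outer: Rect, inner: Rect) -> bool:
    """True if inner lies entirely within outer."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and outer.right >= inner.right and outer.bottom >= inner.bottom)


def area(r: Rect) -> float:
    return max(0.0, r.right - r.left) * max(0.0, r.bottom - r.top)


def interface_area(candidates: Iterable[Rect],
                   elements: Iterable[Rect]) -> Optional[Rect]:
    """Smallest candidate box that encloses all interface-element boxes."""
    elements = list(elements)
    if not elements:
        return None
    enclosing = [c for c in candidates
                 if all(contains(c, e) for e in elements)]
    return min(enclosing, key=area, default=None)


page = Rect(0, 0, 1000, 2000)
panel = Rect(100, 100, 500, 300)
fields = [Rect(110, 110, 190, 140), Rect(300, 200, 400, 230)]
print(interface_area([page, panel], fields))
```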
The problem that web page interactive interface localization must solve is finding the interactive interface areas. This follows from a convention of web page design: nested HTML tags are generally also presented as nested box models in the page layout. Web page rendering is the first step of the localization method: the URL of the web page is input, the layout rendering engine converts the presentation of the visible HTML tags into box models, and all subsequent steps operate on these box models.
S3. Segment the layout into blocks. The layout segmentation method proposed by the invention mainly uses the style and layout features of blocks. It decides whether a block needs to be divided according to the degree of association between the sub-blocks the block contains and the features of the block itself. The heuristic rules applied in the division are described below:
(1) If the HTML element tag corresponding to block EB is <form>, the current block is not divided;
Rule (1) reflects the traditional <form>-based localization method. The purpose of division is to find the elements that belong to the same interactive interface; a block located by the traditional <form> tag method is, to a certain extent, already a web page interactive interface, so it does not need to be divided further.
(2) If the interface element density of block EB is below a preset threshold, the current block is divided; otherwise, it is further determined whether the region covered by the current block contains multiple interactive interface areas: if so, the current block is divided, otherwise it is not;
Rule (2) mainly uses the layout features of web page interactive interfaces: in an interactive interface block, the interface-element sub-blocks it contains should occupy a considerable proportion of the block. If the interactive interface element density of a block is low, the block likely contains a lot of interfering (non-interface) information and needs to be divided, to reduce the influence of non-interface information on the subsequent identification and classification of Deep Web query interfaces and so reduce experimental error. If the density is high, it must still be determined whether the region covered by the block contains multiple interactive interface areas.
(3) If block EB has sub-blocks with different background colors, the current block is divided;
(4) If block EB has a separator sub-block, the current block is divided along the separator.
Rule (3) is based on an observation about web page design: web page blocks with different background colors generally have different semantics. When designing a page, designers usually distinguish semantic blocks in this way so that users can intuitively locate the content they are interested in. Rule (4) is similar to rule (3): a separator usually indicates that the blocks on either side of it have different semantics. Rules (3) and (4) make it possible to judge from a visual perspective whether the current block contains multiple semantic blocks, which to some extent solves the problem of dividing multiple densely arranged interactive interface areas.
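The four heuristic rules can be combined into a single predicate. The sketch below is one possible reading, not the patent's implementation: it assumes that interface element density is the share of a block's area covered by interface elements, that rules (3) and (4) serve as the visual test for "multiple interactive interface areas" in rule (2), and that leaf blocks are never divided. The `Block` fields and the threshold value 0.3 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    tag: str
    area: float = 0.0
    background: Optional[str] = None
    is_separator: bool = False
    is_interface_element: bool = False
    children: List["Block"] = field(default_factory=list)


def interface_area_sum(block: Block) -> float:
    """Total area of interface elements nested anywhere inside the block."""
    total = 0.0
    for child in block.children:
        if child.is_interface_element:
            total += child.area
        else:
            total += interface_area_sum(child)
    return total


def interface_density(block: Block) -> float:
    # Assumed definition: covered area divided by block area.
    return interface_area_sum(block) / block.area if block.area > 0 else 0.0


def should_split(block: Block, density_threshold: float = 0.3) -> bool:
    if block.tag.lower() == "form":
        return False                      # rule (1)
    if not block.children:
        return False                      # assumption: leaves cannot be divided
    if interface_density(block) < density_threshold:
        return True                       # rule (2), low density
    colors = {c.background for c in block.children if c.background is not None}
    if len(colors) > 1:
        return True                       # rule (3), differing backgrounds
    return any(c.is_separator for c in block.children)  # rule (4)


sparse = Block("div", area=1000, children=[
    Block("input", area=10, is_interface_element=True),
    Block("p", area=900),
])
print(should_split(sparse))
```

In a full segmenter this predicate would be applied recursively from the page's root box, replacing each block for which it returns true with its sub-blocks.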
S4. Prune the blocks. Segmenting the page layout gives a preliminary division of the page into multiple visible regions. The aim of the invention is to find the potential regions of Deep Web query interfaces, that is, to locate and identify the web page interactive interface areas among many data regions. For this method, all other unrelated data regions are noise and are deleted, which helps improve processing efficiency.
Extensive experimental observation and analysis show that web page interactive interface areas usually have the following features:
(1) A web page interactive interface area is not located at the bottom of the Web page.
(2) A web page interactive interface area does not contain a large number of images and hyperlinks.
Based on these observations, the following pruning method is proposed, as shown in Fig. 3:
S41. Select a block from the layout segmentation result set and determine whether it contains an interface element; if so, keep the block and mark it as processed; otherwise go to step S42;
S42. Determine whether the distance between the block and the bottom of the Web page is less than a preset threshold; if so, delete the block; otherwise go to step S43;
S43. Determine whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, go to step S44; otherwise delete the block;
S44. Determine whether the image element density of the block is greater than a preset ratio coefficient γ1; if so, delete the block; otherwise further determine whether the link density of the block is greater than a preset threshold γ2; if so, delete the block; otherwise keep the block and mark it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;
Regarding interactive interface elements: a block that contains no interactive interface element intuitively cannot be an interactive interface area, but in practice it may be part of one. Its role within the interactive interface area is to semantically annotate the related interface elements and provide semantic information for the interface. This information matters for the subsequent identification and classification of Deep Web query interfaces, so such blocks need to be kept. Extensive experimental observation shows that this kind of interface semantic information is usually presented as text nodes, so whether a block provides semantic information can be judged by whether it has a text node. Furthermore, even if a block has text nodes, not all of them are interactive interface semantic information; measuring the block's image element density and link density makes it possible to exclude blocks that may be the body text or navigation area of certain pages.
Specifically, the block pruning procedure is shown in Fig. 3. First, the layout segmentation result set is checked for unprocessed blocks; if there are none, the remaining block set is output. If there is one, an unprocessed block EB is selected from the result set and checked for interface elements. If it has an interface element, the block is kept and marked as processed. If it has none, its distance from the bottom of the page is compared with the threshold: if the distance is less than the threshold, the block is deleted directly; if it is greater than or equal to the threshold, the block is checked for text nodes. If it contains no text node, the block is deleted; if it does, the block is deleted when its image element density is greater than γ1 or its link density is greater than γ2, and is otherwise kept and marked as processed.
S45. Check whether the layout block set still contains an unprocessed block; if so, return to step S41; otherwise output all blocks remaining in the layout block set.
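Steps S41 to S45 amount to a single pass over the block set with a short chain of tests. The sketch below illustrates that control flow only. The `Block` fields are hypothetical precomputed features, the patent does not define how the image and link densities are measured, and the default values for the bottom-distance threshold, γ1 (`gamma1`), and γ2 (`gamma2`) are placeholders rather than values from the source.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Block:
    name: str
    has_interface_element: bool
    distance_to_bottom: float   # distance from the bottom of the page
    has_text_node: bool
    image_density: float        # expected in [0, 1]
    link_density: float         # expected in [0, 1]


def prune(blocks: Iterable[Block],
          bottom_threshold: float = 100.0,
          gamma1: float = 0.5,
          gamma2: float = 0.5) -> List[Block]:
    """Return the blocks that survive steps S41 to S45."""
    kept: List[Block] = []
    for block in blocks:                      # S41/S45: visit each block once
        if block.has_interface_element:       # S41: always keep
            kept.append(block)
            continue
        if block.distance_to_bottom < bottom_threshold:
            continue                          # S42: too close to the page bottom
        if not block.has_text_node:
            continue                          # S43: carries no semantic text
        if block.image_density > gamma1 or block.link_density > gamma2:
            continue                          # S44: likely body text or navigation
        kept.append(block)                    # S44: keep as a label candidate
    return kept


blocks = [
    Block("search", True, 50, False, 0.0, 0.0),
    Block("footer", False, 20, True, 0.0, 0.9),
    Block("label", False, 500, True, 0.0, 0.1),
]
print([b.name for b in prune(blocks)])
```

Iterating over the input and collecting survivors replaces the explicit "processed" flag of the flowchart, since each block is visited exactly once.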
S5. Reconstruct the blocks.
After all blocks have been marked as processed, the division and pruning of all blocks is complete. Block reconstruction follows, and it has two purposes:
First, to correct over-division in the layout segmentation stage. That stage uses heuristic rules to judge whether a block needs further division, but because the problem itself is heterogeneous and unstructured, the rules cannot cover every case, so further processing is needed to reduce error. In particular, when the interface element density of the block corresponding to an interactive interface area does not meet the requirement, the block may be divided further, producing an erroneous division. To solve the incomplete interactive interface information caused by such over-division, closely related blocks are merged and reconstructed to finally locate the web page interactive interface area.
Second, to further screen the blocks that contain no interactive interface element and select those more likely to carry interactive interface area information.
Block reconstruction mainly uses the visual design features between blocks. When designing a page, designers usually divide the page's data regions by intuitive visual features, and these visual design features provide good guidance for web data mining.
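The text does not say which visual features decide that two blocks are closely related. As one illustrative possibility only, the sketch below merges blocks whose rectangles lie within a small gap of each other, repeating until no pair qualifies. The proximity criterion, the names `Rect`, `gap`, and `merge_close_blocks`, and the default `max_gap` of 10 pixels are all assumptions, not the patent's method.

```python
from typing import Iterable, List, NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def gap(a: Rect, b: Rect) -> float:
    """Larger of the horizontal and vertical gaps; 0 if the boxes overlap."""
    dx = max(a.left - b.right, b.left - a.right, 0.0)
    dy = max(a.top - b.bottom, b.top - a.bottom, 0.0)
    return max(dx, dy)


def union(a: Rect, b: Rect) -> Rect:
    return Rect(min(a.left, b.left), min(a.top, b.top),
                max(a.right, b.right), max(a.bottom, b.bottom))


def merge_close_blocks(rects: Iterable[Rect], max_gap: float = 10.0) -> List[Rect]:
    """Repeatedly merge any two rectangles that are at most max_gap apart."""
    merged = list(rects)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if gap(merged[i], merged[j]) <= max_gap:
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


label = Rect(0, 0, 100, 20)
text_field = Rect(0, 25, 100, 45)
footer_note = Rect(0, 400, 100, 420)
print(merge_close_blocks([label, text_field, footer_note]))
```

Under this criterion a text label that sits just above an input box is folded back into the same region, which is the kind of over-division the reconstruction step is meant to repair.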
S6. Output the interactive interface.
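Taken together, steps S1 to S6 form a linear pipeline. The skeleton below shows only how the stages chain; each stage is passed in as a function so that no particular rendering engine or rule set is assumed. The function name `detect_interfaces` and its parameter names are hypothetical.

```python
from typing import Callable, List, TypeVar

B = TypeVar("B")  # whatever type represents a rendered block


def detect_interfaces(url: str,
                      render: Callable[[str], B],
                      segment: Callable[[B], List[B]],
                      prune: Callable[[List[B]], List[B]],
                      reconstruct: Callable[[List[B]], List[B]]) -> List[B]:
    """Run S1 to S6 with caller-supplied stage implementations."""
    root = render(url)            # S1 + S2: URL in, box-model tree out
    blocks = segment(root)        # S3: layout segmentation
    blocks = prune(blocks)        # S4: drop noise blocks
    return reconstruct(blocks)    # S5: merge; the result is the S6 output


# Toy stages that stand in for real implementations.
result = detect_interfaces(
    "http://example.com/",
    render=lambda url: "root",
    segment=lambda root: ["nav", "search form", "footer"],
    prune=lambda blocks: [b for b in blocks if "form" in b],
    reconstruct=lambda blocks: blocks,
)
print(result)
```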
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can, based on the technical teachings disclosed by the invention, make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.
Claims (3)
1. A Deep Web query interface detection method, characterized by comprising the following steps:
S1. Input the URL of the web page;
S2. Render the web page: a layout rendering engine converts the presentation of the visible HTML tags into box models;
S3. Segment the layout into blocks;
S4. Prune the blocks;
S5. Reconstruct the blocks;
S6. Output the interactive interface.
2. The Deep Web query interface detection method according to claim 1, characterized in that the rules by which step S3 segments the layout into blocks are:
(1) If the HTML element tag corresponding to block EB is <form>, the current block is not divided;
(2) If the interface element density of block EB is below a preset threshold, the current block is divided; otherwise, it is further determined whether the region covered by the current block contains multiple interactive interface areas: if so, the current block is divided, otherwise it is not;
(3) If block EB has sub-blocks with different background colors, the current block is divided;
(4) If block EB has a separator sub-block, the current block is divided along the separator.
3. The Deep Web query interface detection method according to claim 1, characterized in that step S4 prunes the blocks as follows:
S41. Select a block from the layout segmentation result set and determine whether it contains an interface element; if so, keep the block and mark it as processed; otherwise go to step S42;
S42. Determine whether the distance between the block and the bottom of the Web page is less than a preset threshold; if so, delete the block; otherwise go to step S43;
S43. Determine whether the child node set of the DOM tree node corresponding to the block's box model contains a text node; if so, go to step S44; otherwise delete the block;
S44. Determine whether the image element density of the block is greater than a preset ratio coefficient γ1; if so, delete the block; otherwise further determine whether the link density of the block is greater than a preset threshold γ2; if so, delete the block; otherwise keep the block and mark it as processed, where 0 ≤ γ1 ≤ 1 and 0 ≤ γ2 ≤ 1;
S45. Check whether the layout block set still contains an unprocessed block; if so, return to step S41; otherwise output all blocks remaining in the layout block set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810971193.3A CN109086450B (en) | 2018-08-24 | 2018-08-24 | Web deep network query interface detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109086450A true CN109086450A (en) | 2018-12-25 |
CN109086450B CN109086450B (en) | 2021-08-27 |
Family
ID=64794531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810971193.3A Active CN109086450B (en) | 2018-08-24 | 2018-08-24 | Web deep network query interface detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109086450B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11914598B2 (en) * | 2022-05-27 | 2024-02-27 | Sap Se | Extended synopsis pruning in database management systems |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060031202A1 (en) * | 2004-08-06 | 2006-02-09 | Chang Kevin C | Method and system for extracting web query interfaces |
US20090077180A1 (en) * | 2007-09-14 | 2009-03-19 | Flowers John S | Novel systems and methods for transmitting syntactically accurate messages over a network |
CN101419625A (en) * | 2008-12-02 | 2009-04-29 | 西安交通大学 | Deep web self-adapting crawling method based on minimum searchable mode |
CN101667201A (en) * | 2009-09-18 | 2010-03-10 | 浙江大学 | Integration method of Deep Web query interface based on tree merging |
CN102135976A (en) * | 2010-09-27 | 2011-07-27 | 华为技术有限公司 | Hypertext markup language page structured data extraction method and device |
CN103092913A (en) * | 2012-11-29 | 2013-05-08 | 江苏瑞中数据股份有限公司 | Method for judging deep web query interface by adopting iterative naive Bayesian classifier |
CN103678490A (en) * | 2013-11-14 | 2014-03-26 | 桂林电子科技大学 | Deep Web query interface clustering method based on Hadoop platform |
US20150106355A1 (en) * | 2010-06-18 | 2015-04-16 | Deep Web Technologies, Inc. | Browser based multilingual federated search |
Non-Patent Citations (2)
Title |
---|
周二虎 等: ""基于Deep Web检索的查询结果处理技术的应用"", 《计算机工程与设计》 * |
谭涛 等: ""一种基于深网的个性化信息爬取方法"", 《电脑知识与技术》 * |
Also Published As
Publication number | Publication date |
---|---|
CN109086450B (en) | 2021-08-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||