CN103257981B - Deep Web data surfacing method based on query interface attribute features - Google Patents

Info

Publication number: CN103257981B
Authority: CN (China)
Legal status: Expired - Fee Related
Application number: CN201210191981.3A
Other languages: Chinese (zh)
Other versions: CN103257981A (en)
Inventors: 赵朋朋, 鲜学丰, 辛洁, 郭建兵, 崔志明
Current assignee: Suzhou University
Original assignee: Suzhou University

Abstract

The invention discloses a Deep Web data surfacing method based on query interface attribute features. The method comprises: extracting the query interface schema information; cleaning query-irrelevant attributes; cleaning junk attribute values; classifying the attributes; assembling the queries; forming the query set; and judging whether a given coverage has been reached. If so, the method ends; if not, it is judged whether the query set is empty. If it is empty, data are submitted to the sample library via the domain sample library; if not, data are submitted to the sample library via the data crawling module and the data record extraction module. The data surfacing method of the invention achieves higher data surfacing efficiency and effectively addresses the Top-k problem in query interfaces.

Description

Deep Web data surfacing method based on query interface attribute features
Technical field
The present invention relates to a Deep Web data integration method, and in particular to a data crawling method for Deep Web data sources.
Background
As Web technologies such as HTML and HTTP have matured, the number of websites and web pages on the Internet has grown exponentially. According to the latest survey published by the Internet research firm Netcraft, there were 644,275,754 active websites worldwide in March 2012, 31.4 million more than in February of the same year, an increase of 5.1%. In addition, data from VeriSign, another Internet research organization, show that 6 million Internet domain names were added in the fourth quarter of 2011, bringing the global total to 225 million.
In addition, more and more websites use web page template technology, which pushes website data progressively "deeper". The server receives a query submitted by a user, dynamically generates data records from a back-end database and fills them into a fixed web page template, so the data in the website cannot be reached through predefined hyperlinks. A traditional web crawler obtains the pages to crawl only by following existing hyperlinks, so the data generated through web page templates are hidden from traditional search engines. We call such websites the Deep Web (also known as the Hidden Web or Invisible Web), while websites whose entire data can be accessed through static hyperlinks are called the Surface Web.
Summary of the invention
In view of the problems and shortcomings of existing Deep Web data surfacing methods, the object of the present invention is to provide a Deep Web data surfacing method based on query interface attribute features, thereby improving the efficiency of data surfacing.
To achieve the above technical purpose and technical effect, the present invention is realized through the following technical solution:
A Deep Web data surfacing method based on query interface attribute features, comprising the following steps:
Step 1) extract the query interface schema information;
Step 2) clean the query-irrelevant attributes;
Step 3) clean the junk attribute values;
Step 4) classify the attributes;
Step 5) determine whether the attribute is a range attribute; if so, perform step 6; if not, perform step 7;
Step 6) sample using the range attribute, divide the value interval of the range attribute according to the distribution of the samples over the interval, and then perform step 11;
Step 7) determine whether the attribute is a categorical attribute; if so, perform step 8; if not, perform step 9;
Step 8) extract the candidate values, build a hierarchical tree, and check for overflow queries; if a query overflows, perform step 9; if not, perform step 11;
Step 9) determine whether the attribute is a text attribute; if so, perform step 10;
Step 10) obtain the candidate values, screen them based on coverage and on mutual information respectively, and perform step 11;
Step 11) assemble the queries;
Step 12) form the query set;
Step 13) judge whether a given coverage has been reached; if so, the method ends; if not, perform step 14;
Step 14) judge whether the query set is empty; if so, perform step 15; if not, perform step 16;
Step 15) submit data to the sample library via the domain sample library, and then obtain the candidate values as in step 10;
Step 16) submit data to the sample library via the data crawling module and the data record extraction module, and then obtain the candidate values as in step 10.
Further, the hierarchical tree is constructed as follows:
a. create a virtual root node of the tree, which represents all data records in the target database;
b. each edge leaving the root node represents one attribute value of $a_{q,1}$; the $i$-th node of the second layer of the tree represents the set of data records obtained with $a_{q,1} = v_{1,i}$ as the query condition;
c. if the number of data records hit by the query is 0, the node is marked as an empty node; if the number is greater than 0 and less than or equal to $k$, the node is marked as a valid leaf node; otherwise, if the number is greater than $k$, the node is marked as an overflow node;
d. taking each overflow node of the second layer as a root node in turn, expand the hierarchical tree in the same way with the candidate values of the second categorical attribute $a_{q,2}$;
e. continue expanding the hierarchical tree in the same way until no overflow leaf node remains in the tree, or no untraversed attribute remains in $A_{multi}$.
The constructed hierarchical tree is optimal, that is, it minimizes the number of query submissions, if and only if the attributes in the attribute sequence are arranged in ascending order of the size of their value spaces.
Further, the candidate values are screened as follows:
a. count the number of data records hit by the overflow query before the text attribute $a_{q,i}$ is added to the submitted attribute set, and denote it $num_{overflow}$; let $num_{valid}$ be the total number of data records hit when the text attribute takes different candidate values and is added to the query sequence, with initial value 0;
b. if $Que_i$ has no untraversed element, data surfacing on this attribute ends; otherwise go to step c;
c. select the first unvisited element of the sequence $Que_i$ as the value of the text attribute and add it to the query submission sequence; assign the number of data records hit by this query to the temporary variable $num_{tmp}$;
d. add $num_{tmp}$ to the current value of $num_{valid}$ and assign the result to $num_{valid}$; if $num_{valid} / num_{overflow} > C$, where $C$ is the query unit surfacing ratio threshold, data surfacing on this attribute ends; otherwise, return to step b.
The principle of the present invention is as follows.
Definition of the data surfacing method:
1) Problem definition
In a Deep Web site, users can obtain matching data records only by submitting queries. Using a computer to simulate the user's query submission process in order to obtain the deep data of a Deep Web site is called data surfacing. In practice, one must consider not only the degree to which the Web database is surfaced but also the cost of query submission. The higher the query cost, the longer surfacing takes and the heavier the load on the target server. Moreover, more and more Web servers limit the number of accesses from a single IP address per unit of time, so improving the efficiency of data surfacing has become an urgent problem.
To surface the data in a target Web database efficiently and as completely as possible, each query should return as many data records as possible, and the result sets of different queries should overlap as little as possible. However, a survey of Web databases in several domains found that 89% of sites cap the number of data records returned per query. We denote by $k$ the maximum number of data records that a single query can surface on such a site, and call this phenomenon Top-k. When the number of data records hit by a query is greater than $k$, only the top $k$ ranked records are returned; such a query is called an overflow query. When the number of data records hit is 0, no data record in the target site matches the query; such a query is called an underflow query. All other queries are called valid queries.
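The three query categories defined above can be expressed as a small helper. This is a minimal sketch; the names `QueryKind` and `classify_query` are hypothetical and do not come from the patent.

```python
from enum import Enum


class QueryKind(Enum):
    UNDERFLOW = "underflow"   # the query hits no record
    VALID = "valid"           # between 1 and k records are hit
    OVERFLOW = "overflow"     # more than k records are hit, only the top k are returned


def classify_query(hit_count: int, k: int) -> QueryKind:
    """Classify a query by its hit count, given the site's Top-k limit k."""
    if hit_count < 0 or k <= 0:
        raise ValueError("hit_count must be >= 0 and k must be > 0")
    if hit_count == 0:
        return QueryKind.UNDERFLOW
    if hit_count > k:
        return QueryKind.OVERFLOW
    return QueryKind.VALID
```

Only overflow queries need further refinement, because part of their result set stays hidden behind the Top-k limit.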
2) Classification and cleaning of query interface attributes
According to the HTML control corresponding to each element of the query interface attribute set, the range of its value domain and its role in the query condition, the attributes in a query interface can be divided into the following three classes:
(1) Range attributes
The data in the target Web database (WDB) are retrieved by specifying a range for the attribute value in the query interface. The HTML controls for such an attribute are usually two text boxes that receive the lower and upper bounds of the interval. Common examples are starting price and starting time.
(2) Categorical attributes
The query interface specifies a finite set of candidate values for the attribute. The user retrieves the data in the target WDB by selecting among the candidate values. Common HTML controls for categorical attributes include drop-down list boxes, radio button groups and check boxes.
(3) Text attributes
The user assigns the attribute a value freely, and the data records in the target WDB that contain this value are retrieved. The HTML control for such an attribute is a free text box, and the value domain of the attribute cannot be obtained from the query interface.
Note that if an attribute that is semantically a range attribute is rendered in the query interface as a drop-down list box, it is classified as a categorical attribute.
In addition, the value of a categorical or text attribute is related to the corresponding field of a matching data record by textual containment, whereas the value of a range attribute is related to the corresponding field by interval containment. This is the essential difference between range attributes and the other attribute types. Categorical and text attributes differ mainly in how their candidate values are obtained.
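The classification rules above can be sketched as follows. The `Attribute` record, the control labels ("text", "select", "radio", "checkbox") and the `is_range_semantics` flag are illustrative assumptions; the patent does not specify how the interface schema is represented.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical labels for selection-style HTML controls.
SELECTION_CONTROLS = {"select", "radio", "checkbox"}


@dataclass
class Attribute:
    name: str
    controls: List[str]               # controls that render the attribute
    is_range_semantics: bool = False  # True if the attribute denotes an interval


def classify_attribute(attr: Attribute) -> str:
    """Return 'range', 'categorical' or 'text' for a query interface attribute."""
    # Any selection control makes the attribute categorical, including a
    # semantically range-like attribute rendered as a drop-down list.
    if any(c in SELECTION_CONTROLS for c in attr.controls):
        return "categorical"
    # Two free text boxes with interval semantics: lower and upper bound.
    if attr.is_range_semantics and attr.controls == ["text", "text"]:
        return "range"
    return "text"
```

For example, a price attribute with two text boxes is classified as a range attribute, while the same attribute offered as a drop-down list of price bands is classified as categorical.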
Among the attributes and candidate values extracted from the query interface, not every HTML control is useful for querying. Examples are a drop-down list box that specifies the sort order, or a list box that specifies how records are displayed or how many records are shown. Submitting values for these attributes does not affect which data records are returned. We define them as query-irrelevant attributes. The controls corresponding to query-irrelevant attributes have distinctive features. Furthermore, the candidate values of a categorical attribute in the query interface are not always useful: a value such as "all" or "no limit" clearly places no restriction on the query for that attribute, and we filter out such junk attribute values. The following heuristic rules can be used to identify query-irrelevant attributes and to clean junk attribute values.
(1) The controls corresponding to query-irrelevant attributes are selection controls such as drop-down list boxes and check boxes, and such a control is usually preceded by a distinctive text label such as "sort" or "number of records displayed".
(2) If a candidate value of a drop-down list contains a keyword such as "all" or "no limit", it is marked as a junk attribute value.
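The two heuristic rules might be implemented along the following lines. The keyword lists are assumed English equivalents of the examples in the text, and the function names and control labels are hypothetical; a real deployment would need keywords for the language of the target sites.

```python
import re
from typing import List

SELECTION_CONTROLS = {"select", "radio", "checkbox"}
# Assumed label keywords for rule (1).
IRRELEVANT_LABEL_KEYWORDS = ("sort", "records per page", "records displayed")
# Assumed value keywords for rule (2); whole-word match so "Dallas" is kept.
JUNK_VALUE_PATTERN = re.compile(r"\b(all|no limit)\b", re.IGNORECASE)


def is_query_irrelevant(label: str, control: str) -> bool:
    """Rule (1): a selection control whose label carries a telltale keyword."""
    text = label.lower()
    return control in SELECTION_CONTROLS and any(
        keyword in text for keyword in IRRELEVANT_LABEL_KEYWORDS
    )


def clean_candidate_values(values: List[str]) -> List[str]:
    """Rule (2): drop candidate values that do not restrict the query."""
    return [v for v in values if not JUNK_VALUE_PATTERN.search(v)]
```

Matching on whole words rather than substrings is a design choice made here to avoid discarding legitimate values that merely contain a keyword.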
The beneficial effects of the invention are as follows:
(1) Large volume of information. The volume of Deep Web data is huge, between 66,800 and 91,850 TB, which is 400 to 500 times the information in the Surface Web, and Deep Web sites receive more than 15% more visits than Surface Web sites. In recent years the number of Deep Web sites has grown rapidly. The CompletePlanet website collected statistics on 60 Deep Web sites; the results show that the total data volume of these 60 sites reaches 7500,000G.
(2) Rapid growth of information. The scale of information in the Deep Web is still rising quickly: according to statistics, there were 74,000 query interfaces in 2006, and the number had reached 600,000 by early 2011, an average annual growth rate of 142.2%.
(3) High information quality. More than 50% of the data in Deep Web sites are high-quality structured data, and the data have very strong domain relevance and usability.
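As a quick arithmetic check on item (2), the 142.2% figure appears consistent with a simple (non-compounded) average over five years. The five-year span and the averaging method are inferences from the stated numbers, not something the text spells out.

```python
def simple_average_annual_growth(start: float, end: float, years: float) -> float:
    """Total relative growth divided evenly over the given number of years."""
    return (end - start) / start / years


# 74,000 query interfaces in 2006 and 600,000 in early 2011 (assumed 5 years).
rate = simple_average_annual_growth(74_000, 600_000, 5)
print(f"{rate:.1%}")  # prints 142.2%
```

A compound annual growth rate computed over the same period would be considerably lower, so the figure should be read as a simple average.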
In addition, more than 95% of the information in the Deep Web is publicly accessible, that is, it can be obtained over the network free of charge, which makes Deep Web data acquisition feasible. Deep Web data acquisition allows data from different sites in the same domain to be aggregated and returned to the user, and localizing the data of a particular site makes it possible to analyze how that site's data change. In practice, price comparison systems and price trend analysis in e-commerce already use Deep Web data acquisition.
Therefore, given the above advantages of the Deep Web and the feasibility and practicality of acquiring its data, research on Deep Web data surfacing methods is of great significance.
Brief description of the drawings
Fig. 1 is a flowchart of the data surfacing system of the present invention;
Fig. 2 is an example of a hierarchical tree built by the present invention using categorical attributes.
Detailed description
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with an embodiment.
A Deep Web data surfacing method based on query interface attribute features, comprising the following steps:
Step 1) extract the query interface schema information;
Step 2) clean the query-irrelevant attributes;
Step 3) clean the junk attribute values;
Step 4) classify the attributes;
Step 5) determine whether the attribute is a range attribute; if so, perform step 6; if not, perform step 7;
Step 6) sample using the range attribute, divide the value interval of the range attribute according to the distribution of the samples over the interval, and then perform step 11;
Step 7) determine whether the attribute is a categorical attribute; if so, perform step 8; if not, perform step 9;
Step 8) extract the candidate values, build a hierarchical tree, and check for overflow queries; if a query overflows, perform step 9; if not, perform step 11;
Step 9) determine whether the attribute is a text attribute; if so, perform step 10;
Step 10) obtain the candidate values, screen them based on coverage and on mutual information respectively, and perform step 11;
Step 11) assemble the queries;
Step 12) form the query set;
Step 13) judge whether a given coverage has been reached; if so, the method ends; if not, perform step 14;
Step 14) judge whether the query set is empty; if so, perform step 15; if not, perform step 16;
Step 15) submit data to the sample library via the domain sample library, and then obtain the candidate values as in step 10;
Step 16) submit data to the sample library via the data crawling module and the data record extraction module, and then obtain the candidate values as in step 10.
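The control loop formed by steps 10 to 16 can be sketched as below. This is an illustrative reading of the flow, not the patented implementation: every callable (`run_query`, `next_queries`, `coverage_reached`) and the `max_rounds` safety bound are hypothetical stand-ins for the modules named in the steps.

```python
from typing import Callable, Iterable, List, Set


def surface(
    initial_queries: List[str],
    run_query: Callable[[str], Iterable[str]],       # crawl + record extraction
    domain_samples: Iterable[str],                   # domain sample library
    next_queries: Callable[[Set[str]], List[str]],   # steps 10 to 12
    coverage_reached: Callable[[Set[str]], bool],    # step 13
    max_rounds: int = 100,                           # assumed safety bound
) -> Set[str]:
    """Repeat query rounds until the coverage test passes."""
    sample_library: Set[str] = set()
    surfaced: Set[str] = set()
    queries = list(initial_queries)
    for _ in range(max_rounds):
        if coverage_reached(surfaced):               # step 13
            break
        if not queries:                              # step 14 -> step 15
            sample_library.update(domain_samples)
        else:                                        # step 14 -> step 16
            for query in queries:
                records = set(run_query(query))
                surfaced |= records
                sample_library |= records
        queries = next_queries(sample_library)       # back to step 10
    return surfaced
```

When no query can be formed yet, the loop seeds the sample library from same-domain data, which is what lets candidate values for text attributes be generated in the next round.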
Data surfacing methods for the query interface attributes:
1) Range attributes
Denote the interval partition we seek by $O$. If every sub-interval represented by an element of $O$, except the last, contains $k$ data records, then $O$ is an optimal partition. We take the number of data records $k$ in a sub-interval as the criterion for dividing each sub-interval.
For any element $O_i$ of $O$, denote the upper bound of the sub-interval it represents by $y_i$, and let $y_0 = 0$. Then $O_i$ can be written as $[y_{i-1}, y_i]$. Likewise, denote the upper bound of each sample sub-interval by $x_j$, so that each sample sub-interval can be written as $[x_{j-1}, x_j]$, with $x_0 = 0$. The number of data records in the sample interval $[x_{j-1}, x_j]$ is denoted $rnum_j$. The value of $y_i$ can then be calculated in the following steps:
Let $x_t$ be the first element of the sequence of sample bounds that is greater than $y_i$; the value of $q$ is as follows:
The value of $y_i$ is given by the following formula,
The partition $O$ obtained in this way ensures that every sub-interval except the one represented by the last element contains $k$ data records, so the resulting $O$ is an optimal partition. The sub-interval represented by each element of $O$ is assigned to the corresponding range attribute, and the query is submitted. To ensure that adjacent sub-intervals do not intersect at all, we add a very small value to the lower bound of each sub-interval obtained by the partition. For example, the $i$-th sub-interval is expressed as $[y_{i-1} + \iota, y_i]$, where $\iota$ is a very small value.
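One way to obtain such a partition is sketched below. The formulas for $q$ and $y_i$ are not reproduced in this text, so the sketch substitutes a common technique: linear interpolation inside each sample interval, assuming records are spread uniformly within it. The function name, the `eps` parameter (standing in for $\iota$) and the assumption that `rnums` holds record counts already scaled to the target database are all illustrative.

```python
from typing import List, Tuple


def partition_range(
    xs: List[float], rnums: List[float], k: float, eps: float = 1e-9
) -> List[Tuple[float, float]]:
    """xs = [x_0, ..., x_m] (increasing); rnums[j-1] = records in [x_{j-1}, x_j].

    Returns sub-intervals holding k records each, except possibly the last.
    """
    if len(rnums) < 1 or len(xs) != len(rnums) + 1 or k <= 0:
        raise ValueError("need m >= 1 counts, m + 1 bounds and k > 0")
    bounds = [xs[0]]
    need = k                          # records still missing in the open sub-interval
    for j, count in enumerate(rnums):
        pos, hi, left = xs[j], xs[j + 1], count
        while left > 0 and left >= need:
            # Uniform-density assumption: advance far enough to cover `need` records.
            pos += (hi - pos) * need / left
            bounds.append(pos)
            left -= need
            need = k
        need -= left                  # carry the remainder into the next sample interval
    if len(bounds) == 1 or bounds[-1] < xs[-1]:
        bounds.append(xs[-1])         # the last sub-interval may hold fewer than k
    pairs = list(zip(bounds[:-1], bounds[1:]))
    # Shift every lower bound except the first by eps so sub-intervals are disjoint.
    return [pairs[0]] + [(lo + eps, hi) for lo, hi in pairs[1:]]
```

For instance, with bounds `[0, 10, 20]`, counts `[20, 80]` and `k = 50`, the first cut falls at 13.75, because 30 of the 80 records in the second sample interval are needed to complete the first group of 50.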
2) Categorical attributes
If the query interface attribute set contains several categorical attributes, denote the set they form by $A_{multi} = \{a_{q,1}, a_{q,2}, \dots, a_{q,n}\}$, and denote the value domain of element $a_{q,i}$ by $V_i$, whose $j$-th value is written $v_{i,j}$. We use $A_{multi}$ and the value domains of its elements to construct a hierarchical tree. The hierarchical tree is a multiway tree over the query attributes. Traversing the tree partitions the set of data records in the target WDB. The hierarchical tree is constructed as follows.
(1) Create a virtual root node of the tree, which represents all data records in the target database.
(2) Each edge leaving the root node represents one attribute value of $a_{q,1}$. The $i$-th node of the second layer of the tree represents the set of data records obtained with $a_{q,1} = v_{1,i}$ as the query condition.
(3) If the number of data records hit by the query is 0, the node is marked as an empty node. If the number is greater than 0 and less than or equal to $k$, the node is marked as a valid leaf node. Otherwise, if the number is greater than $k$, the node is marked as an overflow node.
(4) Taking each overflow node of the second layer as a root node in turn, expand the hierarchical tree in the same way with the candidate values of the second categorical attribute $a_{q,2}$.
(5) Continue expanding the hierarchical tree in the same way until no overflow leaf node remains in the tree, or no untraversed attribute remains in $A_{multi}$.
The constructed hierarchical tree is optimal, that is, it minimizes the number of query submissions, if and only if the attributes in the attribute sequence are arranged in ascending order of the size of their value spaces, namely when $|V_1| \le |V_2| \le \dots \le |V_n|$.
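The construction steps can be sketched as a depth-first expansion. The function `build_queries` and the `count_hits` callback (which stands in for submitting a query and reading the hit count) are hypothetical; the sketch returns the leaf queries rather than an explicit tree object.

```python
from typing import Callable, Dict, List, Tuple

Condition = Dict[str, str]


def build_queries(
    attrs: List[Tuple[str, List[str]]],       # (attribute name, candidate values)
    count_hits: Callable[[Condition], int],   # hypothetical probe of the target site
    k: int,
) -> Tuple[List[Condition], List[Condition], int]:
    """Return (valid leaf queries, unresolved overflow queries, number of probes)."""
    ordered = sorted(attrs, key=lambda a: len(a[1]))  # ascending value-space size
    valid: List[Condition] = []
    overflow: List[Condition] = []
    probes = 0

    def expand(prefix: Condition, depth: int) -> None:
        nonlocal probes
        name, values = ordered[depth]
        for value in values:
            cond = {**prefix, name: value}
            hits = count_hits(cond)
            probes += 1
            if hits == 0:
                continue                      # empty node
            if hits <= k:
                valid.append(cond)            # valid leaf node
            elif depth + 1 < len(ordered):
                expand(cond, depth + 1)       # overflow node: expand with next attribute
            else:
                overflow.append(cond)         # still overflowing, no attribute left

    if ordered:
        expand({}, 0)
    return valid, overflow, probes
```

Queries that still overflow after every categorical attribute has been used are returned separately, since the method then falls through to the text attributes.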
3) Text attributes
We obtain the candidate values of a text attribute as follows. If some data records can be surfaced through attributes of the other types, these records are stored in the sample library $S$. We then compute the distribution of the data records in $S$ over the candidate values of each text attribute, and store the statistics in the set of queues $Que$. Each element $Que_i$ of $Que$ is the candidate value queue of the text attribute $a_{q,i}$, and the elements of the queue are sorted in descending order of their probability of occurrence in the sample library.
The candidate value probabilities are only meaningful when the sample library is large enough. When no data record from the target WDB is available as a sample, we use data records from other data sources in the same domain as the sample library, because most structured Web databases are domain-specific, and databases in the same domain often have identical or similar attributes and attribute value distributions.
The more data records from the target data source the sample library contains, the closer the distribution of attribute values in the sample library is to the true distribution of the data records in the target WDB over that attribute. Therefore, extracting the data records returned by each submitted query into structured data and adding them to the sample library improves the quality of the sample. If a result was obtained by using a candidate value of $a_{q,i}$ as the query condition, then when the sample library is updated, only the occurrence frequencies of the candidate values in the queues other than $Que_i$ are updated.
For categorical attributes, generating the query submission sequence in ascending order of value space size yields the smallest search space. Since the exact value domain of a text attribute cannot be obtained, we take the number of candidate values of each text attribute that occur in the sample library as the size of its value space. As with categorical attributes, the attribute with the smallest value space is selected first when submitting queries.
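Building the candidate queues and ordering the text attributes can be sketched as follows, assuming the sample library is a collection of records represented as attribute-to-value mappings. The function names are hypothetical, and frequency counts are used in place of probabilities since they induce the same ordering.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping


def build_candidate_queues(
    samples: Iterable[Mapping[str, str]], text_attrs: List[str]
) -> Dict[str, List[str]]:
    """For each text attribute, list its values by descending sample frequency."""
    counters: Dict[str, Counter] = {attr: Counter() for attr in text_attrs}
    for record in samples:
        for attr in text_attrs:
            value = record.get(attr)
            if value:
                counters[attr][value] += 1
    # most_common() sorts by count, highest first.
    return {attr: [v for v, _ in c.most_common()] for attr, c in counters.items()}


def order_attributes(queues: Dict[str, List[str]]) -> List[str]:
    """Attributes with fewer observed candidate values are queried first."""
    return sorted(queues, key=lambda attr: len(queues[attr]))
```

A usage example with a toy sample library:

```python
samples = [
    {"author": "Li", "publisher": "A"},
    {"author": "Li", "publisher": "B"},
    {"author": "Wang", "publisher": "A"},
    {"author": "Li"},
    {"author": "Zhao", "publisher": "A"},
]
queues = build_candidate_queues(samples, ["author", "publisher"])
order = order_attributes(queues)
```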
For any queue $Que_i$, the elements at the tail of the queue are the candidate values with low probability of occurrence in the sample library. This means that using these candidate values as query conditions will return few data records. Discarding some elements at the tail of $Que_i$ therefore greatly improves the efficiency of data surfacing. Here a method based on unit coverage is used to decide which attribute values to discard and thereby improve the efficiency of data surfacing. The candidate values are screened as follows:
(1) Count the number of data records hit by the overflow query before the text attribute $a_{q,i}$ is added to the submitted attribute set, and denote it $num_{overflow}$. Let $num_{valid}$ be the total number of data records hit when the text attribute takes different candidate values and is added to the query sequence; the initial value of $num_{valid}$ is 0.
(2) If $Que_i$ has no untraversed element, data surfacing on this attribute ends; otherwise go to step 3.
(3) Select the first unvisited element of the sequence $Que_i$ as the value of the text attribute and add it to the query submission sequence. Assign the number of data records hit by this query to the temporary variable $num_{tmp}$.
(4) Add $num_{tmp}$ to the current value of $num_{valid}$ and assign the result to $num_{valid}$. If $num_{valid} / num_{overflow} > C$, data surfacing on this attribute ends. Otherwise, return to step 2.
We consider that once the proportion of data records surfaced through the text attribute exceeds $C$, the candidate values that have not yet been traversed are discarded. $C$ is called the query unit surfacing ratio threshold.
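The screening steps above amount to a short loop. In this sketch `hits_for` is a hypothetical callback that submits the query with the given candidate value and returns its hit count, and the stopping test reads the threshold as the ratio of $num_{valid}$ to $num_{overflow}$ exceeding $C$, which is an interpretation of the text.

```python
from typing import Callable, List


def screen_candidates(
    queue: List[str],                 # Que_i, most frequent value first
    hits_for: Callable[[str], int],   # hypothetical: hit count for one candidate value
    num_overflow: int,                # hits of the overflow query being refined
    c: float,                         # query unit surfacing ratio threshold C
) -> List[str]:
    """Return the candidate values actually submitted, in queue order."""
    if num_overflow <= 0:
        raise ValueError("num_overflow must be positive")
    submitted: List[str] = []
    num_valid = 0                     # step (1)
    for value in queue:               # steps (2) and (3)
        submitted.append(value)
        num_valid += hits_for(value)  # step (4)
        if num_valid / num_overflow > c:
            break                     # the remaining tail values are discarded
    return submitted
```

With hit counts of 50, 30, 15 and 5 against an overflow query of 100 records and `c = 0.9`, the loop stops after the third value, so the rarest candidate is never submitted.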
The above embodiment merely illustrates the technical concept and features of the present invention. Its purpose is to enable those of ordinary skill in the art to understand and implement the invention, and it does not limit the scope of protection of the invention. All equivalent changes or modifications made according to the essence of the invention shall fall within the scope of protection of the present invention.

Claims (2)

1. A Deep Web data surfacing method based on query interface attribute features, characterized in that it comprises the following steps:
Step 1) extract the query interface schema information;
Step 2) clean the query-irrelevant attributes;
Step 3) clean the junk attribute values;
Step 4) classify the attributes;
Step 5) determine whether the attribute is a range attribute; if so, perform step 6; if not, perform step 7;
Step 6) sample using the range attribute and, according to the distribution of the samples over the interval, divide the value interval of the range attribute using the sampling-based range attribute value domain partitioning method, and then perform step 11;
Step 7) determine whether the attribute is a categorical attribute; if so, perform step 8; if not, perform step 9;
Step 8) extract the candidate values, build a hierarchical tree from the multiple categorical attributes and their corresponding value domains, and perform overflow query detection; if the query overflows, perform step 9; if it does not, perform step 11;
Step 9) determine whether the attribute is a text attribute; if so, perform step 10;
Step 10) obtain the candidate values, screen them based on coverage and on mutual information respectively, and perform step 11;
Step 11) assemble the queries;
Step 12) form the query set;
Step 13) judge whether a given coverage has been reached; if so, the method ends; if not, perform step 14;
Step 14) judge whether the query set is empty; if so, perform step 15; if not, perform step 16;
Step 15) submit data to the sample library via the domain sample library, and then obtain the candidate values as in step 10;
Step 16) submit data to the sample library via the data crawling module and the data record extraction module, and then obtain the candidate values as in step 10.
2. The Deep Web data surfacing method based on query interface attribute features according to claim 1, characterized in that the hierarchical tree is constructed as follows:
a. create a virtual root node of the tree, which represents all data records in the target database;
b. each edge leaving the root node represents one attribute value of $a_{q,1}$; the $i$-th node of the second layer of the tree represents the set of data records obtained with $a_{q,1} = v_{1,i}$ as the query condition;
c. if the number of data records hit by the query is 0, the node is marked as an empty node; if the number is greater than 0 and less than or equal to $k$, the node is marked as a valid leaf node; otherwise, if the number is greater than $k$, the node is marked as an overflow node;
d. taking each overflow node of the second layer as a root node in turn, expand the hierarchical tree in the same way with the candidate values of the second categorical attribute $a_{q,2}$;
e. continue expanding the hierarchical tree in the same way until no overflow leaf node remains in the tree, or no untraversed attribute remains in $A_{multi}$;
the constructed hierarchical tree is optimal, that is, it minimizes the number of query submissions, if and only if the attributes in the attribute sequence are arranged in ascending order of the size of their value spaces;
$a$ is short for attribute; $q$, the subscript of $a$, is short for query and indicates that the attribute $a$ is an attribute of the query interface; $V_i$ is the $i$-th value in the value domain of the attribute; $A_{multi}$ is the set formed by the multiple categorical attributes; $k$ is the $k$ of a top-k query, that is, a single query returns at most $k$ results.
3. The Deep Web data surfacing method based on query interface attribute features according to claim 1, characterized in that the candidate values are screened as follows:
a. count the number of data records hit by the overflow query before the text attribute $a_{q,i}$ is added to the submitted attribute set, and denote it $num_{overflow}$; let $num_{valid}$ be the total number of data records hit when the text attribute takes different candidate values and is added to the query sequence, with initial value 0;
b. if $Que_i$ has no untraversed element, data surfacing on this attribute ends; otherwise go to step c;
c. select the first unvisited element of the sequence $Que_i$ as the value of the text attribute and add it to the query submission sequence; assign the number of data records hit by this query to the temporary variable $num_{tmp}$;
d. add $num_{tmp}$ to the current value of $num_{valid}$ and assign the result to $num_{valid}$; if $num_{valid} / num_{overflow} > C$, data surfacing on this attribute ends; otherwise, return to step b;
$a_{q,i}$ is the $i$-th text attribute in the query interface; $q$, the subscript of $a$, is short for query and indicates that the attribute $a$ is an attribute of the query interface; $i$ is an index; $Que_i$ is the $i$-th queue in the set of queues $Que$; $C$ is the query unit surfacing ratio threshold.
Application number: CN201210191981.3A
Priority date: 2012-06-12
Filing date: 2012-06-12
Title: Deep Web data surfacing method based on query interface attribute features
Legal status: Expired - Fee Related
Publications: CN103257981A (2013-08-21), CN103257981B (2016-04-13)
Family ID: 48961910
Country: CN

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281714A (en) * 2014-10-29 2015-01-14 南通大学 Hospital portal website clinic specialist information extracting system
CN105528414B (en) * 2015-12-04 2019-07-05 北京航空航天大学 A kind of crawler method and system for collecting deep network data complete or collected works
CN105512484B (en) * 2015-12-10 2019-03-19 湘潭大学 A kind of data correlation method using characteristic value similarity
CN109446440B (en) * 2018-10-08 2021-02-05 暨南大学 Deep network query interface integration method, system, computing device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320370A (en) * 2008-05-16 2008-12-10 崔志明 Deep Web data source classification management method based on query interface link graph
CN101582074A (en) * 2009-01-21 2009-11-18 东北大学 Method for extracting data of DeepWeb response webpage
CN101667201A (en) * 2009-09-18 2010-03-10 浙江大学 Integration method of Deep Web query interface based on tree merging

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7529740B2 (en) * 2006-08-14 2009-05-05 International Business Machines Corporation Method and apparatus for organizing data sources

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
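Automatic identification of Deep Web query interfaces; Gao Ling et al.; Computer Technology and Development (《计算机技术与发展》); 20070531; 148-151 *
Deep Web data extraction method based on a hierarchical tree model; Tian Jianwei et al.; Journal of Computer Research and Development (《计算机研究与发展》); 20111231; 94-102 *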

Also Published As

Publication number Publication date
CN103257981A (en) 2013-08-21

Similar Documents

Publication Publication Date Title
US7672943B2 (en) Calculating a downloading priority for the uniform resource locator in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score for targeted web crawling
CN101364239B (en) Method for auto constructing classified catalogue and relevant system
CN101273350B (en) Click distance determination
US8959091B2 (en) Keyword assignment to a web page
US20060095430A1 (en) Web page ranking with hierarchical considerations
US20120047180A1 (en) Method and system for processing a group of resource identifiers
CN102306176B (en) On-line analytical processing (OLAP) keyword query method based on intrinsic characteristic of data warehouse
CN103324645A (en) Method and device for recommending webpage
CN1755682A (en) System and method for ranking search results using link distance
US20130046747A1 (en) Synthesizing directories, domains, and subdomains
CN101894170A (en) Semantic relationship network-based cross-mode information retrieval method
CN105512143A (en) Method and device for web page classification
CN102411617B (en) Method for storing and inquiring a large quantity of URLs
CN104133868B (en) A kind of strategy integrated for the classification of vertical reptile data
CN102955810B (en) A kind of Web page classification method and equipment
CN103257981B (en) Deep web data based on query interface attributive character is come to the surface method
CN103714149A (en) Self-adaptive incremental deep web data source discovery method
Vieira et al. Finding seeds to bootstrap focused crawlers
CN103116635A (en) Field-oriented method and system for collecting invisible web resources
CN103955480A (en) Method and equipment for determining target object information corresponding to user
CN103279525B (en) A kind of Multi-condition linkage searching method optimized based on Hash
CN105808761A (en) Solr webpage sorting optimization method based on big data
KR20120020558A (en) Folksonomy-based personalized web search method and system for performing the method
CN105159899A (en) Searching method and searching device
Umagandhi et al. Time dependent approach for query and url recommendations using search engine query logs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160413

Termination date: 20210612