CN103970898A - Method and device for extracting information based on multistage rule base - Google Patents
- Publication number
- CN103970898A CN103970898A CN201410227611.XA CN201410227611A CN103970898A CN 103970898 A CN103970898 A CN 103970898A CN 201410227611 A CN201410227611 A CN 201410227611A CN 103970898 A CN103970898 A CN 103970898A
- Authority
- CN
- China
- Prior art keywords
- webpage
- information
- rule
- module
- tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
An information extraction method based on a multistage rule base comprises the steps of: (1) obtaining the URL addresses of web pages; (2) downloading the web pages corresponding to the URL addresses; (3) obtaining a tree structure diagram of each web page; (4) clustering the web pages, wherein web pages are selected from the pages to be clustered as a training set and the clustering rules are defined by a machine learning method; (5) extracting the search results; and (6) aggregating and displaying the information. Because the tree structure diagram is generated in step (3) and the pages are clustered in step (4), the recall of the retrieved information is effectively improved. The clustering rules are generated automatically by machine learning from the training set, so no manual clustering is needed; this raises the degree of automation of the search and makes large-scale use feasible while recall is maintained. An information extraction device based on the multistage rule base provides the hardware basis for the information extraction process; it is inexpensive and suitable for large-scale use.
Description
Technical Field
The present invention relates to the technical field of computer search engines, and in particular to an information extraction method and device.
Background
With the spread and application of computers and networks, the whole world has entered the age of big information, in which the information search engine has become an indispensable key technology. Current information search engines use the following four kinds of information extraction methods:
1. Information extraction based on HTML structure. This technique extracts information according to the structural features of HTML: using the tree structure of the DOM model, extracting information from a web page is equivalent to extracting node information from the tree. Drawback: when the page changes too much, the information can no longer be extracted.
2. Web information extraction based on natural language. This technique ignores the page structure and the markup tags, and analyzes the text of the web page solely through the relations that exist within the natural language itself. Drawback: extraction is slow, and when a multi-subject Web document is processed without first dividing the subjects into blocks, extraction easily fails.
3. Information extraction based on an ontology. An ontology consists of the concepts, attributes, relations, constraints and terms of a domain. The technique mainly uses the ontology's description of the data in that domain and, without considering the structure of the Web page, extracts information solely from the semantic features of the data. Drawback: although the method is flexible and adaptable, its degree of automation is low.
4. Information extraction based on wrapper learning. A professional Internet developer analyzes the structure of a website and then hand-writes a wrapper program; each wrapper works for only one class of web pages. Drawback: a large number of web pages requires a large number of structures to be analyzed, and many websites have complicated structures. Even for professionals, writing each wrapper takes a great deal of time, and much effort is spent on analyzing website structure and debugging the program.
Summarizing the four approaches: methods that depend little on the HTML document structure are highly automated but cannot handle web pages with complicated structure, so their extraction accuracy is low and their practicality is poor. Methods that depend heavily on the HTML document structure can handle pages with complicated structure, but their degree of automation is low. In short, extraction that relies on manual participation is accurate but poorly automated, while highly automated extraction usually suffers from low accuracy and poor practicality.
Summary of the Invention
One object of the present invention is to provide an information extraction method based on a multistage rule base that completes information search and extraction without manual clustering, significantly improving the degree of automation of the search engine. At the same time, it automatically analyzes and clusters the retrieved web pages, significantly improving the recall of information.
This object is achieved by a technical scheme comprising the following steps:
1) inputting a search keyword and obtaining all web page URL addresses relevant to the keyword;
2) downloading the web pages corresponding to the URL addresses obtained in step 1);
3) preprocessing the web pages downloaded in step 2) to obtain a tree structure diagram of each web page;
4) clustering the web pages according to the tree structure diagrams obtained in step 3), wherein web pages are selected from the pages to be clustered as a training set, and a machine learning method is used to obtain web page templates and define the clustering rules of the web pages;
5) extracting the search results, wherein, according to the input keyword, XPath rules are used to locate nodes and XSLT rules are then used to extract the information;
6) aggregating and displaying the information extracted from the different types of web pages according to the results of step 5).
Further, "relevant" in step 1) means identical or similar to the keyword.
Further, the download method in step 2) is a web crawler.
Further, the web page preprocessing in step 3) obtains the tree structure diagram as follows:
3-1) cleaning the web pages downloaded in step 2): HTML text that does not conform to the standard is converted into text that conforms to the XML standard, and illegal characters and escaping errors are removed;
3-2) performing DOM parsing on the result of step 3-1): the XML-conformant text is parsed into a document object Document;
3-3) displaying the web page structure graphically: the document object Document is displayed graphically as a DOM tree, and the tree structure is used to analyze the web page structure and to extract information from the main nodes.
Further, in step 3-2) the XML-conformant text is parsed with the DOM4j or JDOM toolkit.
Further, the clustering rules in step 4) are generated as follows:
4-1) computing web page similarity: a tree path matching algorithm is used to compute the similarity between web pages, forming a similarity matrix;
4-2) clustering the web pages with a clustering algorithm: the algorithm is agglomerative hierarchical clustering, and the inter-cluster distance is computed with the average-link method, whose input is the similarity matrix formed in step 4-1).
Further, the calculation formulas of steps 4-1) and 4-2) are expressed in terms of the following quantities.
For the web page similarity: $h_i$ denotes the set of all tree paths of web page $i$; $p_{ik}$ is a tree path in $h_i$; $bp(p_{jk})$ denotes the best-matching path of $p_{jk}$ with respect to $h_i$; $sim(h_i, h_j)$ denotes the similarity of the two web pages; $pn(h_i)$ denotes the total number of tree paths of $h_i$, and $pn(h_j)$ denotes the total number of tree paths of $h_j$. The structural similarity of web pages takes values in $[0, 1]$; the closer the value is to 1, the more similar the structures of the two pages.
For the average-link distance: $n_i$ is the number of objects in cluster $c_i$, and $n_j$ is the number of objects in cluster $c_j$.
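For orientation, the textbook definition of average-link distance, written with the cluster symbols defined above, is given below. This is the standard form of the method and is offered as an assumption about what the patent's formula expresses; the point-wise distance $d(p, q)$ is a symbol introduced here, which could for instance be taken as $1 - sim(p, q)$.

```latex
d_{avg}(c_i, c_j) = \frac{1}{n_i \, n_j} \sum_{p \in c_i} \sum_{q \in c_j} d(p, q)
```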
Further, the XSLT rule in step 5) is obtained from a template web page by a rule generation algorithm whose input is the parent node of the information block and whose output is an XSLT rule.
Another object of the present invention is to provide an information extraction device based on a multistage rule base that performs fully automatic information search and analyzes and clusters the retrieved web pages, significantly improving the recall of information.
This object is achieved by a technical scheme in which the device comprises a URL address acquisition module, a web page code acquisition module, a web page preprocessing module, a web page clustering module, a web page information extraction module, an information display module, a clustering rule building module, an information extraction rule building module, a web page clustering rule base and an information extraction rule base.
The URL address acquisition module obtains the URL addresses of relevant web pages according to the search keyword and sends the URL address information to the web page code acquisition module.
The web page code acquisition module downloads the web pages according to the URL address information and sends the downloaded web page information to the web page preprocessing module.
The web page preprocessing module preprocesses the web page information to obtain the web page tree structure diagrams and sends them to the web page clustering module.
The web page clustering module clusters the web pages in the tree structure diagrams according to the information in the web page clustering rule base and sends the clustered web page information to the web page information extraction module. The information in the web page clustering rule base is generated by the clustering rule building module.
The web page information extraction module extracts information from the clustered web pages and sends the extracted information to the information display module. The information extraction rule base supplies the extraction rules to the web page information extraction module, and the rules in the information extraction rule base are generated by the information extraction rule building module.
The information display module displays the information sent by the web page information extraction module.
By adopting the above technical scheme, the present invention has the following advantages:
The information extraction method based on a multistage rule base performs extraction in six steps: 1) obtaining the web page URL addresses; 2) downloading the web pages corresponding to the URL addresses; 3) obtaining the web page tree structure diagrams; 4) clustering the web pages, wherein web pages are selected from the pages to be clustered as a training set and a machine learning method is used to obtain web page templates and define the clustering rules; 5) extracting the search results; 6) aggregating and displaying the information. Generating the tree structure in step 3) and clustering the pages in step 4) effectively improves the recall of the retrieved information. The clustering rules in step 4) are generated automatically by machine learning from the training set, so no manual clustering is needed; this raises the degree of automation of the search and makes large-scale use feasible while recall is maintained. The information extraction device based on the multistage rule base provides the hardware basis for the information extraction process; it is inexpensive and suitable for large-scale use.
Other advantages, objects and features of the present invention will be set forth to some extent in the description that follows, and to some extent will be apparent to those skilled in the art upon examination of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained through the description and claims below.
Brief Description of the Drawings
The drawings of the present invention are described as follows.
Fig. 1 is a schematic flow chart of the information extraction of the present invention;
Fig. 2 is a schematic structural diagram of the device of the present invention.
Embodiments
The invention is further described below with reference to the drawings and embodiments.
An information extraction method based on a multistage rule base comprises the following specific steps:
1) URL address acquisition. A search query is first used to find the web pages relevant to the search keyword, and the URL addresses of those pages are obtained. The URL addresses obtained here include all URLs relevant to the search query; they are a large set of addresses rather than a single address.
2) Page download. Web crawler technology is used to download the code of the relevant web pages from the acquired URL addresses.
3) Web page preprocessing. The acquired web pages are processed to obtain a standard DOM tree. This comprises web page cleaning, DOM parsing and graphical display of the web page structure.
Web page cleaning means repairing the HTML page and converting it into a standard-conformant XML document. Because HTML pages do not strictly follow the XHTML standard, a page may contain illegal characters and escaping errors; cleaning mainly corrects these errors so that parsing does not fail.
DOM parsing means parsing the XML text into a document object Document. For example, the parsing tools DOM4j or JDOM can be used to parse the XML text and obtain the document object.
Graphical display of the web page structure means displaying the document object graphically as a DOM tree; the tree structure is used to analyze the web page structure and to extract information from the main nodes.
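As an illustration of the preprocessing output, the following minimal sketch parses an already-cleaned XHTML string into a tree and lists its root-to-leaf tag paths, which is the form of input a tree path comparison would need. It uses Python's standard `xml.etree.ElementTree` instead of the DOM4j or JDOM toolkits named above, and the function name `tree_paths` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def tree_paths(xhtml: str) -> list[str]:
    """Return every root-to-leaf tag path of a well-formed XHTML document."""
    root = ET.fromstring(xhtml)  # raises ParseError if the page was not cleaned
    paths: list[str] = []

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}"
        children = list(node)
        if not children:          # a leaf node closes one tree path
            paths.append(path)
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


# Hypothetical cleaned page: one content block and one list.
page = (
    "<html><body>"
    "<div><h1>Title</h1><p>Text</p></div>"
    "<ul><li>a</li><li>b</li></ul>"
    "</body></html>"
)
paths = tree_paths(page)
```

Repeated paths such as the two `li` entries are kept on purpose in this sketch, so that list-heavy pages contribute more paths than sparse ones.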
4) Web page clustering. A portion of the web pages to be clustered is selected as a training set, and a machine learning method is used to obtain web page templates and define the clustering rules of the web pages. Specifically:
Choice of similarity calculation method: the average-link method needs a similarity matrix to obtain inter-cluster distances, so the similarity between web pages must be computed first. The present invention uses the tree path matching algorithm, which has lower complexity and takes less time than the tree edit distance algorithm.
Choice of clustering algorithm: the web page clustering algorithm used here is agglomerative hierarchical clustering, with the average-link method as the measure of inter-cluster distance. Clustering terminates when the distance between any two clusters is greater than a given threshold Q.
The similarity formula is expressed in terms of the following quantities: $h_i$ denotes the set of all tree paths of web page $i$; $p_{ik}$ is a tree path in $h_i$; $bp(p_{jk})$ denotes the best-matching path of $p_{jk}$ with respect to $h_i$; $sim(h_i, h_j)$ denotes the similarity of the two web pages; $pn(h_i)$ denotes the total number of tree paths of $h_i$, and $pn(h_j)$ denotes the total number of tree paths of $h_j$.
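The similarity computation can be sketched as follows. The exact formula is not available in this text, so the scoring used here is an assumption: each path is scored against its best-matching path in the other page by shared prefix length, and the scores from both directions are summed and divided by the combined path count, which keeps the result in [0, 1]. The names `path_similarity`, `best_match_score` and `page_similarity` are hypothetical.

```python
def path_similarity(p: str, q: str) -> float:
    """Assumed path score: shared leading tags divided by the longer path length."""
    a, b = p.strip("/").split("/"), q.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))


def best_match_score(p: str, paths: list[str]) -> float:
    """Score of p against its best-matching path (the role of bp) in `paths`."""
    return max((path_similarity(p, q) for q in paths), default=0.0)


def page_similarity(h_i: list[str], h_j: list[str]) -> float:
    """Symmetric similarity of two pages given their tree path sets."""
    if not h_i and not h_j:
        return 1.0
    total = sum(best_match_score(p, h_j) for p in h_i)
    total += sum(best_match_score(p, h_i) for p in h_j)
    return total / (len(h_i) + len(h_j))  # denominator plays the role of pn(h_i) + pn(h_j)


# Hypothetical path sets for three pages.
page_a = ["/html/body/div/p", "/html/body/ul/li"]
page_b = ["/html/body/div/p", "/html/body/ul/li"]
page_c = ["/html/body/div/p", "/html/body/table/tr/td"]

similarity_matrix = [
    [page_similarity(x, y) for y in (page_a, page_b, page_c)]
    for x in (page_a, page_b, page_c)
]
```

Identical pages score 1 under this scheme, and pages that share only the top of their trees score lower, which matches the stated value range and its interpretation.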
The average-link formula is expressed in terms of the following quantities: $n_i$ is the number of objects in cluster $c_i$, and $n_j$ is the number of objects in cluster $c_j$.
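A minimal sketch of agglomerative clustering with average-link distance and the threshold-based stopping condition described above is shown below. It assumes that distance is taken as 1 minus similarity, which the text does not state; the function names, the sample similarity matrix and the threshold value are hypothetical.

```python
def average_link(ci: list[int], cj: list[int], dist: list[list[float]]) -> float:
    """Mean pairwise distance between the members of two clusters."""
    return sum(dist[a][b] for a in ci for b in cj) / (len(ci) * len(cj))


def agglomerative(dist: list[list[float]], q: float) -> list[list[int]]:
    """Merge the closest pair of clusters until every pair is farther apart than q."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_link(clusters[x], clusters[y], dist)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > q:                      # termination: all cluster pairs exceed Q
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical similarity matrix for four pages (two list pages, two detail pages).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
dist = [[1.0 - s for s in row] for row in sim]   # assumed conversion to distance
clusters = agglomerative(dist, q=0.5)
```

The pairwise search makes this cubic in the number of pages, which is acceptable for a sketch; a production version would cache cluster distances between merges.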
5) Web page information extraction. For each type of web page produced by the clustering, a specific information extraction rule is applied to extract the page information.
Obtaining the information extraction rules: the extraction rules are described in XSLT, and XPath is used to locate precisely the position of the information nodes to be extracted in the XHTML document. Because rules defined fully automatically are less accurate, the rules here are obtained with manual intervention. For example, for list-type pages, a template web page that reflects the structural features of this class is chosen first, XPath is used to locate the parent node of the information block in the template page, and a rule generation algorithm then produces the extraction rule. The input of this algorithm is the parent node of the information block, and its output is an XSLT file.
XSLT rules are generated as follows: the extraction rule is obtained from a template web page by a rule generation algorithm, so each type of web page has its own corresponding XSLT rule. The rule generation algorithm is an existing program whose input is the parent node of the information block and whose output is an XSLT rule. A template web page is a page that has the typical structure of a class of web pages and reflects the characteristic features of that class.
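The rule generation program itself is not disclosed, so the following is only a hypothetical generator with the same input and output shape: it takes an XPath to the parent node of the information block plus a mapping of field names to relative XPath expressions, and returns an XSLT 1.0 stylesheet as a string. The function name `generate_xslt_rule`, the output element names `records` and `record`, and the sample field mapping are all assumptions, and the XPath strings are assumed not to contain double quotes.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def generate_xslt_rule(block_parent_xpath: str, fields: dict[str, str]) -> str:
    """Build an XSLT 1.0 stylesheet that emits one <record> per child of the block parent."""
    field_lines = "\n".join(
        f'        <{name}><xsl:value-of select="{rel}"/></{name}>'
        for name, rel in fields.items()
    )
    return f"""<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">
  <xsl:template match="/">
    <records>
      <xsl:for-each select="{block_parent_xpath}/*">
        <record>
{field_lines}
        </record>
      </xsl:for-each>
    </records>
  </xsl:template>
</xsl:stylesheet>"""


# Hypothetical list-type template page: each <li> under the list holds one linked title.
rule = generate_xslt_rule("/html/body/ul", {"title": "a", "link": "a/@href"})

# Parsing confirms only that the generated rule is well-formed XML.
stylesheet_root = ET.fromstring(rule)
```

Applying the generated rule to a page requires an XSLT processor, which the Python standard library does not include; a third-party library such as lxml could be used for that step.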
6) Information display. After information extraction has been completed for the web pages, the information extracted from the different types of pages is aggregated and displayed.
Existing information extraction methods based on web page structure are accurate but relatively poorly automated. The present method aims to improve both the degree of automation and the recall of information extraction while maintaining a certain extraction accuracy. Clustering all the web pages returned by the search query improves the recall of information. Extracting the content of each page type with its own extraction method after clustering improves the degree of automation, and because a specific extraction rule is applied to each specific class of pages, extraction accuracy is also improved to some extent.
An information extraction device based on a multistage rule base comprises a URL address acquisition module, a web page code acquisition module, a web page preprocessing module, a web page clustering module, a web page information extraction module, an information display module, a clustering rule building module, an information extraction rule building module, a web page clustering rule base and an information extraction rule base.
The URL address acquisition module obtains the URL addresses of relevant web pages according to the search keyword and sends the URL address information to the web page code acquisition module.
The web page code acquisition module downloads the web pages according to the URL address information and sends the downloaded web page information to the web page preprocessing module.
The web page preprocessing module preprocesses the web page information to obtain the web page tree structure diagrams and sends them to the web page clustering module.
The web page clustering module clusters the web pages in the tree structure diagrams according to the information in the web page clustering rule base and sends the clustered web page information to the web page information extraction module. The information in the web page clustering rule base is generated by the clustering rule building module.
The web page information extraction module extracts information from the clustered web pages and sends the extracted information to the information display module. The information extraction rule base supplies the extraction rules to the web page information extraction module, and the rules in the information extraction rule base are generated by the information extraction rule building module.
The information display module displays the information sent by the web page information extraction module.
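The data flow between the modules can be sketched as a chain of callables. This is only an illustration of the wiring described above, not the device itself: the class name `ExtractionDevice`, its field names and the stub modules in the usage example are hypothetical, and the two rule bases are assumed to be consulted inside the `cluster` and `extract` callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExtractionDevice:
    get_urls: Callable[[str], list[str]]                # URL address acquisition module
    download: Callable[[str], str]                      # web page code acquisition module
    preprocess: Callable[[str], object]                 # web page preprocessing module
    cluster: Callable[[list[object]], list[list[int]]]  # clustering module (uses clustering rule base)
    extract: Callable[[object, int], list[dict]]        # extraction module (rule chosen per cluster)
    display: Callable[[list[dict]], None]               # information display module

    def run(self, keyword: str) -> list[dict]:
        urls = self.get_urls(keyword)
        trees = [self.preprocess(self.download(u)) for u in urls]
        results: list[dict] = []
        for cluster_id, members in enumerate(self.cluster(trees)):
            for idx in members:
                results.extend(self.extract(trees[idx], cluster_id))
        self.display(results)
        return results


# Usage with stub modules that stand in for the real ones.
shown: list[dict] = []
device = ExtractionDevice(
    get_urls=lambda kw: [f"http://example.invalid/{kw}/{n}" for n in range(2)],
    download=lambda url: f"<html><body>{url}</body></html>",
    preprocess=lambda html: html.upper(),
    cluster=lambda trees: [[0], [1]],
    extract=lambda tree, cid: [{"cluster": cid, "length": len(tree)}],
    display=shown.extend,
)
results = device.run("patent")
```

Passing the modules in as callables keeps the rule building modules outside the run loop, which mirrors the description: the rule bases are populated beforehand and only read during extraction.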
The information extraction device based on a multistage rule base of the present invention provides the hardware basis for the information extraction process; it is inexpensive and suitable for large-scale use.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical scheme of the present invention and not to limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical scheme may be modified or replaced with equivalents without departing from its spirit and scope, and that all such modifications and replacements fall within the scope of the claims of the present invention.
Claims (9)
1. An information extraction method based on a multistage rule base, characterized in that the method comprises the following steps:
1) inputting a search keyword and obtaining all web page URL addresses relevant to the keyword;
2) downloading the web pages corresponding to the URL addresses obtained in step 1);
3) preprocessing the web pages downloaded in step 2) to obtain a tree structure diagram of each web page;
4) clustering the web pages according to the tree structure diagrams obtained in step 3), wherein web pages are selected from the pages to be clustered as a training set, and a machine learning method is used to obtain web page templates and define the clustering rules of the web pages;
5) extracting the search results, wherein, according to the input keyword, XPath rules are used to locate nodes and XSLT rules are then used to extract the information;
6) aggregating and displaying the information extracted from the different types of web pages according to the results of step 5).
2. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that "relevant" in step 1) means identical or similar to the keyword.
3. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that the download method in step 2) is a web crawler.
4. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that the web page preprocessing in step 3) obtains the tree structure diagram as follows:
3-1) cleaning the web pages downloaded in step 2): HTML text that does not conform to the standard is converted into text that conforms to the XML standard, and illegal characters and escaping errors are removed;
3-2) performing DOM parsing on the result of step 3-1): the XML-conformant text is parsed into a document object Document;
3-3) displaying the web page structure graphically: the document object Document is displayed graphically as a DOM tree, and the tree structure is used to analyze the web page structure and to extract information from the main nodes.
5. The information extraction method based on a multistage rule base as claimed in claim 4, characterized in that in step 3-2) the XML-conformant text is parsed with the DOM4j or JDOM toolkit.
6. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that the clustering rules in step 4) are generated as follows:
4-1) computing web page similarity: a tree path matching algorithm is used to compute the similarity between web pages, forming a similarity matrix;
4-2) clustering the web pages with a clustering algorithm: the algorithm is agglomerative hierarchical clustering, and the inter-cluster distance is computed with the average-link method, whose input is the similarity matrix formed in step 4-1).
7. The information extraction method based on a multistage rule base as claimed in claim 6, characterized in that the calculation formulas of steps 4-1) and 4-2) are expressed in terms of the following quantities:
for the web page similarity, $h_i$ denotes the set of all tree paths of web page $i$; $p_{ik}$ is a tree path in $h_i$; $bp(p_{jk})$ denotes the best-matching path of $p_{jk}$ with respect to $h_i$; $sim(h_i, h_j)$ denotes the similarity of the two web pages; $pn(h_i)$ denotes the total number of tree paths of $h_i$, and $pn(h_j)$ denotes the total number of tree paths of $h_j$; the structural similarity of web pages takes values in $[0, 1]$, and the closer the value is to 1, the more similar the structures of the two pages;
for the average-link distance, $n_i$ is the number of objects in cluster $c_i$, and $n_j$ is the number of objects in cluster $c_j$.
8. The information extraction method based on a multistage rule base as claimed in claim 1, characterized in that the XSLT rule in step 5) is obtained from a template web page by a rule generation algorithm whose input is the parent node of the information block and whose output is an XSLT rule.
9. A device for extracting information using the method of any one of claims 1-8, characterized in that the device comprises a URL address acquisition module, a web page code acquisition module, a web page preprocessing module, a web page clustering module, a web page information extraction module, an information display module, a clustering rule building module, an information extraction rule building module, a web page clustering rule base and an information extraction rule base;
the URL address acquisition module obtains the URL addresses of relevant web pages according to the search keyword and sends the URL address information to the web page code acquisition module;
the web page code acquisition module downloads the web pages according to the URL address information and sends the downloaded web page information to the web page preprocessing module;
the web page preprocessing module preprocesses the web page information to obtain the web page tree structure diagrams and sends them to the web page clustering module;
the web page clustering module clusters the web pages in the tree structure diagrams according to the information in the web page clustering rule base and sends the clustered web page information to the web page information extraction module, the information in the web page clustering rule base being generated by the clustering rule building module;
the web page information extraction module extracts information from the clustered web pages and sends the extracted information to the information display module, the information extraction rule base supplying the extraction rules to the web page information extraction module, and the rules in the information extraction rule base being generated by the information extraction rule building module;
the information display module displays the information sent by the web page information extraction module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410227611.XA CN103970898A (en) | 2014-05-27 | 2014-05-27 | Method and device for extracting information based on multistage rule base |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410227611.XA CN103970898A (en) | 2014-05-27 | 2014-05-27 | Method and device for extracting information based on multistage rule base |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103970898A true CN103970898A (en) | 2014-08-06 |
Family
ID=51240396
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410227611.XA Pending CN103970898A (en) | 2014-05-27 | 2014-05-27 | Method and device for extracting information based on multistage rule base |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103970898A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105138546A (en) * | 2015-07-10 | 2015-12-09 | 国家电网公司 | Dom4J based IMS information equipment ledger duplicate elimination method |
CN105589918A (en) * | 2015-09-17 | 2016-05-18 | 广州市动景计算机科技有限公司 | Method and device for extracting page information |
CN106599160A (en) * | 2016-12-08 | 2017-04-26 | 网帅科技(北京)有限公司 | Content rule base management system and encoding method thereof |
WO2017173783A1 (en) * | 2016-04-07 | 2017-10-12 | 中兴通讯股份有限公司 | Method of displaying point of interest, and terminal |
CN107402912A (en) * | 2016-05-19 | 2017-11-28 | 北京京东尚科信息技术有限公司 | Parse semantic method and apparatus |
CN107808000A (en) * | 2017-11-13 | 2018-03-16 | 哈尔滨工业大学(威海) | A kind of hidden web data collection and extraction system and method |
CN109190003A (en) * | 2018-08-20 | 2019-01-11 | 上海蜜度信息技术有限公司 | For determining the method and apparatus of list page node |
CN109344341A (en) * | 2018-10-31 | 2019-02-15 | 长春理工大学 | A kind of Chinese geographical information query method and system |
CN111726336A (en) * | 2020-05-14 | 2020-09-29 | 北京邮电大学 | Method and system for extracting identification information of networked intelligent equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101600A (en) * | 2007-07-10 | 2008-01-09 | 北京大学 | Metadata automatic extraction method based on multiple rule in network search |
CN101727498A (en) * | 2010-01-15 | 2010-06-09 | 西安交通大学 | Automatic extraction method of web page information based on WEB structure |
US20110173197A1 (en) * | 2010-01-12 | 2011-07-14 | Yahoo! Inc. | Methods and apparatuses for clustering electronic documents based on structural features and static content features |
CN102289445A (en) * | 2011-06-01 | 2011-12-21 | 宇龙计算机通信科技(深圳)有限公司 | Method and device for analyzing XML (Extensible Markup Language) file and terminal |
CN102651002A (en) * | 2011-02-28 | 2012-08-29 | 腾讯科技(深圳)有限公司 | Webpage information extracting method and system |
-
2014
- 2014-05-27 CN CN201410227611.XA patent/CN103970898A/en active Pending
Non-Patent Citations (1)
Title |
---|
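Qiu Taofen: "Research on Web Information Extraction Technology Based on Clustering Algorithms", China Master's Theses Full-text Database, Information Science and Technology Series * |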
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105138546A (en) * | 2015-07-10 | 2015-12-09 | State Grid Corporation of China | Dom4J based IMS information equipment ledger duplicate elimination method |
CN105138546B (en) * | 2015-07-10 | 2018-11-06 | State Grid Corporation of China | Dom4J based IMS information equipment ledger duplicate elimination method |
CN105589918A (en) * | 2015-09-17 | 2016-05-18 | Guangzhou Dongjing Computer Technology Co., Ltd. | Method and device for extracting page information |
CN105589918B (en) * | 2015-09-17 | 2017-04-05 | Guangzhou Dongjing Computer Technology Co., Ltd. | Method and device for extracting page information |
WO2017173783A1 (en) * | 2016-04-07 | 2017-10-12 | ZTE Corporation | Method of displaying point of interest, and terminal |
CN107402912B (en) * | 2016-05-19 | 2019-12-31 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Method and device for analyzing semantics |
CN107402912A (en) * | 2016-05-19 | 2017-11-28 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Method and apparatus for parsing semantics |
CN106599160A (en) * | 2016-12-08 | 2017-04-26 | Wangshuai Technology (Beijing) Co., Ltd. | Content rule base management system and encoding method thereof |
CN106599160B (en) * | 2016-12-08 | 2020-06-02 | Wangshuai Technology (Beijing) Co., Ltd. | Content rule library management system and coding method thereof |
CN107808000A (en) * | 2017-11-13 | 2018-03-16 | Harbin Institute of Technology (Weihai) | Hidden web data collection and extraction system and method |
CN109190003A (en) * | 2018-08-20 | 2019-01-11 | Shanghai Midu Information Technology Co., Ltd. | Method and apparatus for determining list page nodes |
CN109190003B (en) * | 2018-08-20 | 2021-03-02 | Shanghai Midu Information Technology Co., Ltd. | Method and apparatus for determining list page nodes |
CN109344341A (en) * | 2018-10-31 | 2019-02-15 | Changchun University of Science and Technology | Chinese geographical information query method and system |
CN111726336A (en) * | 2020-05-14 | 2020-09-29 | Beijing University of Posts and Telecommunications | Method and system for extracting identification information of networked intelligent equipment |
CN111726336B (en) * | 2020-05-14 | 2021-10-29 | Beijing University of Posts and Telecommunications | Method and system for extracting identification information of networked intelligent equipment |
Similar Documents
Publication | Title |
---|---|
CN103970898A (en) | Method and device for extracting information based on multistage rule base |
CN108932294B (en) | Resume data processing method, device, equipment and storage medium based on index |
CN111783394B (en) | Training method of event extraction model, event extraction method, system and equipment |
CN103023714B (en) | Network topic-based activity and cluster structure analysis system and method |
CN103246732B (en) | Method and system for extracting online Web news content |
CN102063488A (en) | Code searching method based on semantics |
CN104572072B (en) | Language conversion method and device for programs based on the MVC pattern |
CN104615724A (en) | Establishing method of knowledge base and information search method and device based on knowledge base |
US20110314001A1 | Performing query expansion based upon statistical analysis of structured data |
CN109657068A (en) | Cultural relic knowledge graph generation and visualization method for smart museums |
CN105677857B (en) | Method and device for accurately matching keywords with marketing landing pages |
CN108416034B (en) | Information acquisition system based on financial heterogeneous big data and control method thereof |
CN102567409A (en) | Method and device for providing retrieval associated word |
CN104133855A (en) | Smart association method and device for input method |
CN103530429A (en) | Webpage content extracting method |
CN103399862A (en) | Method and equipment for confirming searching guide information corresponding to target query sequences |
CN107330111A (en) | Retrieval method and device for domain ontology based on a common-version ontology |
CN104391969A (en) | User query statement syntactic structure determining method and device |
CN103559202B (en) | Webpage content extraction apparatus and method |
CN104317845A (en) | Method and system for automatic extraction of deep web data |
CN102004805B (en) | Webpage denoising system and method based on maximum similarity matching |
CN113254671B (en) | Atlas optimization method, device, equipment and medium based on query analysis |
CN114117242A (en) | Data query method and device, computer equipment and storage medium |
US20120284224A1 (en) | Build of website knowledge tables |
CN103377207B (en) | Microblog user relationship acquisition method based on script engine |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20140806 |