CN103970898A - Method and device for extracting information based on multistage rule base - Google Patents

Publication number
CN103970898A
Authority
CN
China
Prior art keywords
webpage
information
rule
module
tree
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410227611.XA
Other languages
Chinese (zh)
Inventor
张可
柴毅
马号
刘建环
田甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-05-27
Filing date
2014-05-27
Publication date
2014-08-06
Application filed by Chongqing University
Priority to CN201410227611.XA
Publication of CN103970898A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Abstract

A method for extracting information based on a multistage rule base comprises the following steps: (1) the URLs of web pages are obtained; (2) the web pages corresponding to the URLs are downloaded; (3) a web page tree structure diagram is obtained; (4) the web pages are clustered: pages are selected from those to be clustered as a training set, and the clustering rules of the web pages are defined by a machine learning method; (5) the search results are extracted; (6) the information is summarized and displayed. Obtaining the tree structure diagram in step (3) and clustering the pages in step (4) effectively improves the recall of the retrieved information. The clustering rules are generated automatically by machine learning from the training set, so no manual clustering is needed; this effectively raises the degree of automation of the search and, with recall assured, makes the method suitable for wide use. The device for extracting information based on the multistage rule base provides the hardware basis for the information extraction process; it is inexpensive and suitable for large-scale use.

Description

Information extraction method and device based on a multistage rule base
Technical field
The present invention relates to the technical field of computer search engines, and in particular to an information extraction method and device.
Background art
With the spread and application of computers and networks, the whole world has entered the era of big information, in which information search engines have become an indispensable key technology. Current information search engines use the following four kinds of information search methods:
1. Information extraction based on HTML structure. This technique extracts information according to the structural features of HTML: through the tree structure of the DOM model, extracting information from a web page becomes equivalent to extracting node information from a tree. Shortcoming: when the page changes too much, information can no longer be extracted.
2. Web information extraction based on natural language. This technique ignores the page structure and the markup tags, and analyzes the text of the page only according to the relations that exist within the natural language itself. Shortcoming: extraction is slow, and when a multi-subject Web document is processed, extraction easily fails if the subjects are not first divided into blocks.
3. Information extraction based on ontology. An ontology consists of the concepts, attributes, relations, constraints and terms of a domain. This technique mainly uses the ontology's description of the data in that domain and, without considering the structure of the Web page, extracts information only according to the semantic features of the data. Shortcoming: although the method is flexible and adaptable, its degree of automation is low.
4. Information extraction based on wrapper learning. After a professional Internet developer analyzes the structure of a website, the wrapper program is written by hand, and the resulting wrapper works for only one class of web pages. Shortcoming: a large number of pages means a large number of structures to analyze, and many websites have complex structures; even for professionals, writing each wrapper takes a great deal of time, and much effort goes into analyzing site structure and debugging the program.
Summarizing the four approaches above: methods that depend little on the HTML document structure are highly automated, but they cannot handle pages with complex structure, their extraction accuracy is low, and their practicality is poor; methods that depend heavily on the HTML document structure can handle pages with complex structure, but their degree of automation is low. In short, extraction that relies on manual participation is accurate but poorly automated, while highly automated extraction usually suffers from low accuracy and poor practicality.
Summary of the invention
One object of the present invention is to provide an information extraction method based on a multistage rule base that can search for and extract information without manual clustering, which significantly raises the degree of automation of the search engine; it also automatically analyzes and clusters the web pages found, which significantly improves the recall of information.
This object is achieved by a technical solution comprising the following steps:
1) input a search keyword and obtain the URLs of all web pages related to the keyword;
2) download the web pages corresponding to the URLs obtained in step 1);
3) pre-process the web pages downloaded in step 2) to obtain a web page tree structure diagram;
4) cluster the web pages according to the tree structure diagram obtained in step 3): select pages from the pages to be clustered as a training set, obtain web page templates by a machine learning method, and define the clustering rules of the web pages;
5) extract the search results: according to the input keyword, locate nodes with XPath rules, then extract information with XSLT rules;
6) summarize and display the information extracted from the different types of web pages in step 5).
Further, "related" in step 1) means identical or similar to the keyword.
Further, the download method in step 2) is web crawler download.
Further, the web page pre-processing in step 3), which obtains the web page tree structure diagram, specifically comprises:
3-1) clean the web pages downloaded in step 2): convert HTML text that does not conform to the standard into text that conforms to the XML standard, and remove illegal characters and escaping errors;
3-2) perform DOM parsing on the result of step 3-1), parsing the XML-conformant text into a document object (Document);
3-3) display the page structure graphically: render the document object (Document) as a DOM tree, and use the tree structure to analyze the page structure and extract main-node information.
Further, in step 3-2) the XML-conformant text is parsed with the DOM4j or jdom toolkit.
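Steps 3-1) and 3-2) can be illustrated with a minimal sketch. It is not the patent's cleaning procedure: it assumes Python's standard `html.parser` and `xml.etree.ElementTree` in place of DOM4j or jdom, and the names `TreeBuilder`, `clean_html`, `VOID_TAGS` and the synthetic `document` root are hypothetical. It strips control characters that XML forbids, decodes character references, and closes tags that the page left open, so that the result serializes as well-formed XML.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never take an end tag in HTML (a partial list, assumed for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Control characters that are not allowed in XML 1.0 documents.
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class TreeBuilder(HTMLParser):
    """Builds a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root, an assumption of this sketch
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close every element left open above the matching one; ignore stray end tags.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        node = self.stack[-1]
        children = list(node)
        if children:  # text that follows a child element belongs to that child's tail
            children[-1].tail = (children[-1].tail or "") + text
        else:
            node.text = (node.text or "") + text


def clean_html(html):
    """Return the root of a well-formed tree for a possibly malformed HTML string."""
    builder = TreeBuilder()
    builder.feed(_ILLEGAL.sub("", html))
    builder.close()
    return builder.root


if __name__ == "__main__":
    tree = clean_html("<div><p>Hello &amp; bye<br><b>bold</div><p>tail")
    print(ET.tostring(tree, encoding="unicode"))
```

Unlike a browser, this sketch does not apply HTML's implicit-closing rules (for example, a second `<li>` nests inside the first), so a production cleaner would need a fuller repair step.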
Further, the clustering rules in step 4) are generated as follows:
4-1) compute web page similarity: a tree path matching algorithm computes the similarity between web pages, forming a similarity matrix;
4-2) cluster the web pages with a clustering algorithm: the clustering algorithm is agglomerative hierarchical clustering, the inter-cluster distance in it is measured by the average-linkage method, and the input of the average-linkage method is the similarity matrix formed in step 4-1).
Further, the specific formulas for step 4-1) and step 4-2) are:
$$\mathrm{sim}(h_i, h_j) = \left( \frac{\sum_{k=1}^{pn(h_i)} \mathrm{sim}\big(p_{ik},\, bp(p_{ik})\big)}{pn(h_i)} + \frac{\sum_{k=1}^{pn(h_j)} \mathrm{sim}\big(p_{jk},\, bp(p_{jk})\big)}{pn(h_j)} \right) \div 2$$
where $h_i$ denotes the set of all tree paths of a web page, $p_{ik}$ is a tree path in $h_i$, $bp(p_{jk})$ denotes the best matching path of $p_{jk}$ with respect to $h_i$, $\mathrm{sim}(h_i, h_j)$ denotes the similarity of the two web pages, $pn(h_i)$ denotes the total number of tree paths of $h_i$, and $pn(h_j)$ denotes the total number of tree paths of $h_j$. The web page structure similarity takes values in [0, 1]; the closer the value is to 1, the more similar the structures of the two web pages.
$$d_{avg}(c_i, c_j) = \frac{1}{n_i n_j} \sum_{p \in c_i} \sum_{p' \in c_j} |p - p'|$$
where $n_i$ is the number of objects in cluster $c_i$ and $n_j$ is the number of objects in cluster $c_j$.
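The page similarity formula can be made concrete with a short sketch. The text does not define the path-level similarity sim(p, q), so the function `path_sim` below is an assumption made only for illustration: it counts the tags two root-to-leaf paths share from the root and divides by the longer path's length. All function names are hypothetical. The best matching path bp(p) is taken to be the path on the other page that maximizes this score.

```python
def path_sim(p, q):
    """Assumed path similarity: shared leading tags divided by the longer length."""
    common = 0
    for a, b in zip(p, q):
        if a != b:
            break
        common += 1
    return common / max(len(p), len(q))


def best_match_score(p, other_paths):
    """sim(p, bp(p)): the score of p against its best matching path on the other page."""
    return max(path_sim(p, q) for q in other_paths)


def page_sim(h_i, h_j):
    """Average the two directed best-match means, as in the similarity formula."""
    s_i = sum(best_match_score(p, h_j) for p in h_i) / len(h_i)
    s_j = sum(best_match_score(p, h_i) for p in h_j) / len(h_j)
    return (s_i + s_j) / 2


if __name__ == "__main__":
    page_a = [("html", "body", "div", "p"), ("html", "body", "ul", "li")]
    page_b = [("html", "body", "div", "span")]
    print(page_sim(page_a, page_a))  # identical path sets give 1.0
    print(page_sim(page_a, page_b))
```

Because each directed term is a mean of values in [0, 1], the result also stays in [0, 1], consistent with the stated range.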
Further, the XSLT rules in step 5) are obtained from a template web page by a rule generation algorithm, whose input is the parent node of the information block and whose output is an XSLT rule.
Another object of the present invention is to provide an information extraction device based on a multistage rule base that searches for information fully automatically and analyzes and clusters the web pages found, which significantly improves the recall of information.
This object is achieved by a technical solution comprising a URL acquisition module, a web page code acquisition module, a web page pre-processing module, a web page clustering module, a web page information extraction module, an information display module, a clustering rule building module, an information extraction rule building module, a web page clustering rule base, and an information extraction rule base;
The URL acquisition module obtains the URLs of related web pages according to the search keyword and sends the URL information to the web page code acquisition module;
The web page code acquisition module downloads web pages according to the URL information and sends the downloaded page information to the web page pre-processing module;
The web page pre-processing module pre-processes the page information to obtain a web page tree structure diagram and sends it to the web page clustering module;
The web page clustering module clusters the web pages in the tree structure diagram according to the information in the web page clustering rule base and sends the clustered page information to the web page information extraction module; the information in the web page clustering rule base is generated by the clustering rule building module;
The web page information extraction module extracts information from the clustered page information and sends the extracted information to the information display module; the information extraction rule base supplies the extraction rules to the web page information extraction module, and those rules are generated by the information extraction rule building module;
The information display module displays the information sent by the web page information extraction module.
By adopting the above technical solution, the present invention has the following advantages:
The information extraction method based on a multistage rule base of the present invention extracts information in six steps: 1) obtain the web page URLs; 2) download the web pages corresponding to the URLs; 3) obtain the web page tree structure diagram; 4) cluster the web pages: select pages from the pages to be clustered as a training set, obtain web page templates by a machine learning method, and define the clustering rules of the web pages; 5) extract the search results; 6) summarize and display the information. Generating the web page tree in step 3) and clustering the pages in step 4) effectively improves the recall of the retrieved information. The clustering rules in step 4) are generated automatically by machine learning from a training set, so no manual clustering is needed; this effectively raises the degree of automation of the search and, with recall assured, makes the method suitable for wide use. The information extraction device based on a multistage rule base of the present invention provides the hardware basis for the information extraction process; it is inexpensive and suitable for large-scale use.
Other advantages, objects and features of the present invention will be set forth to some extent in the description that follows, and to some extent will be apparent to those skilled in the art upon examination of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the following description and claims.
Brief description of the drawings
The drawings of the present invention are described as follows.
Fig. 1 is a schematic flow chart of the information extraction of the present invention;
Fig. 2 is a schematic structural diagram of the device of the present invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
An information extraction method based on a multistage rule base comprises the following steps:
1) URL acquisition. First, web pages related to the search keyword are searched for by means of a query sequence, and the URLs of those pages are obtained. The URLs obtained here include all URLs related to the query sequence: a large number of addresses, not a single one.
2) Page download. The code of the related web pages is downloaded from the acquired URLs using web crawler technology.
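The download step can be sketched as follows. The text only says that crawler technology is used, so this is an illustrative assumption built on Python's standard `urllib`; the function names `fetch` and `download_pages` and the `fetcher` parameter are hypothetical. The sketch skips duplicate URLs and pages that cannot be reached.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and decode it using the charset declared in the response."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def download_pages(urls, fetcher=fetch):
    """Return {url: page source} for every URL that could be downloaded."""
    pages = {}
    for url in urls:
        if url in pages:  # the URL list may contain duplicates
            continue
        try:
            pages[url] = fetcher(url)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            continue
    return pages
```

Passing the fetch function as a parameter keeps the loop testable without network access; a real crawler would also need politeness delays and robots.txt handling, which are outside this sketch.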
3) Web page pre-processing. The acquired web pages are processed to obtain a standard DOM tree. This comprises web page cleaning, DOM parsing, and graphical display of the page structure.
Web page cleaning means repairing an HTML page and converting it into a standard-conformant XML document. Because HTML does not strictly follow the XHTML standard, a page may contain illegal characters and escaping errors; cleaning mainly corrects these errors so that parsing does not fail.
DOM parsing means parsing the XML-format text into a document object (Document); for example, the parsing tool DOM4j or jdom can be used to parse the XML-format text and obtain the document object.
Graphical display of the page structure means rendering the document object graphically as a DOM tree, and using the tree structure to analyze the page structure and extract main-node information.
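As an illustration of DOM parsing and tree display, the sketch below parses already cleaned XML text and lists the tree in two forms: an indented outline and the set of root-to-leaf tag paths that a path-based similarity measure would consume. It uses Python's `xml.etree.ElementTree` as a stand-in for DOM4j or jdom, and the function names `render` and `tree_paths` are hypothetical.

```python
import xml.etree.ElementTree as ET


def render(node, depth=0):
    """Return the tree as indented lines, one tag per line."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(render(child, depth + 1))
    return lines


def tree_paths(root):
    """Return every root-to-leaf path as a tuple of tag names."""
    paths = []

    def walk(node, prefix):
        current = prefix + (node.tag,)
        children = list(node)
        if not children:  # a leaf element ends one path
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths


if __name__ == "__main__":
    doc = ET.fromstring(
        "<html><body><div><p>a</p><p>b</p></div><ul><li>x</li></ul></body></html>"
    )
    print("\n".join(render(doc)))
    print(tree_paths(doc))
```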
4) Web page clustering. A portion of the pages to be clustered is selected as a training set, web page templates are obtained by a machine learning method, and the clustering rules of the web pages are defined. Specifically:
Choice of similarity measure: the average-linkage method needs a similarity matrix to obtain inter-cluster distances, so the similarity between web pages must be computed first. The present invention uses a tree path matching algorithm, which has lower complexity and takes less time than the tree edit distance algorithm.
Choice of clustering algorithm: the web pages are clustered with an agglomerative hierarchical clustering algorithm, the inter-cluster distance is measured by the average-linkage method, and clustering terminates when the distance between any two clusters is greater than a given threshold Q.
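The clustering loop described here can be sketched in a few lines. This is an illustrative assumption, not the patent's program: the input is taken to be a pairwise distance matrix (for instance 1 minus the page similarity, a conversion the text does not spell out), and the names `average_link`, `agglomerative` and `q_threshold` are hypothetical. Clusters are merged pairwise by smallest average-linkage distance until every remaining pair is farther apart than Q.

```python
def average_link(c_i, c_j, dist):
    """Average pairwise distance between the members of two clusters."""
    total = sum(dist[p][q] for p in c_i for q in c_j)
    return total / (len(c_i) * len(c_j))


def agglomerative(dist, q_threshold):
    """Merge the closest clusters until all inter-cluster distances exceed Q."""
    clusters = [[i] for i in range(len(dist))]  # start with one page per cluster
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_link(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > q_threshold:  # termination condition: every pair is farther than Q
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    # Hypothetical distances between four pages: 0 and 1 are close, 2 and 3 are close.
    dist = [
        [0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.2],
        [0.9, 0.9, 0.2, 0.0],
    ]
    print(agglomerative(dist, q_threshold=0.5))
```

The search for the closest pair is recomputed on every pass, which is simple but slow for large page sets; caching inter-cluster distances would be the usual refinement.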
The similarity formula is as follows:
$$\mathrm{sim}(h_i, h_j) = \left( \frac{\sum_{k=1}^{pn(h_i)} \mathrm{sim}\big(p_{ik},\, bp(p_{ik})\big)}{pn(h_i)} + \frac{\sum_{k=1}^{pn(h_j)} \mathrm{sim}\big(p_{jk},\, bp(p_{jk})\big)}{pn(h_j)} \right) \div 2$$
where $h_i$ denotes the set of all tree paths of a web page, $p_{ik}$ is a tree path in $h_i$, $bp(p_{jk})$ denotes the best matching path of $p_{jk}$ with respect to $h_i$, $\mathrm{sim}(h_i, h_j)$ denotes the similarity of the two web pages, $pn(h_i)$ denotes the total number of tree paths of $h_i$, and $pn(h_j)$ denotes the total number of tree paths of $h_j$.
The average-linkage formula is as follows:
$$d_{avg}(c_i, c_j) = \frac{1}{n_i n_j} \sum_{p \in c_i} \sum_{p' \in c_j} |p - p'|$$
where $n_i$ is the number of objects in cluster $c_i$ and $n_j$ is the number of objects in cluster $c_j$.
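A small worked instance may make the average-linkage formula easier to read. The clusters and distances below are hypothetical values chosen only for illustration: cluster $c_i$ holds pages $a$ and $b$, cluster $c_j$ holds pages $c$ and $d$, so $n_i = n_j = 2$.

```latex
\begin{aligned}
|a - c| &= 0.2, \quad |a - d| = 0.4, \quad |b - c| = 0.6, \quad |b - d| = 0.8 \\
d_{avg}(c_i, c_j) &= \frac{1}{2 \cdot 2}\,(0.2 + 0.4 + 0.6 + 0.8) = \frac{2.0}{4} = 0.5
\end{aligned}
```

With an assumed threshold of Q = 0.45, this distance exceeds Q, so the two clusters would not be merged.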
5) Web page information extraction. For the different types of web pages produced by clustering, type-specific information extraction rules are applied to extract the page information.
Obtaining the information extraction rules: the rules are described in XSLT, and XPath is used to locate precisely the position of the information nodes to be extracted in the XHTML document. Because rules defined automatically are less accurate, the rules here are obtained with manual intervention. For example, for list-type web pages, a template page that reflects the structural features of this type is selected first, XPath is used to locate the parent node of the information block in the template page, and a rule extraction algorithm then yields the information extraction rule. The input of this algorithm is the parent node of the information block, and its output is an XSLT file.
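To illustrate how XPath locates the parent node of an information block on a list-type page, the sketch below uses the XPath subset built into Python's `xml.etree.ElementTree` rather than a full XPath or XSLT engine. The sample page, the `results` id, the record fields and the function name `extract_list` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical, already cleaned list-type page.
PAGE = """<html><body>
<ul id="results">
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
</ul>
</body></html>"""


def extract_list(page_xml, parent_xpath):
    """Locate the block's parent node, then read one record per child item."""
    root = ET.fromstring(page_xml)
    parent = root.find(parent_xpath)
    if parent is None:  # the page does not match this rule
        return []
    records = []
    for item in parent:
        link = item.find("a")
        if link is not None:
            records.append({"title": link.text, "url": link.get("href")})
    return records


if __name__ == "__main__":
    print(extract_list(PAGE, ".//ul[@id='results']"))
```

Returning an empty list when the parent node is missing lets a caller try the rule of another page type instead of failing.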
XSLT rules are generated as follows: an extraction rule is obtained from a template web page by a rule generation algorithm, so each type of web page has its own XSLT rule. The rule generation algorithm is an existing program whose input is the parent node of the information block and whose output is an XSLT rule. A template web page is a page with the typical structure of its type, one that reflects the characteristic features of that type of page.
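The text does not disclose the rule generation algorithm itself, so the following is only a guess at its general shape: a function that takes the XPath of the information block's parent node plus a mapping of field names to relative XPath expressions, and emits an XSLT 1.0 stylesheet as text. The function name `generate_xslt_rule`, the `records`/`record` output elements and the field mapping are hypothetical, and field names are assumed to be valid XML element names.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def generate_xslt_rule(parent_xpath, fields):
    """Build an XSLT stylesheet that emits one <record> per child of the parent node."""
    lines = [
        f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">',
        '  <xsl:output method="xml" indent="yes"/>',
        '  <xsl:template match="/">',
        "    <records>",
        f'      <xsl:for-each select={quoteattr(parent_xpath + "/*")}>',
        "        <record>",
    ]
    for name, relative_xpath in fields.items():
        lines.append(
            f"          <{name}><xsl:value-of select={quoteattr(relative_xpath)}/></{name}>"
        )
    lines += [
        "        </record>",
        "      </xsl:for-each>",
        "    </records>",
        "  </xsl:template>",
        "</xsl:stylesheet>",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    rule = generate_xslt_rule("//ul[@id='results']", {"title": "a", "link": "a/@href"})
    ET.fromstring(rule)  # raises ParseError if the generated rule is not well-formed
    print(rule)
```

Python's standard library has no XSLT processor, so applying the generated rule would need an external engine; the sketch only checks that the stylesheet is well-formed XML.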
6) Information display. After information has been extracted from the web pages, the information extracted from the different types of pages is summarized and displayed.
Existing information extraction methods based on page structure are accurate but relatively poorly automated. This method aims to raise the degree of automation and the recall of information extraction while maintaining a certain extraction accuracy. It clusters all web pages returned by the query sequence, which improves the recall of information. It extracts content from the different types of clustered pages with different extraction rules, which raises the degree of automation; and because a specific extraction rule is applied to each specific type of page, extraction accuracy also improves to some extent.
An information extraction device based on a multistage rule base comprises a URL acquisition module, a web page code acquisition module, a web page pre-processing module, a web page clustering module, a web page information extraction module, an information display module, a clustering rule building module, an information extraction rule building module, a web page clustering rule base, and an information extraction rule base;
The URL acquisition module obtains the URLs of related web pages according to the search keyword and sends the URL information to the web page code acquisition module;
The web page code acquisition module downloads web pages according to the URL information and sends the downloaded page information to the web page pre-processing module;
The web page pre-processing module pre-processes the page information to obtain a web page tree structure diagram and sends it to the web page clustering module;
The web page clustering module clusters the web pages in the tree structure diagram according to the information in the web page clustering rule base and sends the clustered page information to the web page information extraction module; the information in the web page clustering rule base is generated by the clustering rule building module;
The web page information extraction module extracts information from the clustered page information and sends the extracted information to the information display module; the information extraction rule base supplies the extraction rules to the web page information extraction module, and those rules are generated by the information extraction rule building module;
The information display module displays the information sent by the web page information extraction module.
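The data flow between the modules can be summarized in a short sketch. It is a software analogy under stated assumptions, not the device itself: each module is modeled as a plain callable, the clustering rule base is represented by a `classify` function that returns a page-type label, and the extraction rule base by a dictionary from label to rule. The class name `Extractor` and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Extractor:
    get_urls: Callable[[str], List[str]]               # URL acquisition module
    download: Callable[[str], str]                     # web page code acquisition module
    preprocess: Callable[[str], object]                # web page pre-processing module
    classify: Callable[[object], str]                  # clustering module plus its rule base
    rules: Dict[str, Callable[[object], List[dict]]]   # information extraction rule base

    def run(self, keyword):
        """Return extracted records grouped by page type, ready for display."""
        results = {}
        for url in self.get_urls(keyword):
            tree = self.preprocess(self.download(url))
            label = self.classify(tree)
            rule = self.rules.get(label)
            if rule is None:  # no extraction rule exists for this page type
                continue
            results.setdefault(label, []).extend(rule(tree))
        return results
```

Keeping the rules in a dictionary mirrors the separation between the extraction module and its rule base: new page types are supported by adding an entry rather than changing the pipeline.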
The information extraction device based on a multistage rule base of the present invention provides the hardware basis for the information extraction process; it is inexpensive and suitable for large-scale use.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or equivalently replaced without departing from its spirit and scope, and all such modifications fall within the scope of the claims of the present invention.

Claims (9)

1. An information extraction method based on a multistage rule base, characterized in that the method comprises the following steps:
1) input a search keyword and obtain the URLs of all web pages related to the keyword;
2) download the web pages corresponding to the URLs obtained in step 1);
3) pre-process the web pages downloaded in step 2) to obtain a web page tree structure diagram;
4) cluster the web pages according to the tree structure diagram obtained in step 3): select pages from the pages to be clustered as a training set, obtain web page templates by a machine learning method, and define the clustering rules of the web pages;
5) extract the search results: according to the input keyword, locate nodes with XPath rules, then extract information with XSLT rules;
6) summarize and display the information extracted from the different types of web pages in step 5).
2. The information extraction method based on a multistage rule base according to claim 1, characterized in that "related" in step 1) means identical or similar to the keyword.
3. The information extraction method based on a multistage rule base according to claim 1, characterized in that the download method in step 2) is web crawler download.
4. The information extraction method based on a multistage rule base according to claim 1, characterized in that the web page pre-processing in step 3), which obtains the web page tree structure diagram, specifically comprises:
3-1) clean the web pages downloaded in step 2): convert HTML text that does not conform to the standard into text that conforms to the XML standard, and remove illegal characters and escaping errors;
3-2) perform DOM parsing on the result of step 3-1), parsing the XML-conformant text into a document object (Document);
3-3) display the page structure graphically: render the document object (Document) as a DOM tree, and use the tree structure to analyze the page structure and extract main-node information.
5. The information extraction method based on a multistage rule base according to claim 4, characterized in that in step 3-2) the XML-conformant text is parsed with the DOM4j or jdom toolkit.
6. The information extraction method based on a multistage rule base according to claim 1, characterized in that the clustering rules in step 4) are generated as follows:
4-1) compute web page similarity: a tree path matching algorithm computes the similarity between web pages, forming a similarity matrix;
4-2) cluster the web pages with a clustering algorithm: the clustering algorithm is agglomerative hierarchical clustering, the inter-cluster distance in it is measured by the average-linkage method, and the input of the average-linkage method is the similarity matrix formed in step 4-1).
7. The information extraction method based on a multistage rule base according to claim 6, characterized in that the specific formulas for step 4-1) and step 4-2) are:
$$\mathrm{sim}(h_i, h_j) = \left( \frac{\sum_{k=1}^{pn(h_i)} \mathrm{sim}\big(p_{ik},\, bp(p_{ik})\big)}{pn(h_i)} + \frac{\sum_{k=1}^{pn(h_j)} \mathrm{sim}\big(p_{jk},\, bp(p_{jk})\big)}{pn(h_j)} \right) \div 2$$
where $h_i$ denotes the set of all tree paths of a web page, $p_{ik}$ is a tree path in $h_i$, $bp(p_{jk})$ denotes the best matching path of $p_{jk}$ with respect to $h_i$, $\mathrm{sim}(h_i, h_j)$ denotes the similarity of the two web pages, $pn(h_i)$ denotes the total number of tree paths of $h_i$, and $pn(h_j)$ denotes the total number of tree paths of $h_j$. The web page structure similarity takes values in [0, 1]; the closer the value is to 1, the more similar the structures of the two web pages;
$$d_{avg}(c_i, c_j) = \frac{1}{n_i n_j} \sum_{p \in c_i} \sum_{p' \in c_j} |p - p'|$$
where $n_i$ is the number of objects in cluster $c_i$ and $n_j$ is the number of objects in cluster $c_j$.
8. The information extraction method based on a multistage rule base according to claim 1, characterized in that the XSLT rules in step 5) are obtained from a template web page by a rule generation algorithm, whose input is the parent node of the information block and whose output is an XSLT rule.
9. A device for extracting information by the method according to any one of claims 1-8, characterized in that the device comprises a URL acquisition module, a web page code acquisition module, a web page pre-processing module, a web page clustering module, a web page information extraction module, an information display module, a clustering rule building module, an information extraction rule building module, a web page clustering rule base, and an information extraction rule base;
The URL acquisition module obtains the URLs of related web pages according to the search keyword and sends the URL information to the web page code acquisition module;
The web page code acquisition module downloads web pages according to the URL information and sends the downloaded page information to the web page pre-processing module;
The web page pre-processing module pre-processes the page information to obtain a web page tree structure diagram and sends it to the web page clustering module;
The web page clustering module clusters the web pages in the tree structure diagram according to the information in the web page clustering rule base and sends the clustered page information to the web page information extraction module; the information in the web page clustering rule base is generated by the clustering rule building module;
The web page information extraction module extracts information from the clustered page information and sends the extracted information to the information display module; the information extraction rule base supplies the extraction rules to the web page information extraction module, and those rules are generated by the information extraction rule building module;
The information display module displays the information sent by the web page information extraction module.
CN201410227611.XA 2014-05-27 2014-05-27 Method and device for extracting information based on multistage rule base Pending CN103970898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410227611.XA CN103970898A (en) 2014-05-27 2014-05-27 Method and device for extracting information based on multistage rule base

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410227611.XA CN103970898A (en) 2014-05-27 2014-05-27 Method and device for extracting information based on multistage rule base

Publications (1)

Publication Number Publication Date
CN103970898A true CN103970898A (en) 2014-08-06

Family

ID=51240396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410227611.XA Pending CN103970898A (en) 2014-05-27 2014-05-27 Method and device for extracting information based on multistage rule base

Country Status (1)

Country Link
CN (1) CN103970898A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138546A (en) * 2015-07-10 2015-12-09 国家电网公司 Dom4J based IMS information equipment ledger duplicate elimination method
CN105589918A (en) * 2015-09-17 2016-05-18 广州市动景计算机科技有限公司 Method and device for extracting page information
CN106599160A (en) * 2016-12-08 2017-04-26 网帅科技(北京)有限公司 Content rule base management system and encoding method thereof
WO2017173783A1 (en) * 2016-04-07 2017-10-12 中兴通讯股份有限公司 Method of displaying point of interest, and terminal
CN107402912A (en) * 2016-05-19 2017-11-28 北京京东尚科信息技术有限公司 Parse semantic method and apparatus
CN107808000A (en) * 2017-11-13 2018-03-16 哈尔滨工业大学(威海) A kind of hidden web data collection and extraction system and method
CN109190003A (en) * 2018-08-20 2019-01-11 上海蜜度信息技术有限公司 For determining the method and apparatus of list page node
CN109344341A (en) * 2018-10-31 2019-02-15 长春理工大学 A kind of Chinese geographical information query method and system
CN111726336A (en) * 2020-05-14 2020-09-29 北京邮电大学 Method and system for extracting identification information of networked intelligent equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101600A (en) * 2007-07-10 2008-01-09 北京大学 Metadata automatic extraction method based on multiple rule in network search
US20110173197A1 (en) * 2010-01-12 2011-07-14 Yahoo! Inc. Methods and apparatuses for clustering electronic documents based on structural features and static content features
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure
CN102651002A (en) * 2011-02-28 2012-08-29 腾讯科技(深圳)有限公司 Webpage information extracting method and system
CN102289445A (en) * 2011-06-01 2011-12-21 宇龙计算机通信科技(深圳)有限公司 Method and device for analyzing XML (Extensible Markup Language) file and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
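邱韬奋 (Qiu Taofen): "Research on Web Information Extraction Technology Based on Clustering Algorithms", China Master's Theses Full-text Database, Information Science and Technology Series *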

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138546A (en) * 2015-07-10 2015-12-09 国家电网公司 Dom4J based IMS information equipment ledger duplicate elimination method
CN105138546B (en) * 2015-07-10 2018-11-06 国家电网公司 Dom4J based IMS information equipment ledger duplicate elimination method
CN105589918A (en) * 2015-09-17 2016-05-18 广州市动景计算机科技有限公司 Method and device for extracting page information
CN105589918B (en) * 2015-09-17 2017-04-05 广州市动景计算机科技有限公司 Method and device for extracting page information
WO2017173783A1 (en) * 2016-04-07 2017-10-12 中兴通讯股份有限公司 Method of displaying point of interest, and terminal
CN107402912B (en) * 2016-05-19 2019-12-31 北京京东尚科信息技术有限公司 Method and device for analyzing semantics
CN107402912A (en) * 2016-05-19 2017-11-28 北京京东尚科信息技术有限公司 Method and apparatus for parsing semantics
CN106599160A (en) * 2016-12-08 2017-04-26 网帅科技(北京)有限公司 Content rule base management system and encoding method thereof
CN106599160B (en) * 2016-12-08 2020-06-02 网帅科技(北京)有限公司 Content rule library management system and coding method thereof
CN107808000A (en) * 2017-11-13 2018-03-16 哈尔滨工业大学(威海) Hidden web data collection and extraction system and method
CN109190003A (en) * 2018-08-20 2019-01-11 上海蜜度信息技术有限公司 Method and apparatus for determining list page nodes
CN109190003B (en) * 2018-08-20 2021-03-02 上海蜜度信息技术有限公司 Method and apparatus for determining list page nodes
CN109344341A (en) * 2018-10-31 2019-02-15 长春理工大学 Chinese geographical information query method and system
CN111726336A (en) * 2020-05-14 2020-09-29 北京邮电大学 Method and system for extracting identification information of networked intelligent equipment
CN111726336B (en) * 2020-05-14 2021-10-29 北京邮电大学 Method and system for extracting identification information of networked intelligent equipment

Similar Documents

Publication Publication Date Title
CN103970898A (en) Method and device for extracting information based on multistage rule base
CN108932294B (en) Resume data processing method, device, equipment and storage medium based on index
CN103023714B (en) Network-topic-based activity and cluster topology analysis system and method
CN111783394B (en) Training method of event extraction model, event extraction method, system and equipment
CN103246732B (en) Method and system for extracting online Web news content
CN104615724A (en) Establishing method of knowledge base and information search method and device based on knowledge base
US20110314001A1 (en) Performing query expansion based upon statistical analysis of structured data
CN109657068A (en) Cultural relic knowledge graph generation and visualization method for smart museums
CN105677857B (en) Method and device for accurately matching keywords with marketing landing pages
CN103838796A (en) Webpage structured information extraction method
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN102567409A (en) Method and device for providing retrieval associated word
CN104133855A (en) Smart association method and device for input method
CN103778238A (en) Method for automatically building classification tree from semi-structured data of Wikipedia
CN103886094A (en) Method for error correction and expansion of electronic commerce search engine
CN103399862A (en) Method and equipment for confirming searching guide information corresponding to target query sequences
CN110008473A (en) Iteration-based named entity recognition and annotation method for medical text
CN103559202B (en) Webpage content extraction apparatus and method
CN104317845A (en) Method and system for automatic extraction of deep web data
CN102004805B (en) Webpage denoising system and method based on maximum similarity matching
CN106339381B (en) Information processing method and device
US20120284224A1 (en) Build of website knowledge tables
CN103377207B (en) Microblog user relationship acquisition method based on script engine
CN113254671B (en) Atlas optimization method, device, equipment and medium based on query analysis
CN114564638A (en) News collection and automatic extraction method based on deep graph neural network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140806
