CN107423391A - Information extraction method of webpage structured data - Google Patents

Info

Publication number
CN107423391A
Authority
CN
China
Prior art keywords
node
path
dom
tree
reconstruct
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710605031.3A
Other languages
Chinese (zh)
Other versions
CN107423391B (en)
Inventor
陈星
张佳俊
王洲
王一洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201710605031.3A priority Critical patent/CN107423391B/en
Publication of CN107423391A publication Critical patent/CN107423391A/en
Application granted granted Critical
Publication of CN107423391B publication Critical patent/CN107423391B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972Access to data in other repository systems, e.g. legacy data or dynamic Web page generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an information extraction method for web page structured data. The web page code is first preprocessed to remove noise information. Page layout tags are taken as nodes, a DOM tree is constructed from the nesting and hierarchical relationships of those tags and stored in a List, and the DOM tree is pruned by checking whether branches are identical, which yields a reconstructed DOM tree. Nodes are then labeled with their node paths, the reconstructed DOM trees of two web pages are compared to determine the node path of the target object, and a corresponding wrapper is generated so that extraction is automatic. The invention can process large volumes of Web content quickly and automatically and extract the correct information.

Description

Information extraction method for web page structured data
Technical field
The invention belongs to the field of network information processing, and in particular relates to an information extraction method for web page structured data.
Background
The rapid development of the Internet has brought explosive growth in information. The Web has become a huge repository and an increasingly important, highly promising resource for global information dissemination and sharing. However, finding the required information quickly and accurately in this mass of resources, and making it available to other programs, has become a major problem. Information extraction technology is therefore needed to extract structured, topic-relevant data from large amounts of semi-structured information. Because HTML pages are designed mainly for browsing rather than for manipulation and use, the data in them is hard for application programs to use directly. Extracting data from web pages and passing it to application programs therefore remains a complex and difficult, but meaningful, task.
Summary of the invention
In view of this, the object of the invention is to provide an information extraction method for web page structured data that processes large volumes of Web content quickly and automatically and extracts the correct information.
To achieve this object, the invention adopts the following technical solution:
An information extraction method for web page structured data, comprising the following steps:
a) preprocessing the HTML code of two given sample web pages that have the same structure, and removing noise information;
b) for each web page obtained, taking the page layout tags as nodes and storing child nodes in order according to the nesting and hierarchical relationships of the layout tags, down to the innermost text nodes, which become leaf nodes; constructing a DOM tree in this way and storing it in a List;
c) pruning and reconstructing each of the two DOM trees: the contents of leaf nodes under identical branches are merged under a single leaf node, and the remaining identical branches are deleted;
d) parsing a JSON string to obtain the object information it contains, and storing that information in a List dedicated to object information, with each key name stored together with its key value;
e) labeling the node paths of each of the two reconstructed DOM trees, traversing each whole reconstructed DOM tree, and comparing the content of each leaf node with the object information obtained in step d); if they are the same, recording the node path of that leaf node;
f) comparing the node paths found for the two sample web pages and extracting the node path they have in common, which is the node path of the correct target object information; if several identical node paths are extracted, adding a new sample web page and repeating steps a) to f) until screening yields the node path of the correct target object information;
g) generating and reconstructing the DOM tree of a target web page that has the same structure as the sample web pages, labeling its node paths, traversing the whole reconstructed target DOM tree, and comparing each path with the node path obtained for the target object information; if they are the same, outputting the object content at the corresponding position as the target object information.
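Steps a) and b) can be illustrated with a minimal sketch built on Python's standard `html.parser` module. The patent does not say which tags count as noise or which parser is used, so `NOISE_TAGS`, `VOID_TAGS`, the `Node` class, and `build_dom` are assumptions made only for this illustration.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style"}   # assumed noise set; the patent does not list one
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag


class Node:
    """A layout tag, or a text leaf when tag == '#text'."""

    def __init__(self, tag, text=None):
        self.tag = tag
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]   # currently open tags, innermost last
        self.skip_depth = 0        # > 0 while inside a noise tag

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag)
        self.stack[-1].children.append(node)   # children are stored in document order
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Close back to the matching open tag, tolerating unclosed inner tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.stack[-1].children.append(Node("#text", text))   # innermost text becomes a leaf


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = ("<html><body><script>var a=1;</script>"
        "<div><span>Author</span><a>Lu Xun</a></div></body></html>")
root = build_dom(page)
```

Comments are dropped because `HTMLParser.handle_comment` does nothing by default; a production preprocessor would likely also need to handle malformed markup more carefully than this sketch does.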
Further, step c) specifically comprises:
c1: for the DOM tree to be pruned and reconstructed, starting from the root node, finding the first node in the DOM tree that has more than one child node;
c2: comparing all of its child nodes pairwise; if the number of child nodes of the current node is 0 and the two child nodes currently being compared are of the same type, performing the pruning operation;
c3: if the number of child nodes of the current node is not 0, recursively calling the DOM tree reconstruction algorithm again on its child subtree;
c4: calling a recursive function to determine whether the child subtrees are identical and, if they are, calling a recursive function to prune the child subtrees, finally obtaining the reconstructed DOM tree.
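One way to read steps c1 to c4 is sketched below: sibling subtrees with the same tag structure are treated as identical branches, the first is kept, and the leaf text of each duplicate is folded into it. This is an interpretation rather than the patented algorithm; the `Node` class, the `same_shape` test, and the `" | "` separator used when merging leaf content are all assumptions.

```python
class Node:
    def __init__(self, tag, children=None, text=None):
        self.tag = tag
        self.children = children or []
        self.text = text


def same_shape(a, b):
    """True when two subtrees have the same tags in the same arrangement."""
    if a.tag != b.tag or len(a.children) != len(b.children):
        return False
    return all(same_shape(x, y) for x, y in zip(a.children, b.children))


def merge_leaves(kept, dup):
    """Fold the leaf text of `dup` into the matching leaves of `kept`."""
    if not kept.children:
        if dup.text:
            kept.text = f"{kept.text} | {dup.text}" if kept.text else dup.text
        return
    for k, d in zip(kept.children, dup.children):
        merge_leaves(k, d)


def prune(node):
    """Reconstruct the tree rooted at `node` in place and return it."""
    for child in node.children:
        prune(child)                      # reduce each subtree before comparing siblings
    kept = []
    for child in node.children:
        match = next((k for k in kept if same_shape(k, child)), None)
        if match is None:
            kept.append(child)
        else:
            merge_leaves(match, child)    # keep the content, drop the duplicate branch
    node.children = kept
    return node


ul = Node("ul", [
    Node("li", [Node("#text", text="a")]),
    Node("li", [Node("#text", text="b")]),
    Node("p", [Node("#text", text="c")]),
])
prune(ul)
```

Here the two `li` branches collapse into one whose leaf holds `"a | b"`, while the structurally different `p` branch is kept.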
Further, the node path labeling in step e) includes feature tag path labeling and feature digital path labeling.
Further, the feature digital path labeling algorithm is as follows:
e1: for the reconstructed DOM tree whose feature digital paths are to be obtained, if the number of parent nodes of the current node M is not 0, obtaining and storing the feature digital path of its parent node, and storing the digit corresponding to node M at the end;
e2: if the number of parent nodes of the current node M is 0, storing the digit corresponding to node M;
e3: performing the following on each child node of node M in turn: if the number of child nodes of the i-th child node N of node M is not 0, recursively calling the feature digital path labeling algorithm on node N; if the number of child nodes of the i-th child node N is 0, obtaining the feature digital path of node M, storing it in the feature digital path of node N, and storing the digit corresponding to node N at the end;
e4: finally obtaining the feature digital paths of the reconstructed DOM tree.
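The labeling in steps e1 to e4 can be sketched as a single recursive pass. The text does not define the "digit corresponding to a node" or how a feature tag path is written, so this sketch assumes the digit is the node's zero-based position among its siblings, joins digits with `.`, and joins tag names with `/`; the `Node` class and `label_paths` are hypothetical names.

```python
class Node:
    def __init__(self, tag, children=None, text=None):
        self.tag = tag
        self.children = children or []
        self.text = text
        self.digit_path = None   # feature digital path, e.g. "0.0.1.0"
        self.tag_path = None     # feature tag path, e.g. "html/body/div/span"


def label_paths(node, parent=None, index=0):
    """Label `node` and all of its descendants with both path kinds."""
    if parent is None:
        # e2: a node without a parent stores only its own digit.
        node.digit_path = str(index)
        node.tag_path = node.tag
    else:
        # e1: copy the parent's path and append this node's digit at the end.
        node.digit_path = f"{parent.digit_path}.{index}"
        node.tag_path = f"{parent.tag_path}/{node.tag}"
    # e3: visit every child in order; leaves simply have nothing left to recurse into.
    for i, child in enumerate(node.children):
        label_paths(child, node, i)


span = Node("span")
root = Node("html", [Node("body", [Node("div"), Node("div", [span])])])
label_paths(root)
```

Two `div` siblings share the tag path `html/body/div` but receive different digital paths, which is why the digital path is the stricter of the two labels.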
Further, in step g), the object is first looked up and extracted according to the feature digital path obtained for the target object information; if no object content is extracted by the feature digital path, the object is then looked up and extracted according to the feature tag path of the target object information.
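The two-stage lookup can be sketched as follows, assuming the leaves of the reconstructed target tree have already been labeled and flattened into records. The `Leaf` record, the `extract` function, and the example paths are hypothetical and not taken from the patent figures.

```python
from typing import NamedTuple, Optional


class Leaf(NamedTuple):
    tag_path: str     # feature tag path of the leaf
    digit_path: str   # feature digital path of the leaf
    text: str         # leaf content


def extract(leaves, digit_path, tag_path) -> Optional[str]:
    """Try the stricter digital path first, then fall back to the tag path."""
    for leaf in leaves:
        if leaf.digit_path == digit_path:
            return leaf.text
    for leaf in leaves:
        if leaf.tag_path == tag_path:
            return leaf.text
    return None


# The target page's layout shifted slightly, so only the tag path still matches.
target_leaves = [Leaf("html/body/div/span/a", "0.1.3.0.1", "Luo Guanzhong")]
author = extract(target_leaves, "0.1.2.0.1", "html/body/div/span/a")
```

Returning the first tag-path match is a simplification: several leaves can share one tag path, so a fuller implementation might return all of them.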
Compared with the prior art, the invention has the following beneficial effects:
(1) When constructing the DOM tree, the invention uses tags as node labels, so construction takes little time, and the tree structure of the DOM tree represents the nesting and hierarchical relationships of the original page tags well;
(2) The invention prunes the DOM tree, keeping it as simple as possible and thereby reducing the storage resources used by the reconstructed DOM tree;
(3) When locating the target information, multiple sample web pages can be compared, so the node path of the target object information can be obtained accurately.
Brief description of the drawings
Fig. 1 is a flow diagram of the information extraction method for web page structured data of the invention;
Fig. 2 is a flowchart of the DOM tree reconstruction algorithm of the invention;
Fig. 3 is a flowchart of the feature digital path labeling algorithm of the invention;
Fig. 4 shows the node path results for one sample web page in the embodiment of the invention;
Fig. 5 shows the node path results for the other sample web page in the embodiment of the invention;
Fig. 6 shows the node path results for the target information in the embodiment of the invention.
Detailed description of the embodiment
The invention is further described below with reference to the drawings and an embodiment.
As shown in Fig. 1, the invention provides an information extraction method for web page structured data, comprising:
a) preprocessing the HTML code of two given sample web pages that have the same structure, and removing noise information;
b) for each web page obtained, taking the page layout tags as nodes and storing child nodes in order according to the nesting and hierarchical relationships of the layout tags, down to the innermost text nodes, which become leaf nodes; constructing a DOM tree in this way and storing it in a List;
c) pruning and reconstructing each of the two DOM trees: the contents of leaf nodes under identical branches are merged under a single leaf node, and the remaining identical branches are deleted. As shown in Fig. 2, DOM tree reconstruction specifically comprises:
c1: for the DOM tree to be pruned and reconstructed, starting from the root node, finding the first node in the DOM tree that has more than one child node;
c2: comparing all of its child nodes pairwise; if the number of child nodes of the current node is 0 and the two child nodes currently being compared are of the same type, performing the pruning operation;
c3: if the number of child nodes of the current node is not 0, recursively calling the DOM tree reconstruction algorithm again on its child subtree;
c4: calling a recursive function to determine whether the child subtrees are identical and, if they are, calling a recursive function to prune the child subtrees, finally obtaining the reconstructed DOM tree;
d) parsing a JSON string to obtain the object information it contains, and storing that information in a List dedicated to object information, with each key name stored together with its key value;
e) performing feature tag path labeling and feature digital path labeling on each of the two reconstructed DOM trees, traversing each whole reconstructed DOM tree, and comparing the content of each leaf node with the object information obtained in d); if they are the same, recording the node path of that leaf node;
As shown in Fig. 3, the feature digital path labeling algorithm is as follows:
e1: for the reconstructed DOM tree whose feature digital paths are to be obtained, if the number of parent nodes of the current node M is not 0, obtaining and storing the feature digital path of its parent node, and storing the digit corresponding to node M at the end;
e2: if the number of parent nodes of the current node M is 0, storing the digit corresponding to node M;
e3: performing the following on each child node of node M in turn: if the number of child nodes of the i-th child node N of node M is not 0, recursively calling the feature digital path labeling algorithm on node N; if the number of child nodes of the i-th child node N is 0, obtaining the feature digital path of node M, storing it in the feature digital path of node N, and storing the digit corresponding to node N at the end;
e4: finally obtaining the feature digital paths of the reconstructed DOM tree.
F) Path that two sample webpages are inquired is compared, extraction wherein identical Path, obtained To the Path of correct target object information;If extracting some identical Paths, increase new sample webpage, Repeat step a) to step f), until screening obtains the Path of correct target object information;
Pair g) dom tree generation and reconstruct are carried out with sample webpage structure identical target web, and carries out Path mark, time Go through whole target DOM reconstruct tree, first according to the feature digital path of acquired target object information carry out object lookup and Extraction, if do not have object output content according to feature digital path, further according to the feature tag path of target object information Carry out the lookup and extraction of object.
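Steps d) and e) amount to reading the instance objects from JSON and recording the path of every leaf whose content equals an instance value. The sketch below assumes a flat JSON object such as `{"author": "Lu Xun"}` and leaves that have already been labeled and flattened to `(tag_path, digit_path, text)` tuples; the patent does not give the JSON layout, and the paths shown are invented for illustration.

```python
import json


def load_objects(json_text):
    """Step d): parse the JSON string into a list of (key name, key value) pairs."""
    return list(json.loads(json_text).items())


def find_paths(leaves, value):
    """Step e): return the paths of every leaf whose content equals `value`."""
    return [
        (tag_path, digit_path)
        for tag_path, digit_path, text in leaves
        if text.strip() == value
    ]


# Hypothetical labeled leaves of one reconstructed sample tree.
leaves = [
    ("html/body/div/h1/span", "0.1.0.0.0", "Dawn Blossoms Plucked at Dusk"),
    ("html/body/div/span/a", "0.1.2.0.1", "Lu Xun"),   # the author field
    ("html/body/div/div/a", "0.1.5.3.0", "Lu Xun"),    # a tag with the same text
]

objects = load_objects('{"author": "Lu Xun"}')
matches = {key: find_paths(leaves, value) for key, value in objects}
```

Both leaves containing "Lu Xun" are recorded, so a single sample page leaves the author path ambiguous and a second sample is needed to resolve it.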
Take the extraction of the "author" object content from Douban Books web pages as an example.
First, the Douban Books pages for "Dawn Blossoms Plucked at Dusk" and "Romance of the Three Kingdoms" are used as the sample web pages. With the URL of the "Dawn Blossoms Plucked at Dusk" page as input and "Lu Xun" as the instance object input, the DOM tree of the page is constructed and pruned to form the reconstructed DOM tree. Querying the node path of the "Lu Xun" object outputs two node paths. This happens because the text of the original page contains not only the "Lu Xun" that corresponds to the "author" to be obtained, but also a "Lu Xun" among the tags that Douban attaches to the book; when the whole reconstructed DOM tree is traversed, the latter also meets the matching condition and its path is recorded. The path results are shown in Fig. 4: the first and second rows in Fig. 4 are the two feature tag paths where the "Lu Xun" object occurs, and the third and fourth rows are the corresponding feature digital paths. The first and third rows give the location of the required target information, while the second and fourth rows give the location of interfering information with identical text. In this situation a single sample web page cannot correctly identify the required target information, so another sample web page and its instance object input must be added for comparison before the correct object location can be determined.
Another sample web page, the Douban Books page for "Romance of the Three Kingdoms", is then added and the same operations are performed: the DOM tree of the page is constructed and reconstructed, and the node path of the "Luo Guanzhong" object is queried. The path results are shown in Fig. 5: the first and second rows in Fig. 5 are the two feature tag paths where the "Luo Guanzhong" object occurs, and the third and fourth rows are the corresponding feature digital paths.
Because both pages are Douban Books description pages, their basic format is the same, so the two feature tag paths for the two instances are identical, and the correct object path cannot be obtained by comparing feature tag paths. The feature digital path, however, is stricter about page format, and the layout and format of the two pages are not exactly the same, so the feature digital paths of the two objects differ. One path is still identical in both pages: this is the location path of the "author" object being sought. The other path differs between the pages and can therefore be discarded.
The node path of the correct target object information obtained from the comparison is shown in Fig. 6. This path is the correct feature path for the required target "author" information (it comprises both the feature tag path and the feature digital path).
In this embodiment, two sample web pages were enough to obtain the node path of the correct target information. In other cases more than two sample web pages may need to be input. Two sample web pages are compared first: the intersection is taken of the identical parts of their node paths and the union of the differing parts, the result is compared with the next sample web page, and so on, until the target feature path is obtained by comparison.
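The narrowing across samples can be sketched with plain set intersection. This simplification keeps only the paths shared by every sample seen so far and ignores the union of differing parts mentioned above; `narrow_paths` and the example digital paths are hypothetical.

```python
def narrow_paths(samples):
    """Intersect candidate path sets sample by sample.

    Stops as soon as at least two samples have been compared and at most one
    candidate remains. Returns (remaining candidates, number of samples used).
    """
    candidates = None
    used = 0
    for paths in samples:
        used += 1
        candidates = set(paths) if candidates is None else candidates & set(paths)
        if used >= 2 and len(candidates) <= 1:
            break
    return candidates or set(), used


# Hypothetical feature digital paths of the instance object in each sample page.
page1 = {"0.1.2.0.1", "0.1.5.3.0"}   # author field plus an interfering tag
page2 = {"0.1.2.0.1", "0.1.6.3.0"}   # the interfering tag sits elsewhere on this page
page3 = {"0.1.2.0.1", "0.1.7.1.0"}   # not needed: two samples already suffice

target_paths, samples_used = narrow_paths([page1, page2, page3])
```

With these inputs the shared path `0.1.2.0.1` survives after two samples, mirroring the embodiment in which the interfering path differs between the two pages.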
The DOM tree of the target web page is generated and reconstructed, the position of the object is located quickly using the feature digital path obtained, and the object content is extracted. If the feature digital path does not extract any object content, the whole reconstructed target DOM tree is traversed using the feature tag path, and each path is compared with the known feature tag path; where they are the same, the object content at the corresponding position is output, and this content is the required object information.
Although the invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Any person skilled in the art may, without departing from the spirit and scope of the invention, use the methods and technical content disclosed above to make possible changes and modifications to the technical solution of the invention. Therefore, any simple modification, equivalent change, or refinement made to the above embodiment according to the technical essence of the invention, without departing from the content of its technical solution, falls within the scope of protection of the technical solution of the invention. The foregoing is only a preferred embodiment of the invention; all equivalent changes and modifications made within the scope of the patent claims of the invention shall fall within the scope of the invention.

Claims (5)

1. An information extraction method for web page structured data, characterized by comprising the following steps:
a) preprocessing the HTML code of two given sample web pages that have the same structure, and removing noise information;
b) for each web page obtained, taking the page layout tags as nodes and storing child nodes in order according to the nesting and hierarchical relationships of the layout tags, down to the innermost text nodes, which become leaf nodes; constructing a DOM tree in this way and storing it in a List;
c) pruning and reconstructing each of the two DOM trees: the contents of leaf nodes under identical branches are merged under a single leaf node, and the remaining identical branches are deleted;
d) parsing a JSON string to obtain the object information it contains, and storing that information in a List dedicated to object information, with each key name stored together with its key value;
e) labeling the node paths of each of the two reconstructed DOM trees, traversing each whole reconstructed DOM tree, and comparing the content of each leaf node with the object information obtained in step d); if they are the same, recording the node path of that leaf node;
f) comparing the node paths found for the two sample web pages and extracting the node path they have in common, which is the node path of the correct target object information; if several identical node paths are extracted, adding a new sample web page and repeating steps a) to f) until screening yields the node path of the correct target object information;
g) generating and reconstructing the DOM tree of a target web page that has the same structure as the sample web pages, labeling its node paths, traversing the whole reconstructed target DOM tree, and comparing each path with the node path obtained for the target object information; if they are the same, outputting the object content at the corresponding position as the target object information.
2. The information extraction method for web page structured data according to claim 1, characterized in that step c) specifically comprises:
c1: for the DOM tree to be pruned and reconstructed, starting from the root node, finding the first node in the DOM tree that has more than one child node;
c2: comparing all of its child nodes pairwise; if the number of child nodes of the current node is 0 and the two child nodes currently being compared are of the same type, performing the pruning operation;
c3: if the number of child nodes of the current node is not 0, recursively calling the DOM tree reconstruction algorithm again on its child subtree;
c4: calling a recursive function to determine whether the child subtrees are identical and, if they are, calling a recursive function to prune the child subtrees, finally obtaining the reconstructed DOM tree.
3. The information extraction method for web page structured data according to claim 1, characterized in that the node path labeling in step e) includes feature tag path labeling and feature digital path labeling.
4. The information extraction method for web page structured data according to claim 3, characterized in that the feature digital path labeling algorithm is as follows:
e1: for the reconstructed DOM tree whose feature digital paths are to be obtained, if the number of parent nodes of the current node M is not 0, obtaining and storing the feature digital path of its parent node, and storing the digit corresponding to node M at the end;
e2: if the number of parent nodes of the current node M is 0, storing the digit corresponding to node M;
e3: performing the following on each child node of node M in turn: if the number of child nodes of the i-th child node N of node M is not 0, recursively calling the feature digital path labeling algorithm on node N; if the number of child nodes of the i-th child node N is 0, obtaining the feature digital path of node M, storing it in the feature digital path of node N, and storing the digit corresponding to node N at the end;
e4: finally obtaining the feature digital paths of the reconstructed DOM tree.
5. The information extraction method for web page structured data according to claim 3, characterized in that, in step g), the object is first looked up and extracted according to the feature digital path obtained for the target object information; if no object content is extracted by the feature digital path, the object is then looked up and extracted according to the feature tag path of the target object information.
CN201710605031.3A 2017-07-24 2017-07-24 Information extraction method of webpage structured data Active CN107423391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710605031.3A CN107423391B (en) 2017-07-24 2017-07-24 Information extraction method of webpage structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710605031.3A CN107423391B (en) 2017-07-24 2017-07-24 Information extraction method of webpage structured data

Publications (2)

Publication Number Publication Date
CN107423391A true CN107423391A (en) 2017-12-01
CN107423391B CN107423391B (en) 2020-11-03

Family

ID=60429995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710605031.3A Active CN107423391B (en) 2017-07-24 2017-07-24 Information extraction method of webpage structured data

Country Status (1)

Country Link
CN (1) CN107423391B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060242266A1 (en) * 2001-02-27 2006-10-26 Paula Keezer Rules-based extraction of data from web pages
US20090307256A1 (en) * 2008-06-06 2009-12-10 Yahoo! Inc. Inverted indices in information extraction to improve records extracted per annotation
CN101944094A (en) * 2009-07-06 2011-01-12 富士通株式会社 Webpage information extraction method and device thereof
CN102253937A (en) * 2010-05-18 2011-11-23 阿里巴巴集团控股有限公司 Method and related device for acquiring information of interest in webpages
CN102375847A (en) * 2010-08-17 2012-03-14 富士通株式会社 Method and device for forming merge tree for generating document template
CN102890681A (en) * 2011-07-20 2013-01-23 阿里巴巴集团控股有限公司 Method and system for generating webpage structure template
CN104572934A (en) * 2014-12-29 2015-04-29 西安交通大学 Webpage key content extracting method based on DOM
CN105653668A (en) * 2015-12-29 2016-06-08 武汉理工大学 Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIAOYU TANG等: "Regular expression-based reference metadata extraction from the web", 《2010 IEEE 2ND SYMPOSIUM ON WEB SOCIETY》 *
张冬梅等: "基于改进DSE算法的web信息抽取", 《数字技术与应用》 *
欧健文等: "模板化网页主题信息的提取方法", 《清华大学学报 自然科学版》 *
马金娜: "基于DOM树节点重要度的WEB主题信息提取研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694242A (en) * 2018-05-14 2018-10-23 中国平安财产保险股份有限公司 Node checks method, equipment, storage medium and device based on DOM
CN108694242B (en) * 2018-05-14 2023-03-21 中国平安财产保险股份有限公司 Node searching method, equipment, storage medium and device based on DOM
CN109254764B (en) * 2018-09-28 2022-03-15 福州大学 Method for acquiring runtime software architecture facing client application program
CN109254764A (en) * 2018-09-28 2019-01-22 福州大学 The method of software architecture when the acquisition operation of curstomer-oriented end application program
CN109683906A (en) * 2018-12-25 2019-04-26 北京小米移动软件有限公司 Handle the method and device of HTML code segment
CN110059085A (en) * 2019-03-18 2019-07-26 浙江工业大学 A kind of parsing of JSON data and modeling method of web oriented 2.0
CN110874428A (en) * 2019-11-11 2020-03-10 汉口北进出口服务有限公司 Structured data extraction device and method for e-commerce page and readable storage medium
CN111651694A (en) * 2020-05-21 2020-09-11 深圳市比一比网络科技有限公司 DOM tree processing method applied to webpage
CN111651694B (en) * 2020-05-21 2023-09-29 深圳市比一比网络科技有限公司 DOM tree processing method applied to webpage
CN111698364A (en) * 2020-06-19 2020-09-22 深圳市小满科技有限公司 Contact person information extraction method and related equipment
CN111698364B (en) * 2020-06-19 2021-09-21 深圳市小满科技有限公司 Contact person information extraction method, related device and computer readable storage medium
CN112307750A (en) * 2020-10-28 2021-02-02 汇承金融科技服务(南京)有限公司 Electronic draft flaw identification method, system, equipment and storage medium
CN114528811A (en) * 2022-01-21 2022-05-24 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN114528811B (en) * 2022-01-21 2022-09-02 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN115658993A (en) * 2022-09-27 2023-01-31 观澜网络(杭州)有限公司 Intelligent extraction method and system for core content of webpage

Also Published As

Publication number Publication date
CN107423391B (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN107423391A (en) The information extracting method of Web page structural data
CN108563729B (en) Bid winning information extraction method for bidding website based on DOM tree
CN103853760A (en) Method and device for extracting contents of bodies of web pages
CN104572934B (en) A kind of webpage key content abstracting method based on DOM
CN103226599B (en) A kind of method and system of accurate extraction web page contents
CN113254751B (en) Method, equipment and storage medium for accurately extracting complex webpage structured information
CN112732994B (en) Method, device and equipment for extracting webpage information and storage medium
CN103778238A (en) Method for automatically building classification tree from semi-structured data of Wikipedia
CN102760150A (en) Webpage extraction method based on attribute reproduction and labeled path
CN103123646B (en) XML document is converted into automatically conversion method and the device of OWL document
CN106547749A (en) The method and apparatus of collecting webpage data
CN105740355B (en) Webpage context extraction method and device based on aggregation text density
CN106528068A (en) Webpage content reconstruction method and system
CN103870495B (en) Method and device for extracting information from website
CN107943929B (en) Wrapper automatic generation method based on DOM tree abstraction
CN106843899A (en) A kind of web development methods and device based on Node.js platforms
CN108228656A (en) URL classification method and device based on CART decision trees
CN107193870A (en) The extracting method and system of web page contents
CN106940711A (en) A kind of URL detection methods and detection means
CN106372042B (en) A kind of document content acquisition methods and device
CN113608903A (en) Fault management method based on XML language
CN103309954A (en) Html webpage based data extracting system
Ebach et al. Assumption 2: opaque to intuition?
CN106547774B (en) Website content detection method and device
CN101576933A (en) Fully-automatic grouping method of WEB pages based on title separator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant