CN107436931B - Webpage text extraction method and device - Google Patents

Info

Publication number
CN107436931B
Authority
CN
China
Prior art keywords
webpage
text
visual
dom tree
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710581136.XA
Other languages
Chinese (zh)
Other versions
CN107436931A (en)
Inventor
晋彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunrun Da Data Service Co ltd
Original Assignee
Yunrun Da Data Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunrun Da Data Service Co ltd
Priority to CN201710581136.XA
Publication of CN107436931A
Application granted
Publication of CN107436931B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage text extraction method and device. A webpage is downloaded and its source code is obtained; a DOM tree is created from the source code; a visual tree is generated from the DOM tree and the page style of the webpage; the visual tree is rendered with a visual rendering technology to generate a visual recognition model; a text field is located based on the visual recognition model; and a characteristic text is extracted based on the text field, so that the text corpus of the webpage is obtained. The method avoids the drawbacks of the manual rules and templates used in existing extraction techniques, extracts webpage content effectively, is highly compatible, and removes noise thoroughly.

Description

Webpage text extraction method and device
Technical Field
The invention relates to the field of computers, in particular to a webpage text extraction method and device.
Background
In the field of news (or information) search, extracting the news text is an essential step, and the quality of the extraction determines the quality of the search results and the user experience. News text extraction methods currently take many forms, but most are template-based (or wrapper-based): a template is first defined, and a program then parses and executes the template to obtain the data. Depending on how the template is generated, these methods divide into manual template extraction and automatic template extraction. In manual template extraction, a template is written by hand for each target site; the template may use regular-expression matching or simple string matching. In automatic template extraction, a machine learning algorithm is trained on a sample of webpage data from the target website to produce a template, and a program then extracts the data using that template. The disadvantage of hand-written templates is that writing them consumes a great deal of manpower, and maintaining them as target websites change is very costly. Whether the template is generated manually or automatically, the underlying assumption is that the website's data is itself generated from a template. For some large websites this is largely unproblematic, apart from different entry points possibly using different templates. For many small and medium-sized websites, however, the assumption holds poorly: template extraction recovers only most of the information and is more likely to include junk information.
Disclosure of Invention
The embodiments of the invention aim to provide a webpage text extraction method and device that avoid the drawbacks of the manual rules and templates used in existing extraction techniques, extract webpage content effectively, are highly compatible, and remove noise thoroughly.
In order to achieve the above object, an embodiment of the present invention provides a method for extracting a web page text, including the steps of:
downloading a webpage, and acquiring a webpage source code according to the webpage;
creating a DOM tree according to the webpage source code, and generating a visual tree based on the DOM tree and the page style of the webpage;
rendering the visual tree by adopting a visual rendering technology to generate a visual recognition model, and positioning a text field based on the visual recognition model;
and extracting a characteristic text based on the text field so as to obtain a text corpus of the webpage.
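As a minimal illustration of the first two steps, the sketch below fetches a page and builds a simple DOM-like tree with Python's standard `html.parser`. The patent does not name a parser; the `Node` class and the `DomBuilder`, `fetch_source` and `build_dom` names are illustrative assumptions, and a production system would more likely rely on a browser engine.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.request import urlopen

# Void elements never receive a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


@dataclass
class Node:
    """Hypothetical DOM node: tag name, attributes, direct text, children."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    text: str = ""


class DomBuilder(HTMLParser):
    """Builds a Node tree while the parser walks the source code."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].text += data


def fetch_source(url: str) -> str:
    """Download a webpage and return its source code (step 1)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def build_dom(source: str) -> Node:
    """Create a DOM tree from webpage source code (first half of step 2)."""
    builder = DomBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Calling `build_dom(fetch_source(url))` would chain the two steps; keeping them separate lets the tree builder be exercised on a literal string without network access.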
Compared with the prior art, the disclosed webpage text extraction method downloads a webpage and obtains its source code, creates a DOM tree from the source code, generates a visual tree from the DOM tree and the page style of the webpage, renders the visual tree with a visual rendering technology to generate a visual recognition model, locates a text field based on the visual recognition model, and extracts a characteristic text based on the text field, so that the text corpus of the webpage is obtained. The method thereby avoids the drawbacks of the manual rules and templates used in existing extraction techniques, extracts webpage content effectively, is highly compatible, and removes noise thoroughly.
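The step of generating a visual tree from the DOM tree and the page style can be pictured as pruning everything that would never be displayed. The sketch below is an assumption-laden simplification: it presumes each node already carries its resolved CSS properties in a `style` dictionary, treats `visibility: hidden` as hiding the whole subtree (real CSS lets descendants override it), and uses the hypothetical names `Node`, `is_visible` and `to_visual_tree`.

```python
from dataclasses import dataclass, field

# Tags that never produce visible output on the rendered page.
NON_VISUAL_TAGS = {"script", "style", "head", "meta", "link", "noscript"}


@dataclass
class Node:
    """Hypothetical DOM node with already-resolved style properties."""
    tag: str
    style: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


def is_visible(node: Node) -> bool:
    """Return True when the node would take part in rendering."""
    if node.tag in NON_VISUAL_TAGS:
        return False
    if node.style.get("display") == "none":
        return False
    if node.style.get("visibility") == "hidden":
        return False
    return True


def to_visual_tree(node: Node):
    """Copy the DOM tree, keeping only nodes that would be rendered."""
    if not is_visible(node):
        return None
    kept = [v for v in (to_visual_tree(c) for c in node.children) if v is not None]
    return Node(node.tag, dict(node.style), node.text, kept)
```

Returning a pruned copy rather than mutating the DOM tree keeps the original tree available for the later step that parses characteristic nodes.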
As an improvement of the scheme, the method further comprises the following steps:
and integrating and typesetting the text corpora of the webpage according to the actual visual effect.
As an improvement of the above scheme, extracting the characteristic text based on the text field to obtain the text corpus of the webpage specifically includes:
identifying a mode of the text field based on the located text field;
parsing out characteristic nodes of the DOM tree according to the mode of the text field;
and extracting a characteristic text according to the characteristic nodes of the DOM tree.
As an improvement of the above scheme, identifying the mode of the text field specifically includes:
and identifying the text field as a single field or multiple fields so as to perform automatic adaptation.
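The patent does not state how a single field is told apart from multiple fields. One plausible reading, sketched below purely as an assumption, is to compare candidate region scores: if more than one candidate scores close to the best one, the page is treated as multi-field. The function name `classify_field_mode` and the `ratio` threshold are hypothetical.

```python
def classify_field_mode(scores, ratio=0.5):
    """Return "single" or "multi" from candidate text-field scores.

    A candidate counts as a strong text field when its score is at least
    `ratio` times the best score (an assumed, tunable threshold).
    """
    if not scores:
        raise ValueError("at least one candidate score is required")
    top = max(scores)
    if top <= 0:
        # No candidate carries any text signal; fall back to single-field handling.
        return "single"
    strong = [s for s in scores if s >= ratio * top]
    return "multi" if len(strong) > 1 else "single"
```

With this rule, a page whose article region dominates is handled as single-field, while a page with two comparably text-rich regions is handled as multi-field.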
As an improvement of the above scheme, identifying the mode of the text field specifically includes:
performing mode training on a large number of webpage structures, and extracting a distribution model of texts on pages; wherein the distribution model adaptively learns from input information and adds new features;
analyzing and processing a DOM tree of the webpage, and carrying out block clustering on each node of the DOM tree to obtain a node clustering result;
and extracting necessary information from the node clustering result through the distribution model, and obtaining the mode of the text field through the necessary information.
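Block clustering of DOM nodes can be sketched as grouping text-bearing nodes under their deepest block-level ancestor. The patent does not specify the clustering criterion, so the path-based grouping below, the `BLOCK_TAGS` set, and the `block_key` and `cluster_nodes` names are all assumptions made for illustration.

```python
from collections import defaultdict

# Tags assumed to delimit a visual block on the page.
BLOCK_TAGS = {"div", "article", "section", "td", "li", "main", "body"}


def block_key(path: str) -> str:
    """Return the path up to the deepest block-level ancestor.

    `path` is a slash-separated node path such as "body/div[2]/p".
    """
    parts = path.split("/")
    for i in range(len(parts) - 1, -1, -1):
        tag = parts[i].split("[")[0]
        if tag in BLOCK_TAGS:
            return "/".join(parts[: i + 1])
    return parts[0]


def cluster_nodes(text_nodes):
    """Group (path, text) pairs by their enclosing block."""
    clusters = defaultdict(list)
    for path, text in text_nodes:
        clusters[block_key(path)].append(text)
    return dict(clusters)
```

Each resulting cluster can then be summarized (for example by total text length) before a distribution model decides which clusters belong to the text field.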
The embodiment of the invention also provides a webpage text extraction device, which comprises:
the webpage source code acquisition module is used for downloading a webpage and acquiring a webpage source code according to the webpage;
the visual tree generation module is used for creating a DOM tree according to the webpage source code and generating a visual tree based on the DOM tree and the page style of the webpage;
the text domain positioning module is used for generating a visual recognition model after rendering the visual tree by adopting a visual rendering technology and positioning a text domain based on the visual recognition model;
and the text corpus acquisition module is used for extracting the characteristic text based on the text field so as to acquire the text corpus of the webpage.
Compared with the prior art, the disclosed webpage text extraction device downloads a webpage and obtains its source code through the webpage source code acquisition module; creates a DOM tree from the source code and generates a visual tree from the DOM tree and the page style of the webpage through the visual tree generation module; renders the visual tree with a visual rendering technology to generate a visual recognition model and locates a text field based on that model through the text field positioning module; and extracts a characteristic text based on the text field through the text corpus acquisition module, so that the text corpus of the webpage is obtained. The device thereby avoids the drawbacks of the manual rules and templates used in the prior art, extracts webpage content effectively, is highly compatible, and removes noise thoroughly.
As an improvement of the above scheme, the device further comprises:
and the integration module is used for integrating and typesetting the text corpora of the webpage according to the actual visual effect.
As an improvement of the above solution, the text corpus acquiring module includes:
an identification module for identifying a mode of a text field based on the located text field;
the characteristic node analysis module is used for analyzing the characteristic nodes of the DOM tree according to the mode of the text field;
and the characteristic text extraction module is used for extracting the characteristic text according to the characteristic nodes of the DOM tree.
As an improvement of the above scheme, identifying the mode of the text field specifically includes:
and identifying the text field as a single field or multiple fields so as to perform automatic adaptation.
As an improvement of the above solution, the identification module includes:
the distribution model extraction module is used for carrying out mode training on a large number of webpage structures and extracting a distribution model of texts on the pages; wherein the distribution model adaptively learns from input information and adds new features;
the clustering module is used for analyzing and processing the DOM tree of the webpage, and clustering each node of the DOM tree in a partitioning manner to obtain a node clustering result;
and the mode acquisition module is used for extracting necessary information from the node clustering result through the distribution model and acquiring the mode of the text field through the necessary information.
Drawings
Fig. 1 is a schematic flowchart of a method for extracting a web page text in embodiment 1 of the present invention.
Fig. 2 is a schematic structural diagram of a web page text extraction apparatus in embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a schematic flow chart of a method for extracting a web page text provided in embodiment 1 of the present invention includes the steps of:
s1, downloading a webpage, and acquiring a webpage source code according to the webpage;
s2, creating a DOM tree according to the webpage source code, and generating a visual tree based on the DOM tree and the page style of the webpage;
s3, generating a visual recognition model after rendering the visual tree by adopting a visual rendering technology, and positioning a text field based on the visual recognition model;
and S4, extracting the characteristic text based on the text field, thereby obtaining the text corpus of the webpage.
In a specific implementation, a webpage is downloaded and its source code is obtained; a DOM tree is created from the source code; a visual tree is generated from the DOM tree and the page style of the webpage; the visual tree is rendered with a visual rendering technology to generate a visual recognition model; a text field is located based on the visual recognition model; and a characteristic text is extracted based on the text field, so that the text corpus of the webpage is obtained. This avoids the drawbacks of the manual rules and templates used in existing extraction techniques, extracts webpage content effectively, and is highly compatible. It also lets a large-scale internet collection engine extract and analyze text more automatically and intelligently, without configuring a large number of parameters for each website and without even the time spent on template-style self-learning.
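The patent does not disclose how the visual recognition model scores regions of the rendered page. The sketch below substitutes a deliberately simple heuristic, stated here as an assumption: each rendered box is scored by its text length weighted by the share of the page area it covers, and the best-scoring box is taken as the text field. The `Box` class and the `score` and `locate_text_field` functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rendered layout box of one visual-tree node (coordinates in pixels)."""
    node_id: str
    x: float
    y: float
    width: float
    height: float
    text_chars: int


def score(box: Box, page_area: float) -> float:
    """Text length weighted by the fraction of the page the box covers."""
    return box.text_chars * (box.width * box.height) / page_area


def locate_text_field(boxes, page_width: float, page_height: float) -> Box:
    """Pick the box most likely to hold the main text under this heuristic."""
    if not boxes:
        raise ValueError("no rendered boxes to score")
    page_area = page_width * page_height
    return max(boxes, key=lambda b: score(b, page_area))
```

Because the score depends on rendered geometry rather than tag names, navigation bars and footers that are short or narrow score low even when their markup resembles the article's.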
In a preferred embodiment, on the basis of embodiment 1, the method further comprises the following steps:
and integrating and typesetting the text corpora of the webpage according to the actual visual effect.
Combining and typesetting the extracted corpus according to the actual visual effect improves readability.
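One way to read "according to the actual visual effect" is to reassemble the extracted blocks in the order a reader sees them on the rendered page. The sketch below assumes each block is a dictionary carrying its rendered `x` and `y` position alongside its `text`; the `typeset` function and the dictionary keys are illustrative, not taken from the patent.

```python
def typeset(blocks):
    """Join extracted text blocks in top-to-bottom, left-to-right order.

    Each block is a dict with "x", "y" (rendered position) and "text".
    Empty blocks are skipped and paragraphs are separated by a blank line.
    """
    ordered = sorted(blocks, key=lambda b: (b["y"], b["x"]))
    paragraphs = [b["text"].strip() for b in ordered]
    return "\n\n".join(p for p in paragraphs if p)
```

Sorting by rendered position rather than source order matters when CSS reorders content, for example when a title appears late in the markup but is displayed at the top.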
In a preferred embodiment, based on embodiment 1, step S4 specifically includes:
identifying a mode of the text field based on the located text field;
parsing out characteristic nodes of the DOM tree according to the mode of the text field;
and extracting a characteristic text according to the characteristic nodes of the DOM tree.
These steps enable more automatic and intelligent text extraction and analysis, and avoid the excessive resource usage and reduced efficiency caused by having to configure many parameters for each website.
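The last two sub-steps can be sketched as a walk over the located text field's subtree that keeps only nodes considered characteristic. Which nodes count as characteristic is not specified in the patent; the `FEATURE_TAGS` set below, the tuple representation `(tag, text, children)`, and the function names are assumptions for illustration.

```python
# Tags assumed to carry body text inside a located text field.
FEATURE_TAGS = {"p", "h1", "h2", "h3", "li", "blockquote"}


def feature_nodes(node):
    """Yield (tag, text) for characteristic nodes in document order.

    `node` is a (tag, text, children) tuple; nodes with blank text are skipped.
    """
    tag, text, children = node
    if tag in FEATURE_TAGS and text.strip():
        yield tag, text.strip()
    for child in children:
        yield from feature_nodes(child)


def extract_feature_text(field_root):
    """Return the characteristic text of a text field as a list of strings."""
    return [text for _, text in feature_nodes(field_root)]
```

In a multi-field mode the same function could simply be applied to each located field in turn and the results concatenated.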
Preferably, identifying the mode of the text field specifically includes:
and identifying the text field as a single field or multiple fields so as to perform automatic adaptation.
Single-field/multi-field identification can take into account not only text density but also multi-attribute density, probability density and the like. By contrast, other prior-art models use only a simple word count as the density dimension and fail when the density of copyright information or related information is too high.
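To illustrate why a single word-count density can fail on copyright or related-link blocks, the sketch below computes several densities for a block and combines them. The specific attributes (text, link and punctuation density), the thresholds, and the names `block_densities` and `is_body_like` are assumptions; the patent names "multi-attribute density" without enumerating the attributes.

```python
# ASCII and full-width punctuation that typically appears in running prose.
PUNCT = set(",.;:!?，。；：！？")


def block_densities(text: str, link_text: str, tag_count: int) -> dict:
    """Compute several density attributes for one block.

    text:      all visible text in the block
    link_text: the part of that text that sits inside links
    tag_count: number of tags in the block
    """
    chars = len(text)
    return {
        "text_density": chars / max(tag_count, 1),
        "link_density": len(link_text) / max(chars, 1),
        "punct_density": sum(ch in PUNCT for ch in text) / max(chars, 1),
    }


def is_body_like(d: dict, max_link_density=0.3, min_punct_density=0.01) -> bool:
    """Assumed rule: body text has few links and some sentence punctuation."""
    return d["link_density"] < max_link_density and d["punct_density"] > min_punct_density
```

A footer such as "Copyright 2017 Example Corp About Contact Privacy Terms" has a respectable character count, yet roughly half of it is link text and it contains no sentence punctuation, so the combined rule rejects it where a word count alone would not.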
Further, identifying the mode of the text field specifically includes:
performing mode training on a large number of webpage structures, and extracting a distribution model of texts on pages; wherein the distribution model adaptively learns from input information and adds new features;
analyzing and processing a DOM tree of the webpage, and carrying out block clustering on each node of the DOM tree to obtain a node clustering result;
and extracting necessary information from the node clustering result through the distribution model, and obtaining the mode of the text field through the necessary information.
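The statement that the distribution model learns adaptively and adds new features can be pictured with an incrementally updated model whose feature set grows as new inputs arrive. The class below is only a sketch under that interpretation: it keeps a running mean per feature and measures how far a candidate block lies from those means. The `DistributionModel` name, the running-mean representation, and the relative-distance measure are assumptions, not the patented model.

```python
class DistributionModel:
    """Running per-feature means learned from known body-text blocks."""

    def __init__(self):
        self.counts = {}
        self.means = {}

    def update(self, features: dict) -> None:
        """Learn from one example; unseen feature names are added on the fly."""
        for name, value in features.items():
            n = self.counts.get(name, 0) + 1
            mean = self.means.get(name, 0.0)
            self.counts[name] = n
            # Incremental mean: new_mean = old_mean + (value - old_mean) / n
            self.means[name] = mean + (value - mean) / n

    def distance(self, features: dict) -> float:
        """Mean relative deviation from the learned means over shared features."""
        shared = [name for name in features if name in self.means]
        if not shared:
            return float("inf")
        total = sum(
            abs(features[name] - self.means[name]) / (abs(self.means[name]) + 1e-9)
            for name in shared
        )
        return total / len(shared)
```

Clusters obtained from the DOM tree could then be ranked by `distance`, with the closest ones treated as belonging to the text field.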
Referring to fig. 2, a schematic structural diagram of a web page text extraction apparatus provided in embodiment 2 of the present invention includes:
the webpage source code acquiring module 101 is used for downloading a webpage and acquiring a webpage source code according to the webpage;
the visual tree generation module 102 is configured to create a DOM tree according to the web page source code, and generate a visual tree based on the DOM tree and the page style of the web page;
the text domain positioning module 103 is configured to generate a visual recognition model after rendering the visual tree by using a visual rendering technology, and position a text domain based on the visual recognition model;
and a text corpus acquiring module 104, configured to extract the feature text based on the text field, so as to acquire a text corpus of the webpage.
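The four modules can be viewed as stages of a pipeline in which each stage consumes the previous stage's output. The sketch below wires them together as injected callables; the class name `TextExtractionDevice` and its field names are illustrative, and the stage implementations are left to the caller because the patent does not prescribe them.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class TextExtractionDevice:
    """Hypothetical composition of modules 101 to 104 as pluggable stages."""
    acquire_source: Callable[[str], str]         # module 101: URL -> source code
    build_visual_tree: Callable[[str], Any]      # module 102: source -> visual tree
    locate_text_field: Callable[[Any], Any]      # module 103: visual tree -> text field
    acquire_corpus: Callable[[Any], List[str]]   # module 104: text field -> corpus

    def run(self, url: str) -> List[str]:
        """Run the four stages in order and return the text corpus."""
        source = self.acquire_source(url)
        tree = self.build_visual_tree(source)
        text_field = self.locate_text_field(tree)
        return self.acquire_corpus(text_field)
```

Injecting the stages keeps each module independently replaceable and makes the pipeline testable with stub functions instead of live downloads.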
In a specific implementation, the webpage source code acquiring module 101 downloads a webpage and obtains its source code; the visual tree generation module 102 creates a DOM tree from the source code and generates a visual tree from the DOM tree and the page style of the webpage; the text field positioning module 103 renders the visual tree with a visual rendering technology to generate a visual recognition model and locates a text field based on that model; and the text corpus acquiring module 104 extracts a characteristic text based on the text field, so that the text corpus of the webpage is obtained. This avoids the drawbacks of the manual rules and templates used in existing extraction techniques, extracts webpage content effectively, is highly compatible, and removes noise thoroughly.
In a preferred embodiment, the web page text extracting apparatus 100 further includes:
and the integration module is used for integrating and typesetting the text corpora of the webpage according to the actual visual effect.
In a preferred embodiment, the text corpus acquiring module includes:
an identification module for identifying a mode of a text field based on the located text field;
the characteristic node analysis module is used for analyzing the characteristic nodes of the DOM tree according to the mode of the text field;
and the characteristic text extraction module is used for extracting the characteristic text according to the characteristic nodes of the DOM tree.
In a preferred embodiment, identifying the mode of the text field specifically includes:
and identifying the text field as a single field or multiple fields so as to perform automatic adaptation.
In a preferred embodiment, the identification module comprises:
the distribution model extraction module is used for carrying out mode training on a large number of webpage structures and extracting a distribution model of texts on the pages; wherein the distribution model adaptively learns from input information and adds new features;
the clustering module is used for analyzing and processing the DOM tree of the webpage, and clustering each node of the DOM tree in a partitioning manner to obtain a node clustering result;
and the mode acquisition module is used for extracting necessary information from the node clustering result through the distribution model and acquiring the mode of the text field through the necessary information.
In summary, the disclosed webpage text extraction method and device download a webpage and obtain its source code, create a DOM tree from the source code, generate a visual tree from the DOM tree and the page style of the webpage, render the visual tree with a visual rendering technology to generate a visual recognition model, locate a text field based on the visual recognition model, and extract a characteristic text based on the text field, so that the text corpus of the webpage is obtained. They thereby avoid the drawbacks of the manual rules and templates used in existing extraction techniques, extract webpage content effectively, are highly compatible, and remove noise thoroughly.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (4)

1. A webpage text extraction method is characterized by comprising the following steps:
downloading a webpage, and acquiring a webpage source code according to the webpage;
creating a DOM tree according to the webpage source code, and generating a visual tree based on the DOM tree and the page style of the webpage;
rendering the visual tree by adopting a visual rendering technology to generate a visual recognition model, and positioning a text field based on the visual recognition model;
based on the located text field, identifying a pattern of the text field; the mode for identifying the text field specifically includes: identifying the text field as a single field or multiple fields so as to carry out automatic adaptation;
the mode for identifying the text field is specifically as follows:
performing mode training on a large number of webpage structures, and extracting a distribution model of texts on pages; wherein the distribution model adaptively learns and adds new features by input information;
analyzing and processing a DOM tree of the webpage, and carrying out block clustering on each node of the DOM tree to obtain a node clustering result;
extracting necessary information from the node clustering result through the distribution model, and obtaining the mode of the text field through the necessary information;
according to the mode of the text field, characteristic nodes of the DOM tree are analyzed out;
and extracting a characteristic text according to the characteristic nodes of the DOM tree.
2. The web page text extraction method according to claim 1, further comprising the steps of:
and integrating and typesetting the text corpora of the webpage according to the actual visual effect.
3. A web page text extraction apparatus, comprising:
the webpage source code acquisition module is used for downloading a webpage and acquiring a webpage source code according to the webpage;
the visual tree generation module is used for creating a DOM tree according to the webpage source code and generating a visual tree based on the DOM tree and the page style of the webpage;
the text domain positioning module is used for generating a visual recognition model after rendering the visual tree by adopting a visual rendering technology and positioning a text domain based on the visual recognition model;
the text corpus acquisition module is used for extracting a characteristic text based on the text field so as to acquire a text corpus of the webpage; the text corpus acquiring module comprises:
an identification module for identifying a mode of a text field based on the located text field; the mode for identifying the text field specifically includes: identifying the text field as a single field or multiple fields so as to carry out automatic adaptation;
the identification module comprises:
the distribution model extraction module is used for carrying out mode training on a large number of webpage structures and extracting a distribution model of texts on the pages; wherein the distribution model adaptively learns and adds new features by input information;
the clustering module is used for analyzing and processing the DOM tree of the webpage, and clustering each node of the DOM tree in a partitioning manner to obtain a node clustering result;
the mode acquisition module is used for extracting necessary information from the node clustering result through the distribution model and acquiring the mode of the text field through the necessary information;
the characteristic node analysis module is used for analyzing the characteristic nodes of the DOM tree according to the mode of the text field;
and the characteristic text extraction module is used for extracting the characteristic text according to the characteristic nodes of the DOM tree.
4. The web page text extraction apparatus according to claim 3, further comprising:
and the integration module is used for integrating and typesetting the text corpora of the webpage according to the actual visual effect.
CN201710581136.XA 2017-07-17 2017-07-17 Webpage text extraction method and device Expired - Fee Related CN107436931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710581136.XA CN107436931B (en) 2017-07-17 2017-07-17 Webpage text extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710581136.XA CN107436931B (en) 2017-07-17 2017-07-17 Webpage text extraction method and device

Publications (2)

Publication Number Publication Date
CN107436931A CN107436931A (en) 2017-12-05
CN107436931B true CN107436931B (en) 2020-12-22

Family

ID=60460257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710581136.XA Expired - Fee Related CN107436931B (en) 2017-07-17 2017-07-17 Webpage text extraction method and device

Country Status (1)

Country Link
CN (1) CN107436931B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255975B (en) * 2017-12-27 2021-05-07 东软集团股份有限公司 Template construction method, page content capture method and device, medium and equipment
CN111104636B (en) * 2019-12-30 2023-03-24 上海海事大学 Webpage shipping date data extraction method based on multi-view learning
CN111241446B (en) * 2020-01-13 2023-10-31 杭州安恒信息技术股份有限公司 Method, device, equipment and medium for extracting text content of web page
CN112667940B (en) * 2020-10-15 2022-02-18 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290624A (en) * 2008-06-11 2008-10-22 华东师范大学 News web page metadata automatic extraction method
CN102298638A (en) * 2011-08-31 2011-12-28 北京中搜网络技术股份有限公司 Method and system for extracting news webpage contents by clustering webpage labels
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
异构就业数据集成服务的设计与实现;张昕;《中国优秀硕士学位论文全文数据库 信息科技辑》;20150815(第8期);第32-45页 *

Also Published As

Publication number Publication date
CN107436931A (en) 2017-12-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201125

Address after: Room 5303, No. 1023, Gaopu Road, Tianhe Software Park, Tianhe District, Guangzhou City, Guangdong Province

Applicant after: Yunrun Da Data Service Co.,Ltd.

Address before: 510000 Yuexiu District, Guangzhou Province, north of the text of the text of the North Road, No. 68, the east wing of the text of the building on the ground floor, No. six, No. 602, No.

Applicant before: GUANGZHOU TEDAO INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201222