CN107436931B

CN107436931B - Webpage text extraction method and device

Info

Publication number: CN107436931B
Application number: CN201710581136.XA
Authority: CN
Inventors: 晋彤
Original assignee: Yunrun Da Data Service Co ltd
Current assignee: Yunrun Da Data Service Co ltd
Priority date: 2017-07-17
Filing date: 2017-07-17
Publication date: 2020-12-22
Anticipated expiration: 2037-07-17
Also published as: CN107436931A

Abstract

The invention discloses a webpage text extraction method and device. A webpage is downloaded and its source code is obtained; a DOM tree is created from the source code; a visual tree is generated based on the DOM tree and the page style of the webpage; the visual tree is rendered using a visual rendering technology to generate a visual recognition model; a text field is located based on the visual recognition model; and a characteristic text is extracted based on the text field, thereby obtaining the text corpus of the webpage. The method avoids the drawbacks of the manual rules and templates used in existing extraction techniques, extracts webpage content effectively, offers high compatibility, and removes impurities thoroughly.

Description

Webpage text extraction method and device

Technical Field

The invention relates to the field of computers, in particular to a webpage text extraction method and device.

Background

In the field of news (or information) search, news text extraction is an essential step, and the quality of the extraction determines the quality of news search and its user experience. Various news text extraction methods exist at present, and extraction is mainly template-based (also called wrapper-based): a template is defined first, and a program is then written to parse and execute the template to obtain the data. Depending on how the template is generated, the approach divides into manual template extraction and automatic template extraction. In manual template extraction, a template is written by hand for each target site; the template may use regular-expression matching or simple string matching. Automatic template extraction uses a machine learning algorithm that collects part of the webpage data from the target website for training, and a program then extracts data using the learned template. The disadvantage of manually written templates is that writing them consumes a great deal of manpower, and maintaining them as the target websites change is very costly. Whether the template is generated manually or automatically, the underlying assumption is that the website's data is itself generated from a template. For some large websites this is largely unproblematic, apart from the fact that different entry points may use different templates. For many small and medium-sized websites, however, templates work poorly: template extraction recovers only most of the information and is more likely to include junk information.
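The brittleness of a hand-written template can be illustrated with a minimal sketch. The markup, the class name `article-body`, and the function `extract_with_template` are hypothetical and are used only for illustration; they do not come from any real site or from the invention.

```python
import re

# Hypothetical hand-written template for one site: the body text is
# assumed to live in <div class="article-body">...</div>.
TEMPLATE = re.compile(r'<div class="article-body">(.*?)</div>', re.S)


def extract_with_template(html):
    """Return the text captured by the site-specific template, or None."""
    match = TEMPLATE.search(html)
    if match is None:
        return None
    # Strip any remaining inline tags from the captured fragment.
    return re.sub(r"<[^>]+>", "", match.group(1)).strip()


# The same article before and after a (hypothetical) site redesign.
old_page = '<div class="article-body"><p>Market news today.</p></div>'
new_page = '<section id="content"><p>Market news today.</p></section>'

old_result = extract_with_template(old_page)
new_result = extract_with_template(new_page)
```

Once the site changes its markup, the template no longer matches and extraction returns nothing until someone rewrites the template, which is the maintenance cost described above.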

Disclosure of Invention

The embodiments of the invention aim to provide a webpage text extraction method and device that avoid the drawbacks of the manual rules and templates used in existing extraction techniques, extract webpage content effectively, offer high compatibility, and remove impurities thoroughly.

In order to achieve the above object, an embodiment of the present invention provides a method for extracting a web page text, including the steps of:

downloading a webpage, and acquiring a webpage source code according to the webpage;

creating a DOM tree according to the webpage source code, and generating a visual tree based on the DOM tree and the page style of the webpage;

rendering the visual tree by adopting a visual rendering technology to generate a visual recognition model, and positioning a text field based on the visual recognition model;

and extracting a characteristic text based on the text field so as to obtain a text corpus of the webpage.
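The first two steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the invention: the DOM tree is built with Python's standard `html.parser`, and the "visual tree" is approximated by dropping nodes that a browser would not paint (non-visual tags and elements with an inline `display:none`). A real visual tree would need a full style and layout engine. The names `Node`, `DomBuilder`, `fetch_source`, `build_dom`, and `build_visual_tree` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
NON_VISUAL_TAGS = {"head", "script", "style", "meta", "link", "title"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def fetch_source(url, timeout=10):
    """Download a page and decode it using the declared charset."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def build_dom(source):
    builder = DomBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


def is_hidden(node):
    # Assumption: only inline styles are inspected in this sketch.
    style = node.attrs.get("style") or ""
    return "display:none" in style.replace(" ", "").lower()


def build_visual_tree(node):
    """Copy the DOM, keeping only nodes that would be painted."""
    if node.tag in NON_VISUAL_TAGS or is_hidden(node):
        return None
    copy = Node(node.tag, node.attrs.items())
    copy.text = node.text
    for child in node.children:
        visual_child = build_visual_tree(child)
        if visual_child is not None:
            visual_child.parent = copy
            copy.children.append(visual_child)
    return copy
```

In use, `build_visual_tree(build_dom(fetch_source(url)))` would yield a pruned tree in which scripts, head metadata, and hidden blocks no longer appear, so later steps only consider content a reader could actually see.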

Compared with the prior art, the webpage text extraction method disclosed by the invention downloads a webpage and obtains its source code, creates a DOM tree from the source code, generates a visual tree based on the DOM tree and the page style of the webpage, renders the visual tree using a visual rendering technology to generate a visual recognition model, locates a text field based on the visual recognition model, and extracts a characteristic text based on the text field, thereby obtaining the text corpus of the webpage. The method thus avoids the drawbacks of the manual rules and templates used in the prior art, extracts webpage content effectively, offers high compatibility, and removes impurities thoroughly.

As an improvement of the scheme, the method further comprises the following steps:

and integrating and typesetting the text corpora of the webpage according to the actual visual effect.

As an improvement of the above scheme, extracting the characteristic text based on the text field to obtain the text corpus of the webpage specifically includes:

based on the located text field, identifying a pattern of the text field;

parsing out characteristic nodes of the DOM tree according to the mode of the text field;

and extracting a characteristic text according to the characteristic nodes of the DOM tree.

As an improvement of the above scheme, identifying the mode of the text field specifically includes:

identifying the text field as a single field or multiple fields so as to perform automatic adaptation.

As an improvement of the above scheme, identifying the mode of the text field is specifically:

performing mode training on a large number of webpage structures, and extracting a distribution model of texts on pages; wherein the distribution model adaptively learns and adds new features from input information;

analyzing and processing a DOM tree of the webpage, and carrying out block clustering on each node of the DOM tree to obtain a node clustering result;

and extracting necessary information from the node clustering result through the distribution model, and obtaining the mode of the text field through the necessary information.
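The clustering and mode-identification steps can be sketched as follows. This is a rough illustration under assumptions: nodes are clustered simply by their tag path, and the trained distribution model is replaced by a fixed share threshold (`share_threshold`), which is a stand-in and not the model described above. The function names and the path strings are hypothetical.

```python
from collections import defaultdict


def cluster_by_path(text_nodes):
    """Group (tag_path, text) pairs by tag path; return path -> total chars."""
    clusters = defaultdict(int)
    for path, text in text_nodes:
        clusters[path] += len(text.strip())
    return dict(clusters)


def identify_mode(clusters, share_threshold=0.3):
    """Classify the page as 'single' or 'multi' field.

    Assumption: a cluster counts as a text field when it holds at least
    `share_threshold` of all text characters on the page.
    """
    total = sum(clusters.values())
    if total == 0:
        return "none", []
    major = [p for p, n in clusters.items() if n / total >= share_threshold]
    if not major:
        # No cluster is dominant; fall back to the largest one.
        major = [max(clusters, key=clusters.get)]
    mode = "single" if len(major) == 1 else "multi"
    return mode, sorted(major)


single_page = [
    ("body/div.main/p", "a" * 400),
    ("body/div.main/p", "b" * 300),
    ("body/div.nav/a", "home"),
    ("body/div.footer/span", "c" * 50),
]
multi_page = [
    ("body/div.part1/p", "a" * 300),
    ("body/div.part2/p", "b" * 300),
    ("body/div.nav/a", "x" * 20),
]

single_result = identify_mode(cluster_by_path(single_page))
multi_result = identify_mode(cluster_by_path(multi_page))
```

A learned distribution model would replace the fixed threshold with statistics gathered from many page structures, but the control flow (cluster, score clusters, decide single versus multiple fields) would stay the same.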

The embodiment of the invention also provides a webpage text extraction device, which comprises:

the webpage source code acquisition module is used for downloading a webpage and acquiring a webpage source code according to the webpage;

the visual tree generation module is used for creating a DOM tree according to the webpage source code and generating a visual tree based on the DOM tree and the page style of the webpage;

the text domain positioning module is used for generating a visual recognition model after rendering the visual tree by adopting a visual rendering technology and positioning a text domain based on the visual recognition model;

and the text corpus acquisition module is used for extracting the characteristic text based on the text field so as to acquire the text corpus of the webpage.

Compared with the prior art, the webpage text extraction device disclosed by the invention downloads a webpage and obtains its source code through the webpage source code acquisition module; creates a DOM tree from the source code and generates a visual tree based on the DOM tree and the page style of the webpage through the visual tree generation module; renders the visual tree using a visual rendering technology to generate a visual recognition model and locates the text field based on that model through the text domain positioning module; and extracts the characteristic text based on the text field through the text corpus acquisition module, thereby obtaining the text corpus of the webpage. The device thus avoids the drawbacks of the manual rules and templates used in the prior art, extracts webpage content effectively, offers high compatibility, and removes impurities thoroughly.

As an improvement of the above scheme, the device further comprises:

and the integration module is used for integrating and typesetting the text corpora of the webpage according to the actual visual effect.

As an improvement of the above solution, the text corpus acquiring module includes:

an identification module for identifying a mode of a text field based on the located text field;

the characteristic node analysis module is used for analyzing the characteristic nodes of the DOM tree according to the mode of the text field;

and the characteristic text extraction module is used for extracting the characteristic text according to the characteristic nodes of the DOM tree.

As an improvement of the above solution, the identification module includes:

the distribution model extraction module is used for carrying out mode training on a large number of webpage structures and extracting a distribution model of texts on the pages; wherein the distribution model adaptively learns and adds new features by input information;

the clustering module is used for analyzing and processing the DOM tree of the webpage, and clustering each node of the DOM tree in a partitioning manner to obtain a node clustering result;

and the mode acquisition module is used for extracting necessary information from the node clustering result through the distribution model and acquiring the mode of the text field through the necessary information.

Drawings

Fig. 1 is a schematic flowchart of a method for extracting a web page text in embodiment 1 of the present invention.

Fig. 2 is a schematic structural diagram of a web page text extraction apparatus in embodiment 2 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a schematic flow chart of a method for extracting a web page text provided in embodiment 1 of the present invention includes the steps of:

S1, downloading a webpage, and acquiring a webpage source code according to the webpage;

S2, creating a DOM tree according to the webpage source code, and generating a visual tree based on the DOM tree and the page style of the webpage;

S3, generating a visual recognition model after rendering the visual tree by adopting a visual rendering technology, and positioning a text field based on the visual recognition model;

and S4, extracting the characteristic text based on the text field, thereby obtaining the text corpus of the webpage.
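The text does not specify how the visual recognition model locates the text field. As one possible reading, offered only as an assumption, the rendered page can be treated as a set of blocks with bounding boxes, and the text field chosen by a simple heuristic that favors wide blocks carrying a lot of text. The `RenderedBlock` type, its fields, and the scoring rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RenderedBlock:
    """A block after layout; coordinates are assumed to come from a renderer."""
    name: str
    x: float
    y: float
    width: float
    height: float
    text_chars: int


def locate_text_field(blocks, page_width):
    """Return the block most likely to hold the main text (heuristic only)."""
    def score(block):
        if block.width <= 0 or block.height <= 0:
            return 0.0
        # Wide blocks are favored: sidebars and ads tend to be narrow.
        width_ratio = min(block.width / page_width, 1.0)
        return block.text_chars * width_ratio

    return max(blocks, key=score)


page = [
    RenderedBlock("header", 0, 0, 1000, 80, 40),
    RenderedBlock("sidebar", 700, 100, 300, 900, 600),
    RenderedBlock("article", 0, 100, 680, 1500, 3200),
    RenderedBlock("footer", 0, 1700, 1000, 60, 120),
]
located = locate_text_field(page, page_width=1000)
```

The point of working on rendered geometry rather than raw markup is that the decision depends on what the page looks like, not on site-specific tag or class names.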

In a specific implementation, a webpage is downloaded and its source code is obtained; a DOM tree is created from the source code; a visual tree is generated based on the DOM tree and the page style of the webpage; the visual tree is rendered using a visual rendering technology to generate a visual recognition model; a text field is located based on the visual recognition model; and a characteristic text is extracted based on the text field, thereby obtaining the text corpus of the webpage. This avoids the drawbacks of the manual rules and templates used in existing extraction techniques, extracts webpage content effectively, and offers high compatibility. It also allows large-scale Internet collection engines to provide more automated and intelligent text extraction and analysis, without configuring a large number of parameters for each website and without the time spent on self-learning and template comparison.

In a preferred embodiment, on the basis of embodiment 1, the method further comprises the following steps:

The extracted corpus can be fully integrated and typeset according to the actual visual effect, which improves readability.
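Typesetting according to the visual effect can be sketched as reordering extracted fragments by their rendered position. This is a minimal illustration assuming each fragment carries the `(x, y)` coordinates of its top-left corner; the function `typeset` and the `line_tolerance` parameter are hypothetical.

```python
def typeset(fragments, line_tolerance=5.0):
    """Join (x, y, text) fragments in visual reading order.

    Fragments whose y coordinates differ by at most `line_tolerance`
    from the first fragment of a line are placed on that line.
    """
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))
    lines = []
    current, current_y = [], None
    for x, y, text in ordered:
        if current and abs(y - current_y) > line_tolerance:
            lines.append(current)
            current = []
        if not current:
            current_y = y
        current.append((x, text))
    if current:
        lines.append(current)
    # Within a line, order fragments from left to right.
    return "\n".join(" ".join(t for _, t in sorted(line)) for line in lines)


fragments = [(200, 100, "world"), (10, 102, "Hello"), (10, 140, "Second line")]
page_text = typeset(fragments)
```

Source order in the markup may differ from what the reader sees, so sorting by rendered position is what keeps the output consistent with the visual layout.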

In a preferred embodiment, based on embodiment 1, step S4 specifically includes:

based on the located text field, identifying a pattern of the text field;

These steps enable more automated and intelligent text extraction and analysis, and avoid the excessive resource usage and reduced efficiency caused by having to configure too many parameters for each website.

Preferably, the mode for identifying the text field specifically includes:

By identifying whether the text field is a single field or multiple fields, the method can evaluate not only text density but also multi-element attribute density, probability density, and the like. Other models in the prior art use only a simple word count as the density dimension, and they fail when the density of copyright information or related-information blocks is too high.
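Why a word count alone can fail is easy to show with a small sketch. The densities named above are not defined in the text, so the example substitutes two commonly used measures as an assumption: text density (characters per tag) and link density (share of characters inside links). The block data and function names are hypothetical.

```python
def text_density(chars, tag_count):
    """Characters per tag; higher values suggest running prose."""
    return chars / max(tag_count, 1)


def link_density(link_chars, chars):
    """Share of characters that sit inside links."""
    return link_chars / chars if chars else 0.0


def combined_score(block):
    # Assumption: penalize text density by the share of link text.
    density = text_density(block["chars"], block["tags"])
    return density * (1.0 - link_density(block["link_chars"], block["chars"]))


blocks = [
    {"name": "article", "chars": 1200, "tags": 12, "link_chars": 60},
    {"name": "related", "chars": 1500, "tags": 30, "link_chars": 1450},
]

by_count = max(blocks, key=lambda b: b["chars"])["name"]
by_score = max(blocks, key=combined_score)["name"]
```

Here the "related" block has more characters than the article, so a pure count picks the wrong block, while a score built from more than one density dimension picks the article.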

Further, the mode for identifying the text field is specifically:

Referring to fig. 2, a schematic structural diagram of a web page text extraction apparatus provided in embodiment 2 of the present invention includes:

the webpage source code acquiring module 101 is used for downloading a webpage and acquiring a webpage source code according to the webpage;

the visual tree generation module 102 is configured to create a DOM tree according to the web page source code, and generate a visual tree based on the DOM tree and the page style of the web page;

the text domain positioning module 103 is configured to generate a visual recognition model after rendering the visual tree by using a visual rendering technology, and position a text domain based on the visual recognition model;

and a text corpus acquiring module 104, configured to extract the feature text based on the text field, so as to acquire a text corpus of the webpage.
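The way the four modules hand results to one another can be sketched as a small pipeline object. This only illustrates the data flow between modules 101 to 104; the class name `WebPageTextExtractor`, the callable signatures, and the stub stages are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class WebPageTextExtractor:
    """Wires the four modules together; each stage is a pluggable callable."""
    acquire_source: Callable[[str], str]        # module 101
    generate_visual_tree: Callable[[str], Any]  # module 102
    locate_text_field: Callable[[Any], Any]     # module 103
    acquire_corpus: Callable[[Any], str]        # module 104

    def run(self, url: str) -> str:
        source = self.acquire_source(url)
        visual_tree = self.generate_visual_tree(source)
        text_field = self.locate_text_field(visual_tree)
        return self.acquire_corpus(text_field)


# Stub stages standing in for real implementations (illustration only).
extractor = WebPageTextExtractor(
    acquire_source=lambda url: "<p>hi</p>",
    generate_visual_tree=lambda source: [source],
    locate_text_field=lambda tree: tree[0],
    acquire_corpus=lambda field: field.replace("<p>", "").replace("</p>", ""),
)
corpus = extractor.run("http://example.invalid/article")
```

Keeping each module behind a plain callable makes it possible to swap in a different renderer or locator without touching the other stages.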

In a specific implementation, the webpage source code acquiring module 101 downloads a webpage and obtains its source code; the visual tree generation module 102 creates a DOM tree from the source code and generates a visual tree based on the DOM tree and the page style of the webpage; the text domain positioning module 103 renders the visual tree using a visual rendering technology to generate a visual recognition model and locates a text field based on that model; and the text corpus acquiring module 104 extracts the characteristic text based on the text field, thereby obtaining the text corpus of the webpage. This avoids the drawbacks of the manual rules and templates used in existing extraction techniques, extracts webpage content effectively, offers high compatibility, and removes impurities thoroughly.

In a preferred embodiment, the web page text extracting apparatus 100 further includes:

In a preferred embodiment, the text corpus acquiring module includes:

In a preferred embodiment, the mode for identifying the text field specifically includes:

In a preferred embodiment, the identification module comprises:

In summary, the webpage text extraction method and device disclosed by the invention download a webpage and obtain its source code, create a DOM tree from the source code, generate a visual tree based on the DOM tree and the page style of the webpage, render the visual tree using a visual rendering technology to generate a visual recognition model, locate a text field based on the visual recognition model, and extract a characteristic text based on the text field, thereby obtaining the text corpus of the webpage. They thus avoid the drawbacks of the manual rules and templates used in existing extraction techniques, extract webpage content effectively, offer high compatibility, and remove impurities thoroughly.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A webpage text extraction method is characterized by comprising the following steps:

based on the located text field, identifying a pattern of the text field; the mode for identifying the text field specifically includes: identifying the text field as a single field or multiple fields so as to carry out automatic adaptation;

the mode for identifying the text field is specifically as follows:

extracting necessary information from the node clustering result through the distribution model, and obtaining the mode of the text field through the necessary information;

2. The web page text extraction method according to claim 1, further comprising the steps of:

3. A web page text extraction apparatus, comprising:

the text corpus acquisition module is used for extracting a characteristic text based on the text field so as to acquire a text corpus of the webpage; the text corpus acquiring module comprises:

an identification module for identifying a mode of a text field based on the located text field; the mode for identifying the text field specifically includes: identifying the text field as a single field or multiple fields so as to carry out automatic adaptation;

the identification module comprises:

the mode acquisition module is used for extracting necessary information from the node clustering result through the distribution model and acquiring the mode of the text field through the necessary information;

4. The web page text extraction apparatus according to claim 3, further comprising: