CN107153716B - Webpage content extraction method and device - Google Patents

Info

Publication number
CN107153716B
Authority
CN
China
Prior art keywords
webpage
extracted
html
label
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710418653.5A
Other languages
Chinese (zh)
Other versions
CN107153716A (en)
Inventor
余婷婷
胡飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710418653.5A
Publication of CN107153716A
Application granted
Publication of CN107153716B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a webpage content extraction method and device. The method comprises: parsing the webpage to be extracted to determine the HyperText Markup Language (html) tags it contains; extracting html features of the webpage to be extracted from the html tags; importing the extracted html features into a pre-trained picture webpage recognition model; and, in response to determining that the webpage to be extracted is a picture webpage, extracting the pictures in the webpage and the html tags corresponding to those pictures. Different strategies can thus be adopted to extract webpage content depending on the type of the webpage to be extracted (such as picture type or non-picture type), which improves the accuracy and comprehensiveness of webpage content extraction.

Description

Webpage content extraction method and device
Technical Field
The application relates to the field of computer technology, specifically to Internet technology, and more particularly to a method and a device for extracting webpage content.
Background
In Web data mining, extracting the body content of web pages is usually a basic step at the early stage. Whether the body content of a web page can be extracted efficiently and accurately, and whether the extraction approach generalizes easily to different websites, determines the effectiveness of the subsequent data mining.
In the prior art, a single extraction algorithm is usually used to extract the body content of web pages. However, a website has many sub-pages in various forms: the main content may be text, pictures, or a mixture of both, and the tags used across the website also vary. In addition, a website contains many pages from which no content needs to be extracted, such as home-page navigation pages and list pages, alongside the content pages from which material does need to be extracted. If a single algorithm is applied indiscriminately, it tends to extract excessive noise and cannot satisfy both the accuracy and the comprehensiveness required of body content extraction.
Disclosure of Invention
The present application aims to provide an improved method and apparatus for extracting web content, so as to solve the technical problems mentioned in the above background section.
In a first aspect, the present application provides a method for extracting web page content, including: parsing the webpage to be extracted to determine the HyperText Markup Language (html) tags contained in the webpage to be extracted; extracting html features of the webpage to be extracted from the html tags; importing the extracted html features into a pre-trained picture webpage recognition model; and in response to determining that the webpage to be extracted is a picture webpage, extracting the pictures in the webpage to be extracted and the html tags corresponding to the pictures.
In some embodiments, the method further comprises: in response to determining that the webpage to be extracted is a non-picture webpage, importing the extracted html features into a pre-trained material webpage identification model; and in response to determining that the webpage to be extracted is a material webpage, extracting the pictures and text in the webpage to be extracted.
In some embodiments, extracting html features of the webpage to be extracted from the html tags includes: screening out, from the html tags, the html body tags corresponding to the body of the webpage to be extracted; and traversing each html body tag of the webpage to be extracted to determine the html features of the webpage to be extracted.
In some embodiments, the html features include at least one of: the proportion of the html body tags of the webpage to be extracted whose category is picture tag; the proportion of the html body tags of the webpage to be extracted whose category is hyperlink tag; the proportion of the html body tags of the webpage to be extracted whose category is form tag; the text density of each html body tag of the webpage to be extracted, where the text density is the ratio of the text length contained in that html body tag to the sum of the text lengths contained in all html body tags of the webpage to be extracted; and statistical features of the text densities of the html body tags of the webpage to be extracted.
In some embodiments, before parsing the webpage to be extracted to determine the html tags contained in the webpage to be extracted, the method further includes: in response to receiving the Uniform Resource Locator (URL) of a webpage, parsing out each webpage that belongs to the same website as that webpage, to serve as the webpages to be extracted.
In a second aspect, the present application provides a web content extraction apparatus, including: an analysis module configured to parse the webpage to be extracted to determine the HyperText Markup Language (html) tags contained in the webpage to be extracted; a feature extraction module configured to extract html features of the webpage to be extracted from the html tags; a picture webpage recognition module configured to import the extracted html features into a pre-trained picture webpage recognition model; and a picture content feature extraction module configured to, in response to determining that the webpage to be extracted is a picture webpage, extract the pictures in the webpage to be extracted and the html tags corresponding to the pictures.
In some embodiments, the apparatus further comprises: a material webpage identification module configured to, in response to determining that the webpage to be extracted is a non-picture webpage, import the extracted html features into a pre-trained material webpage identification model; and a material content feature extraction module configured to, in response to determining that the webpage to be extracted is a material webpage, extract the pictures and text in the webpage to be extracted.
In some embodiments, the feature extraction module is further configured to: screen out, from the html tags, the html body tags corresponding to the body of the webpage to be extracted; and traverse each html body tag of the webpage to be extracted to determine the html features of the webpage to be extracted.
In some embodiments, the html features include at least one of: the proportion of the html body tags of the webpage to be extracted whose category is picture tag; the proportion of the html body tags of the webpage to be extracted whose category is hyperlink tag; the proportion of the html body tags of the webpage to be extracted whose category is form tag; the text density of each html body tag of the webpage to be extracted, where the text density is the ratio of the text length contained in that html body tag to the sum of the text lengths contained in all html body tags of the webpage to be extracted; and statistical features of the text densities of the html body tags of the webpage to be extracted.
In some embodiments, the analysis module is further configured to, before parsing the webpage to be extracted to determine the html tags contained in the webpage to be extracted: in response to receiving the Uniform Resource Locator (URL) of a webpage, parse out each webpage that belongs to the same website as that webpage, to serve as the webpages to be extracted.
In a third aspect, the present application provides an electronic device comprising one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the web content extraction method as above.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the web content extraction method as above.
According to the technical solution of the present application, the webpage to be extracted is parsed to determine the html tags it contains, html features are extracted from the html tags, and whether the webpage to be extracted is a picture webpage is determined based on the html features; if it is, the pictures in the webpage are extracted. Different strategies can thus be adopted to extract webpage content depending on the type of the webpage to be extracted (such as picture type or non-picture type), which improves the accuracy and comprehensiveness of webpage content extraction.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a web page content extraction method according to the present application;
FIG. 3 is a schematic flow chart diagram illustrating another embodiment of a web content extraction method according to the present application;
FIG. 4 is a decomposed flow chart of extracting html features of the web page to be extracted from the html tags in a web page content extraction method according to embodiments of the present application;
FIG. 5 is a diagram illustrating an application scenario of a web content extraction method according to the present application;
FIG. 6 is a schematic block diagram of an embodiment of a web content extraction apparatus according to the present application;
FIG. 7 is a schematic structural diagram of a computer system suitable for implementing the terminal device or the server according to the embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the web content extraction method or web content extraction apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include a first server 101, a plurality of web servers 102, and a network 103. The network 103 is used to provide a medium for communication links between the first server 101 and the respective web servers 102. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the first server 101 to interact with various web servers 102 over the network 103 to receive or send messages or the like. The first server 101 may have various communication applications installed thereon, such as a web browser application, a web crawling application, and the like.
The first server 101 may be a server providing various services, such as a data mining server that performs web content extraction on web pages provided by the web server 102. The data mining server can analyze and process the obtained webpage data, so as to extract the text content of the webpage.
It should be noted that, the web content extracting method provided in the embodiment of the present application is generally executed by the first server 101, and accordingly, the web content extracting apparatus is generally disposed in the first server 101.
It should be understood that the numbers of first servers 101, networks 103, and web servers 102 in fig. 1 are merely illustrative. There may be any number of first servers, networks, and web servers, as required by the implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a web page content extraction method according to the present application is shown. The webpage content extraction method comprises the following steps:
Step 210, parsing the webpage to be extracted to determine the HyperText Markup Language (html) tags contained in the webpage to be extracted.
In this embodiment, the electronic device on which the web content extraction method runs (for example, the first server shown in fig. 1) may obtain web page data from one or more web servers through a wired or wireless connection. For example, the electronic device may receive, as input from a user, the address (i.e., the web address) of the web page on which content extraction is to be performed. In practice, the web address is generally represented by a Uniform Resource Locator (URL). It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.
In some application scenarios, after receiving a web address input by a user, the electronic device may send a web content acquisition request to a corresponding web server based on the web address to request to acquire data of a web page (i.e., a web page to be extracted) corresponding to the web address.
Generally, the data of a web page may include a plurality of HyperText Markup Language (html) files. An html file may indicate the type of each portion of the web page data, such as text, graphics, animation, sound, table, or link, and each html file may represent a portion of the content of the web page. An html file can include a plurality of html tags, and all html tags belonging to the web page can be obtained by parsing the web page to be extracted.
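As an illustration of the parsing in step 210, the following is a minimal sketch that collects every html tag of a page using Python's standard `html.parser` module. The class name `TagCollector`, the function `parse_html_tags`, and the sample page are assumptions made for this example; the source does not specify which parser is used.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the name and attributes of every start tag in a page."""

    def __init__(self):
        super().__init__()
        self.tags = []  # (tag_name, attributes) pairs in document order

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        self.tags.append((tag, dict(attrs)))


def parse_html_tags(html_text):
    """Return all html tags found in the given page source."""
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tags


# Hypothetical page source used only for illustration.
page = "<html><body><p>Hello <a href='/x'>link</a></p><img src='a.png'></body></html>"
tags = parse_html_tags(page)
tag_names = [name for name, _ in tags]
```

Here `tags` keeps the attributes of each tag next to its name, so later steps can inspect, for example, the `src` of a picture tag.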
Step 220, extracting html features of the webpage to be extracted from the html tags.
Html features are understood here to mean any quantitative and/or qualitative description that can represent the characteristics of the web page to be extracted.
Step 230, importing the extracted html features into a pre-trained picture webpage recognition model.
In this step, by importing the html features of the web page to be extracted into the pre-trained picture web page recognition model, it can be determined whether the web page to be extracted is a picture web page. Here, a picture web page may be understood, for example, as a web page in which the proportion of picture content exceeds a preset proportion threshold.
Here, the picture web page recognition model may be, for example, a support vector machine (SVM) classification model. For training, html file data of a plurality of web pages may first be collected, each web page labeled as a picture web page or not, and the html features of the web pages extracted. The labels and the html features of these web pages are then input as labeled data into the SVM classification model, and the model is trained iteratively; when the recall of the model reaches a preset recall threshold, training can be considered complete.
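The training loop described above can be sketched as follows. This is only a toy illustration under stated assumptions: a hand-written linear SVM trained by subgradient descent on the hinge loss stands in for whatever SVM implementation is actually used, the two-dimensional feature vectors and their values are made up, and the function names, learning rate, regularisation strength, and the 0.95 recall threshold are all hypothetical.

```python
def dot(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def predict(weights, bias, features):
    """Return +1 (picture page) or -1 (non-picture page)."""
    return 1 if dot(weights, features) + bias >= 0 else -1


def recall(labels, predictions):
    """Fraction of actual positives that were predicted positive."""
    positives = [p for y, p in zip(labels, predictions) if y == 1]
    return positives.count(1) / len(positives) if positives else 0.0


def train_linear_svm(samples, labels, recall_threshold=0.95,
                     learning_rate=0.1, reg=0.01, max_epochs=1000):
    """Train until the recall on the labeled data reaches the threshold."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        for features, label in zip(samples, labels):
            margin = label * (dot(weights, features) + bias)
            # L2 regularisation shrinks the weights on every step.
            weights = [w * (1 - learning_rate * reg) for w in weights]
            if margin < 1:
                # Sample is misclassified or inside the margin: hinge-loss step.
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * label
        predictions = [predict(weights, bias, s) for s in samples]
        if recall(labels, predictions) >= recall_threshold:
            break  # preset recall threshold reached
    return weights, bias


# Made-up training set: [picture-tag ratio, hyperlink-tag ratio] per page.
samples = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.1],
           [0.1, 0.6], [0.2, 0.7], [0.1, 0.8]]
labels = [1, 1, 1, -1, -1, -1]  # +1 = labeled as picture page
weights, bias = train_linear_svm(samples, labels)
```

The stopping rule mirrors the recall criterion in the text. Since a model that labels every page as positive also has full recall, a practical setup would presumably measure recall on held-out pages and check precision as well; that is a remark on this sketch, not something the source states.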
Step 240, in response to determining that the webpage to be extracted is a picture webpage, extracting the pictures in the webpage to be extracted and the html tags corresponding to the pictures.
In step 230, the html features of the web page to be extracted are imported into the pre-trained picture web page recognition model for judgment. If the web page to be extracted is determined to be a picture web page, then in step 240 all content in picture formats (including but not limited to .jpg, .bmp, .png, and .raw) in the web page, together with the tags corresponding to the pictures, may be extracted as the final extracted web page content.
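A minimal sketch of the picture extraction in step 240 is given below. It assumes that "the tag corresponding to the picture" means the `<img>` tag together with its attributes, which is an interpretation made for this example; `ImageExtractor` and the sample page are likewise hypothetical.

```python
from html.parser import HTMLParser


class ImageExtractor(HTMLParser):
    """Collects each picture source together with its <img> tag attributes."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        source = attributes.get("src")
        if source:  # skip <img> tags that do not reference a picture
            self.images.append({"src": source, "tag": tag, "attrs": attributes})


# Hypothetical picture page used only for illustration.
page = (
    "<body><img src='a.jpg' alt='first'>"
    "<p>caption</p><img src='b.png'><img alt='no source'></body>"
)
extractor = ImageExtractor()
extractor.feed(page)
extractor.close()
sources = [image["src"] for image in extractor.images]
```

Downloading the picture files themselves is left out; the sketch only records where each picture lives and which tag refers to it.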
Before extracting web page content, the method of this embodiment first determines, based on the html features of the web page to be extracted, whether it is a picture web page, and extracts the pictures and their corresponding tags as the content extraction result only when it is. This avoids the incomplete extraction that would result from extracting only the pictures of a non-picture page (for example, a mixed picture-and-text page whose body contains a certain proportion of text in addition to pictures).
Referring to FIG. 3, a schematic flow chart 300 of another embodiment of a web content extraction method of the present application is shown.
The webpage content extracting method of the embodiment comprises the following steps:
Step 310, parsing the webpage to be extracted to determine the HyperText Markup Language (html) tags contained in the webpage to be extracted.
Step 320, extracting html features of the webpage to be extracted from the html tags.
Step 330, importing the extracted html features into a pre-trained picture webpage recognition model.
Step 340, in response to determining that the webpage to be extracted is a picture webpage, extracting the pictures in the webpage to be extracted and the tags corresponding to the pictures.
Steps 310 to 340 of this embodiment are similar to steps 210 to 240 of the embodiment shown in fig. 2, and are not described again here.
Unlike the embodiment shown in fig. 2, this embodiment further includes:
Step 350, in response to determining that the webpage to be extracted is a non-picture webpage, importing the extracted html features into a pre-trained material webpage identification model.
Here, the "material web page" can be understood as a non-picture type web page having a web page content extraction value.
For example, in some application scenarios, the web page to be extracted is a home page of a certain company, and since the web page includes descriptions of the business scope of the company and provides photos of products of the certain company, it can be considered that the content of the web page to be extracted has a certain extraction value and belongs to the category of the material web page.
In other application scenarios, the web page to be extracted is a navigation page, where the user has to click a hyperlink for a given category to obtain specific information about that category. Because the navigation page itself contains no content of practical value, the content of the web page to be extracted has no extraction value and the page does not belong to the category of material web pages. For example, suppose the web page to be extracted is the home page of a weather forecast website. The home page provides only hyperlinks corresponding to a number of cities, and the user obtains the weather information for a city only by clicking that city's hyperlink to enter a sub-page of the website. Since the home page of the weather forecast website contains no weather information, it is considered to contain no content of practical value; its content has no extraction value, and it does not belong to the category of material web pages.
In addition, in this step, like the pre-trained picture web page recognition model, the pre-trained material web page recognition model may also be a support vector machine (SVM) classification model. For training, html file data of a plurality of web pages can first be collected, each web page labeled as a material web page or not, and the html features of the web pages extracted. The labels and the html features of these web pages are then input as labeled data into the SVM classification model, and the model is trained iteratively; when the recall of the model reaches a preset recall threshold, training can be considered complete.
When training the material web page recognition model, the web pages can be labeled manually, that is, each page is marked as either a material web page or not a material web page. During training, the model can thus keep learning the html features of material web pages and keep adjusting its output until the preset recall threshold is finally reached.
Step 360, in response to determining that the webpage to be extracted is a material webpage, extracting the pictures and text in the webpage to be extracted.
If the web page to be extracted is judged to be the material web page through the pre-trained material web page recognition model in step 350, the picture and the text therein can be extracted as the web page content of the web page.
On the other hand, if the pre-trained material web page recognition model in step 350 determines that the web page to be extracted is not a material web page, it may indicate that the web page to be extracted does not include valuable content (for example, the web page to be extracted is a navigation page as described above), and at this time, the web page to be extracted may not be subjected to any content extraction operation any more to avoid waste of computing resources and/or network resources.
In this step, if the web page to be extracted is a material web page, the pictures and text in it may be extracted using an existing web page content analysis algorithm (e.g., the Readability algorithm).
Before extracting web page content, the method of this embodiment first determines, based on the html features of the web page to be extracted, whether it is a picture web page, and extracts the pictures in it as the content extraction result only when it is. This avoids the incomplete extraction that would result from extracting only the pictures of a non-picture page (for example, a mixed picture-and-text page whose body contains a certain proportion of text in addition to pictures).
In addition, for a non-picture web page, the method of this embodiment imports the html features into the pre-trained material web page recognition model to determine whether the web page to be extracted is a material web page, and extracts both the pictures and the text when it is, which ensures the comprehensiveness of web page content extraction. If, on the other hand, the web page to be extracted is not a material web page, no further content extraction is performed on it, which avoids wasting computing resources and/or network resources.
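The two-stage decision of this embodiment can be summarised in a short sketch. The two recognition models and the two extraction routines are passed in as plain callables; the threshold rules and placeholder extractors below are hypothetical stand-ins for the trained models and real extractors, chosen only so that the example runs.

```python
def extract_web_content(features, is_picture_page, is_material_page,
                        extract_pictures, extract_pictures_and_text):
    """Pick an extraction strategy based on the recognised page type."""
    if is_picture_page(features):
        return {"type": "picture", "content": extract_pictures()}
    if is_material_page(features):
        return {"type": "material", "content": extract_pictures_and_text()}
    # Neither a picture page nor a material page: skip extraction entirely.
    return {"type": "skipped", "content": None}


# Hypothetical stand-ins for the two pre-trained recognition models.
def toy_is_picture_page(features):
    return features["img_ratio"] > 0.5


def toy_is_material_page(features):
    return features["link_ratio"] < 0.5


def run(features):
    return extract_web_content(
        features,
        toy_is_picture_page,
        toy_is_material_page,
        extract_pictures=lambda: ["a.jpg"],
        extract_pictures_and_text=lambda: (["a.jpg"], "body text"),
    )


gallery = run({"img_ratio": 0.8, "link_ratio": 0.1})
article = run({"img_ratio": 0.1, "link_ratio": 0.2})
navigation = run({"img_ratio": 0.0, "link_ratio": 0.9})
```

Passing the extractors as callables means the costly extraction work is only done for the branch that is actually taken, which matches the stated goal of not spending resources on pages without extraction value.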
In some optional implementations of the web page content extraction methods according to the above two embodiments of the present application, the extracting html features of the web page to be extracted from the html tag in steps 220 and 320 may be implemented by a decomposition flow 400 shown in fig. 4.
Specifically, in step 410, the html body tags corresponding to the body of the web page to be extracted are screened out from the html tags.
In some application scenarios, for example, the html tags between the tag <body> and the tag </body> may be taken as the html body tags corresponding to the body of the web page to be extracted.
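The screening in step 410 might look like the following sketch, which keeps only the tags that appear between `<body>` and `</body>`. The class name `BodyTagCollector` and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser


class BodyTagCollector(HTMLParser):
    """Keeps only the tags that occur between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.body_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.body_tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


# Hypothetical page: the <title> and <link> tags sit in the head, not the body.
page = (
    "<html><head><title>t</title><link rel='x'></head>"
    "<body><div><p>a</p><img src='x.png'></div></body></html>"
)
collector = BodyTagCollector()
collector.feed(page)
collector.close()
```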
In step 420, each html body tag of the web page to be extracted is traversed to determine the html features of the web page to be extracted.
In these alternative implementations, the html features may include, for example, at least one of: the proportion of the html body tags of the web page to be extracted whose category is picture tag; the proportion of the html body tags of the web page to be extracted whose category is hyperlink tag; the proportion of the html body tags of the web page to be extracted whose category is form tag; the text density of each html body tag of the web page to be extracted, where the text density is the ratio of the text length contained in that html body tag to the sum of the text lengths contained in all html body tags of the web page to be extracted; and statistical features of the text densities of the html body tags of the web page to be extracted. Here, the statistical features may include, but are not limited to, the mean and the variance of the text densities of the html body tags.
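The features listed above can be computed in one pass over the body tags, as in the sketch below. Several choices here are assumptions rather than details from the source: picture, hyperlink, and form tags are taken to be `<img>`, `<a>`, and `<form>`; the text length of a tag counts only the text directly inside it, with surrounding whitespace stripped; the page is assumed to be well formed; and the names `BodyFeatureParser` and `html_features` are hypothetical.

```python
from html.parser import HTMLParser
from statistics import mean, pvariance

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # tags with no end tag


class BodyFeatureParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_body = False
        self.tag_names = []     # every body tag, in document order
        self.text_lengths = []  # direct text length of the tag at the same index
        self.open_stack = []    # indices of currently open body tags

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
            return
        if not self.in_body:
            return
        self.tag_names.append(tag)
        self.text_lengths.append(0)
        if tag not in VOID_TAGS:
            self.open_stack.append(len(self.tag_names) - 1)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif (self.in_body and self.open_stack
              and self.tag_names[self.open_stack[-1]] == tag):
            self.open_stack.pop()

    def handle_data(self, data):
        # Credit text to the innermost open tag.
        if self.in_body and self.open_stack:
            self.text_lengths[self.open_stack[-1]] += len(data.strip())


def html_features(html_text):
    parser = BodyFeatureParser()
    parser.feed(html_text)
    parser.close()
    total_tags = len(parser.tag_names) or 1
    total_text = sum(parser.text_lengths) or 1
    densities = [length / total_text for length in parser.text_lengths]
    return {
        "img_ratio": parser.tag_names.count("img") / total_tags,
        "link_ratio": parser.tag_names.count("a") / total_tags,
        "form_ratio": parser.tag_names.count("form") / total_tags,
        "densities": densities,
        "density_mean": mean(densities) if densities else 0.0,
        "density_variance": pvariance(densities) if densities else 0.0,
    }


# Hypothetical page used only for illustration.
page = "<html><body><div><p>abcd</p><a href='/n'>xy</a><img src='p.png'></div></body></html>"
features = html_features(page)
```

For this page the body holds four tags (`div`, `p`, `a`, `img`) and six characters of text, so the picture and hyperlink ratios are each one quarter and the per-tag densities add up to one.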
In some alternative implementations, although a plurality of html features of the web page to be extracted are extracted in steps 220 and 320, each recognition model (for example, the pre-trained picture web page recognition model or the pre-trained material web page recognition model) may use only a subset of those html features as the basis for determining the web page type.
For example, in some application scenarios, the pre-trained picture web page recognition model may base its recognition on the following html features: the proportion of the html body tags of the web page to be extracted whose category is picture tag, the proportion whose category is hyperlink tag, the proportion whose category is form tag, and the text density of each html body tag of the web page to be extracted. The pre-trained material web page recognition model may base its recognition on the following html features: the proportion of the html body tags of the web page to be extracted whose category is hyperlink tag, the proportion whose category is form tag, the text density of each html body tag of the web page to be extracted, and the statistical features of those text densities.
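One simple way to give each model its own view of the features is to keep the full feature set in a dictionary and select a named subset per model, as sketched here. The feature names and values are hypothetical; only the grouping follows the example in the text.

```python
# Feature names are made up for this sketch; the grouping follows the text.
PICTURE_MODEL_FEATURES = ["img_ratio", "link_ratio", "form_ratio", "densities"]
MATERIAL_MODEL_FEATURES = ["link_ratio", "form_ratio", "densities", "density_stats"]


def select_features(all_features, wanted):
    """Return only the features that a given recognition model relies on."""
    return {name: all_features[name] for name in wanted}


# Made-up feature values for one page.
all_features = {
    "img_ratio": 0.25,
    "link_ratio": 0.25,
    "form_ratio": 0.0,
    "densities": [0.0, 0.5, 0.5, 0.0],
    "density_stats": {"mean": 0.25, "variance": 0.0625},
}
picture_input = select_features(all_features, PICTURE_MODEL_FEATURES)
material_input = select_features(all_features, MATERIAL_MODEL_FEATURES)
```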
In addition, in some optional implementation manners of the web content extraction methods according to the above two embodiments of the present application, before parsing the web page to be extracted in step 210 and step 310 to determine the html tag included in the web page to be extracted, the web content extraction method according to the present application may further include:
in response to receiving the Uniform Resource Locator (URL) of a web page, parsing out each web page that belongs to the same website as that web page, to serve as the web pages to be extracted.
In this way, the user only needs to input the URL of one web page belonging to a website; the electronic device can then parse out the web pages of that website, take them as the web pages to be extracted, and perform the web page content extraction operation on each of them. This improves the efficiency of web page content extraction.
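A sketch of how the pages of the same website could be selected from the links found on the input page is shown below. It assumes that "same website" means "same host name", which the source does not define, and it performs no network access; `same_site_urls`, the seed URL, and the link list are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_site_urls(seed_url, hrefs):
    """Resolve links against the seed URL and keep those on the same host."""
    seed_host = urlparse(seed_url).netloc.lower()
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(seed_url, href).split("#", 1)[0]  # drop fragments
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # e.g. mailto: links
        if parsed.netloc.lower() != seed_host:
            continue  # link leaves the website
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical links found on the page the user entered.
seed = "https://example.com/news/index.html"
links = ["/a.html", "b.html", "https://other.org/x",
         "mailto:me@example.com", "/a.html#top"]
pages_to_extract = same_site_urls(seed, links)
```

A crawler would repeat this on each newly found page until no new same-site URLs turn up; the fetching itself is omitted here.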
Referring to fig. 5, a schematic diagram of an application scenario of the web content extraction method of the present application is shown.
In step 510, the user inputs a home page address of a website desired to perform content extraction.
In step 520, the sub-pages of each level associated with the web page are parsed. For example, a background crawler may be utilized to crawl out sub-pages at various levels of the website.
In step 530, html features of the web pages belonging to the website are respectively parsed.
In step 540, the html features of each web page are input into a pre-trained picture web page recognition model for recognition.
In step 550, if sub-page A of the website is a picture webpage, the pictures in sub-page A and the tag corresponding to each picture are extracted.
In step 560, if the sub-page B of the website does not belong to the picture web page, the sub-page is further input into a pre-trained material web page recognition model for recognition.
In step 570, if sub-page B is a material webpage, the readability algorithm is run on sub-page B to extract the pictures and the text in sub-page B.
As can be seen from the above description of steps 510 to 570, the web content extraction method of the present application can extract web content for different types of web pages by using different extraction strategies, thereby improving accuracy and comprehensiveness of web content extraction.
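The branching in steps 540 to 570 can be sketched as a two-stage decision. In this sketch the two recognition models are passed in as plain functions; the stub models, their thresholds, the feature names, and the "no_extraction" fallback are all assumptions made for illustration, since the text does not describe the model internals or what happens to a page that is neither a picture webpage nor a material webpage.

```python
from typing import Callable, Dict

Features = Dict[str, float]
Classifier = Callable[[Features], bool]


def choose_strategy(features: Features,
                    is_picture_page: Classifier,
                    is_material_page: Classifier) -> str:
    """Pick an extraction strategy for one page from its html features."""
    # Steps 540/550: picture webpages get pictures plus their tags.
    if is_picture_page(features):
        return "pictures_and_labels"
    # Steps 560/570: non-picture pages go to the material model.
    if is_material_page(features):
        return "pictures_and_text"
    # Assumption: the source does not cover this case.
    return "no_extraction"


# Hypothetical stand-ins for the pre-trained models.
def stub_picture_model(features: Features) -> bool:
    return features.get("picture_ratio", 0.0) > 0.5


def stub_material_model(features: Features) -> bool:
    return features.get("hyperlink_ratio", 1.0) < 0.5
```

Keeping the models as parameters means the same dispatch logic works whichever classifiers are actually trained.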
With further reference to fig. 6, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of a web content extracting apparatus, which corresponds to the method embodiment shown in fig. 2 or fig. 3, and which can be applied in various electronic devices.
The web content extraction device of the present embodiment includes a parsing module 610, a feature extraction module 620, a picture web page recognition module 630 and a picture content feature extraction module 640.
The parsing module 610 may be configured to parse the web page to be extracted to determine html tags included in the web page to be extracted.
The feature extraction module 620 may be configured to extract html features of a webpage to be extracted from the html tag.
The picture web page recognition module 630 may be configured to import the extracted html features into a pre-trained picture web page recognition model.
The picture content feature extraction module 640 may be configured to, in response to determining that the web page to be extracted is a picture web page, extract a picture in the web page to be extracted and an html tag corresponding to the picture.
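What the picture content feature extraction module does for a picture webpage might look like the sketch below, written with Python's standard html.parser. It assumes that the "html tag corresponding to the picture" can be approximated by the img element's own src and alt attributes; the text does not define this precisely, and the names PictureCollector and extract_pictures are hypothetical.

```python
from html.parser import HTMLParser


class PictureCollector(HTMLParser):
    """Collect the src and alt attributes of every img element."""

    def __init__(self):
        super().__init__()
        self.pictures = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> via handle_startendtag.
        if tag == "img":
            attr_map = dict(attrs)
            self.pictures.append(
                {"src": attr_map.get("src"), "alt": attr_map.get("alt")}
            )


def extract_pictures(html):
    """Return one {"src", "alt"} record per picture in the html string."""
    collector = PictureCollector()
    collector.feed(html)
    collector.close()
    return collector.pictures
```

A missing alt attribute is reported as None rather than an empty string, so callers can tell "no label" apart from "empty label".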
In some optional implementations, the web page content extracting apparatus of this embodiment may further include a material web page identification module 650 and a material content feature extraction module 660.
The material web page identification module 650 may be configured to, in response to determining that the web page to be extracted is a non-picture web page, import the extracted html features into a pre-trained material web page identification model.
The material content feature extraction module 660 may be configured to extract pictures and text in the web page to be extracted in response to determining that the web page to be extracted is a material web page.
In some optional implementations, the feature extraction module 620 may be further configured to: screening out html text tags corresponding to texts of the webpage to be extracted from the html tags; and traversing each html text label of the webpage to be extracted to determine the html characteristics of the webpage to be extracted.
In some optional implementations, the html features include at least one of: the proportion of html text tags whose category is picture tag among the html text tags of the webpage to be extracted; the proportion of html text tags whose category is hyperlink tag among the html text tags of the webpage to be extracted; the proportion of html text tags whose category is form tag among the html text tags of the webpage to be extracted; and the text density of each html text tag of the webpage to be extracted, where the text density of an html text tag is the ratio of the text length contained in that html text tag to the sum of the text lengths contained in all html text tags of the webpage to be extracted.
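The ratios and the text density defined above can be computed as in the following minimal sketch. It assumes the html text tags have already been screened out and reduced to (category, text length) pairs; the category names "picture", "hyperlink", and "form" and the function name html_features are illustrative assumptions.

```python
from collections import Counter


def html_features(text_tags):
    """Compute tag-category ratios and per-tag text densities.

    text_tags: list of (category, text_length) pairs, one per html
    text tag of the page (an assumed input representation).
    """
    total_tags = len(text_tags)
    total_text = sum(length for _, length in text_tags)
    counts = Counter(category for category, _ in text_tags)

    def ratio(category):
        # Share of text tags in this category; 0.0 for an empty page.
        return counts[category] / total_tags if total_tags else 0.0

    # Text density: this tag's text length over the total text length.
    densities = [
        length / total_text if total_text else 0.0
        for _, length in text_tags
    ]
    return {
        "picture_ratio": ratio("picture"),
        "hyperlink_ratio": ratio("hyperlink"),
        "form_ratio": ratio("form"),
        "text_densities": densities,
    }
```

By construction the text densities of a page with any text sum to 1, so they describe how the text is distributed across tags rather than how much text there is.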
In some optional implementations, before parsing the web page to be extracted to determine the first html tag included in the web page to be extracted, the parsing module 610 may be further configured to: in response to receiving the uniform resource locator (URL) of a webpage, parse each webpage belonging to the same website as the webpage to be extracted.
Those skilled in the art will appreciate that the web content extraction device 600 described above may also include some other well-known structures, such as a processor, memory, etc., which are not shown in fig. 6 in order to not unnecessarily obscure embodiments of the present disclosure.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing a terminal device or server of an embodiment of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is installed into the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a parsing module, a feature extraction module, a picture webpage recognition module and a picture content feature extraction module. In some cases the names of these modules do not limit the modules themselves; for example, the parsing module may also be described as a module for parsing the web page to be extracted to determine the html tag contained in the web page to be extracted.
As another aspect, the present application also provides a non-volatile computer storage medium, which may be the non-volatile computer storage medium included in the apparatus in the above-described embodiments; or it may be a non-volatile computer storage medium that exists separately and is not incorporated into the terminal. The non-volatile computer storage medium stores one or more programs that, when executed by a device, cause the device to: analyzing the webpage to be extracted to determine a hypertext markup language html tag contained in the webpage to be extracted; extracting html features of a webpage to be extracted from the html tag; importing the extracted html features into a pre-trained picture webpage recognition model; and in response to the fact that the webpage to be extracted is determined to be the picture webpage, extracting the picture in the webpage to be extracted and the html tag corresponding to the picture.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (8)

1. A method for extracting web page content, comprising:
analyzing a webpage to be extracted to determine an html tag contained in the webpage to be extracted;
extracting html features of the webpage to be extracted from the html tag, wherein the extracting comprises: screening out, from the html tags, the html text tags corresponding to the text of the webpage to be extracted; and traversing each html text tag of the webpage to be extracted to determine the html features of the webpage to be extracted, wherein the html features include: the proportion of html text tags whose category is picture tag among the html text tags of the webpage to be extracted, and the proportion of html text tags whose category is hyperlink tag among the html text tags of the webpage to be extracted;
importing the extracted html features into a pre-trained picture webpage recognition model; and
in response to the fact that the webpage to be extracted is determined to be a picture webpage, pictures in the webpage to be extracted and html tags corresponding to the pictures are extracted;
the method further comprises: in response to determining that the webpage to be extracted is a non-picture webpage, importing the extracted html features into a pre-trained material webpage identification model; and in response to determining that the webpage to be extracted is a material webpage, extracting pictures and text in the webpage to be extracted.
2. The method of claim 1, wherein the html features further comprise at least one of:
the proportion of html text tags whose category is form tag among the html text tags of the webpage to be extracted;
the text density of each html text tag of the webpage to be extracted, the text density being the ratio of the text length contained in that html text tag to the sum of the text lengths contained in all html text tags of the webpage to be extracted; and
the statistical characteristics of the text density of each html text tag of the webpage to be extracted.
3. The method according to any one of claims 1-2, wherein before the parsing the web page to be extracted to determine the first html tag contained in the web page to be extracted, the method further comprises:
in response to receiving the uniform resource locator (URL) of a webpage, parsing each webpage belonging to the same website as the webpage to be extracted.
4. A web content extraction apparatus, comprising:
the analysis module is configured for analyzing the webpage to be extracted to determine the html tag contained in the webpage to be extracted;
the feature extraction module is configured to extract html features of the webpage to be extracted from the html tag, wherein the html features include: the proportion of html text tags whose category is picture tag among the html text tags of the webpage to be extracted, and the proportion of html text tags whose category is hyperlink tag among the html text tags of the webpage to be extracted;
the picture webpage recognition module is configured for importing the extracted html features into a pre-trained picture webpage recognition model; and
the picture content feature extraction module is configured to, in response to determining that the webpage to be extracted is a picture webpage, extract the pictures in the webpage to be extracted and the html tags corresponding to the pictures;
the feature extraction module is further configured to: screening out html text tags corresponding to the texts of the web pages to be extracted from the html tags; traversing each html text label of the webpage to be extracted to determine html characteristics of the webpage to be extracted;
the device further comprises: a material webpage identification module configured to, in response to determining that the webpage to be extracted is a non-picture webpage, import the extracted html features into a pre-trained material webpage identification model; and a material content feature extraction module configured to, in response to determining that the webpage to be extracted is a material webpage, extract pictures and text in the webpage to be extracted.
5. The apparatus of claim 4, wherein the html features further comprise at least one of:
the proportion of html text tags whose category is form tag among the html text tags of the webpage to be extracted;
the text density of each html text tag of the webpage to be extracted, the text density being the ratio of the text length contained in that html text tag to the sum of the text lengths contained in all html text tags of the webpage to be extracted; and
the statistical characteristics of the text density of each html text tag of the webpage to be extracted.
6. The apparatus according to any one of claims 4 to 5, wherein the parsing module, before parsing the web page to be extracted to determine the first html tag included in the web page to be extracted, is further configured to:
in response to receiving the uniform resource locator (URL) of a webpage, parse each webpage belonging to the same website as the webpage to be extracted.
7. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3.
8. A computer-readable storage medium having stored thereon a computer program, characterized in that:
the program, when executed by a processor, implements the method as claimed in any one of claims 1-3.
CN201710418653.5A 2017-06-06 2017-06-06 Webpage content extraction method and device Active CN107153716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710418653.5A CN107153716B (en) 2017-06-06 2017-06-06 Webpage content extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710418653.5A CN107153716B (en) 2017-06-06 2017-06-06 Webpage content extraction method and device

Publications (2)

Publication Number Publication Date
CN107153716A CN107153716A (en) 2017-09-12
CN107153716B true CN107153716B (en) 2021-01-01

Family

ID=59795886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710418653.5A Active CN107153716B (en) 2017-06-06 2017-06-06 Webpage content extraction method and device

Country Status (1)

Country Link
CN (1) CN107153716B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020296A (en) * 2017-10-31 2019-07-16 北京国双科技有限公司 A kind of method and device for extracting news web page text
WO2019090738A1 (en) * 2017-11-10 2019-05-16 深圳市华阅文化传媒有限公司 Method and device for purifying web fiction page
CN110309457B (en) * 2018-03-21 2023-06-16 腾讯科技(深圳)有限公司 Webpage data processing method, device, computer equipment and storage medium
CN109344884B (en) * 2018-09-14 2023-09-12 深圳市雅阅科技有限公司 Media information classification method, method and device for training picture classification model
CN111898058A (en) * 2019-05-06 2020-11-06 阿里巴巴集团控股有限公司 Personal homepage information processing method and device and electronic equipment
CN110727820B (en) * 2019-10-22 2022-11-04 杭州数澜科技有限公司 Method and system for obtaining label for picture
CN111061955B (en) * 2019-12-20 2023-11-07 深圳市朱墨科技有限公司 Webpage text extraction method and device, server and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314497A (en) * 2011-08-26 2012-01-11 百度在线网络技术(北京)有限公司 Method and equipment for identifying body contents of markup language files
CN106326451A (en) * 2016-08-26 2017-01-11 武汉大学 Method for judging webpage sensing information block based on visual feature extraction

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090248707A1 (en) * 2008-03-25 2009-10-01 Yahoo! Inc. Site-specific information-type detection methods and systems
US20120260160A1 (en) * 2009-12-24 2012-10-11 Samsung Electronics Co., Ltd. Display device for displaying a webpage and display method for same
US9594730B2 (en) * 2010-07-01 2017-03-14 Yahoo! Inc. Annotating HTML segments with functional labels
CN102592067B (en) * 2011-01-17 2014-07-30 腾讯科技(深圳)有限公司 Webpage recognition method, device and system
CN102184189B (en) * 2011-04-18 2012-11-28 北京理工大学 Webpage core block determining method based on DOM (Document Object Model) node text density
CN102332028B (en) * 2011-10-15 2013-08-28 西安交通大学 Webpage-oriented unhealthy Web content identifying method
CN104809125A (en) * 2014-01-24 2015-07-29 腾讯科技(深圳)有限公司 Method and device for identifying webpage categories

Also Published As

Publication number Publication date
CN107153716A (en) 2017-09-12

Similar Documents

Publication Publication Date Title
CN107153716B (en) Webpage content extraction method and device
CN105677764B (en) Information extraction method and device
CN104766014B (en) For detecting the method and system of malice network address
US11062089B2 (en) Method and apparatus for generating information
CN109145280B (en) Information pushing method and device
CN102200971B (en) Method and equipment for realizing webpage content previewing
CN110765973B (en) Account type identification method and device
CN107241215B (en) User behavior prediction method and device
CN113038153B (en) Financial live broadcast violation detection method, device, equipment and readable storage medium
CN105488205A (en) Page generation method and page generation apparatus
CN107436843A (en) Webpage performance test methods and device
CN109325197B (en) Method and device for extracting information
CN111881398A (en) Page type determination method, device and equipment and computer storage medium
CN113918794B (en) Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium
CN109558123B (en) Method for converting webpage into electronic book, electronic equipment and storage medium
CN102902790A (en) Web page classification system and method
CN107329981B (en) Page detection method and device
JP5216654B2 (en) Importance determination device, importance determination method, and program
US20130230248A1 (en) Ensuring validity of the bookmark reference in a collaborative bookmarking system
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN113806667B (en) Method and system for supporting webpage classification
CN114900492B (en) Abnormal mail detection method, device and system and computer readable storage medium
CN115564000A (en) Two-dimensional code generation method and device, computer equipment and storage medium
CN103823825A (en) Online content collection
CN114239689A (en) Multi-mode-based website type judgment method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant