CN112528205B

CN112528205B - Webpage main body information extraction method and device and storage medium

Info

Publication number: CN112528205B
Application number: CN202011531289.1A
Authority: CN
Inventors: 李玺; 冯凯; 王元卓
Original assignee: Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Current assignee: Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2021-10-29
Anticipated expiration: 2040-12-22
Also published as: CN112528205A

Abstract

The invention provides a webpage main body information extraction method, a device, and a storage medium. The method comprises the following steps: acquiring a webpage source code, namely receiving webpage address information and acquiring the webpage source code of the webpage from the Internet according to the webpage address information, wherein the webpage source code comprises at least one node and each node comprises at least one tag; and extracting main body information, namely traversing each node in the webpage source code, judging from the tags in the node whether the information of the node is main body information, and if so, extracting the information in the node as the main body information. The webpage is first located on the Internet through the webpage address information, and the main body part of the webpage is identified by processing the webpage source code, so that a user browsing the webpage can read the main body part directly. On the one hand, this reduces the network resources occupied by useless information in the non-main-body parts and improves the utilization rate of Internet resources; on the other hand, it improves the efficiency with which the user obtains information.

Description

Webpage main body information extraction method and device and storage medium

Technical Field

The invention relates to the technical field of computer data mining, in particular to a method and a device for extracting webpage main body information and a storage medium.

Background

Early in this century, people obtained outside information mainly through media such as newspapers, radio, and television. With the progress of science and technology, modern people obtain information in more varied ways, and can get it by browsing webpages on the Internet with electronic devices such as mobile phones or computers.

However, current webpages contain a large amount of useless data besides the main body information. A news webpage, for example, carries advertisements in addition to the news itself. Such useless data not only occupies considerable Internet resources but also reduces the efficiency with which users obtain information. How to extract the main body information from a webpage, improve the utilization rate of Internet resources, and improve the efficiency with which users obtain information is therefore a problem that urgently needs to be solved in the prior art.

Therefore, there is a need in the art for a method, an apparatus and a storage medium for extracting webpage main body information.

Accordingly, the present invention is directed to such a system.

Disclosure of Invention

The invention aims to provide a webpage main body information extraction method, which is used for extracting main body information in a webpage, improving the utilization rate of internet resources and improving the efficiency of obtaining information by a user.

The invention provides a webpage main body information extraction method, which comprises the following steps:

acquiring a webpage source code, receiving webpage address information, and acquiring the webpage source code of the webpage in the Internet according to the webpage address information, wherein the webpage source code comprises at least one node which comprises at least one label;

and extracting main information, traversing each node in the webpage source code, judging whether the information of the node is the main information or not according to the label in the node, and if so, extracting the information in the node as the main information.

By adopting the scheme, the webpage is found in the Internet through the webpage address information, the webpage source code is extracted, the main part in the webpage is identified through processing the webpage source code, and when a user browses the webpage, the information of the main part can be directly browsed, so that on one hand, the occupation of network resources by useless information of other non-main parts is reduced, and the utilization rate of the Internet resources is improved; and on the other hand, the efficiency of obtaining information by the user is improved.

Furthermore, the webpage main body information extraction method also comprises user pushing, wherein characters of the main body information in each node in the webpage source code are collected and pushed to the user.

By adopting the scheme, the main information in the webpage is extracted, and the main information in the webpage is directly pushed to the user, so that the user information acquisition efficiency is improved.

Further, the webpage is a webpage from which the main body information is to be extracted, and the webpage address information is an address of the webpage in the internet.

Preferably, the step of acquiring the webpage source code includes:

receiving webpage address information;

receiving a source code acquisition program;

and acquiring the webpage source codes of the webpage by using the source code acquisition program.

By adopting the scheme, the webpage source codes are collected by using the source code collection program, so that the collection efficiency is improved.

Furthermore, the webpage source code can be acquired by firstly analyzing and then manually operating.

Further, the source code capture program may use the python language for web page source code capture.

Preferably, the source code acquisition program requests the webpage source code using the requests library in Python.

Further, the webpage is an HTML page, and the webpage source code is an HTML page source code.

Furthermore, the HTML DOM of the HTML page contains at least one node, and each node contains at least one tag. A node in the webpage source code can be a head node or a step node, and the data in a node can be main body information data, useless advertisement data, and the like. The tag can be a P tag or an a tag in HTML: the P tag is a paragraph tag, which starts its paragraph on a new line, behaves as a box, and can be defined independently; the a tag defines a hyperlink for linking from one page to another.

Furthermore, HTML (HyperText Markup Language) is a markup language comprising a series of tags, through which the document format on the network can be unified so that scattered Internet resources are connected into a logical whole. HTML DOM is an abbreviation of HTML Document Object Model, a document object model specifically adapted to HTML/XHTML. A person familiar with software development can understand the HTML DOM as an API for a webpage: it treats each element in the webpage as an object, so that the elements can be acquired or edited by a programming language.

Preferably, the step of acquiring the webpage source code further includes saving the acquired webpage source code as a document.

Further, the collected webpage source code can be saved as an XML document or an HTML document.

Further, the tag may be a P-tag, and the step of extracting the body information further includes:

node detection, namely receiving the total number of tags, the number of characters, the number of P tags, and the number of characters within the P tags in the node, and obtaining a detection entropy value from these four quantities;

and judging the detection entropy, and judging whether the information of the node is main information according to the detection entropy.

By adopting the scheme, the detection entropy value is obtained according to the total number of the labels, the number of the character strings, the number of the P labels and the number of the character strings of the P labels in the node, wherein the smaller the detection entropy value is, the higher the proportion of the character part in the node is, and the larger the possibility that the information in the node is the main information is.

Preferably, the step of extracting the subject information further includes:

node scoring, namely receiving the detection entropy and obtaining the node scoring according to the detection entropy and the total label number in the node;

and grading judgment, namely receiving a grading threshold, comparing the node grading with the grading threshold, and judging whether the information of the node is main information according to a judgment result.

By adopting the scheme, the node score is obtained according to the detection entropy value obtained by calculation and the total label number in the node, and whether the information of the node is the main information or not is judged by comparing the node score with the score threshold value, so that the accuracy of main information identification is improved.

Further, the node is set as a node A, the total number of labels in the node A is Albs, the number of character strings in the node A is Astrs, the number of P labels in the node A is Aplbs, the number of character strings in the P labels in the node A is Apstrs, and the detection entropy value is sbut;

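the calculation of the detection entropy value is according to the formula:

$$\mathit{sbut} = \frac{\mathit{Astrs} - \mathit{Apstrs}}{\mathit{Albs} - \mathit{Aplbs}}$$

further, setting the node score as scut;

the node score is calculated according to the formula:

$$\mathit{scut} = \mathit{sbut} \cdot \log_{10}\left(\mathit{Albs} \cdot \log_{e} \mathit{sbut}_2\right)$$

where sbut₂ is the detection entropy value of the other tags in the node.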

Further, the tag extraction in the web page source code can be extracted through XPath, which is a language capable of finding information in XML documents. XPath can be used to traverse elements and attributes in XML documents, extracting html source code.

Furthermore, the webpage main body information extraction method further comprises text preprocessing, wherein the text preprocessing comprises the steps of traversing all labels and lines in the webpage source code, and filtering useless labels and lines in the webpage source code through a DOM analyzer.

Further, the webpage main body information extraction method further comprises release time extraction, and the release time extraction step comprises the step of processing the webpage source codes through a regular expression to extract the release time of the webpage.

By adopting the scheme, the publishing time of the webpage is accurately extracted.

Further, a regular expression is a logical formula for operating on character strings, which consist of ordinary characters (e.g., the letters a to z) and special characters (called "metacharacters"). Predefined specific characters and their combinations form a "rule string" that expresses a filtering logic for character strings. In other words, a regular expression is a text pattern that describes one or more strings to be matched when searching text.

Furthermore, the webpage main body information extraction method further comprises title extraction, wherein the title extraction step comprises traversing all tags in the webpage source code, extracting tags of which the tags are of a title type, and determining information in the title tags as title information.

Further, the title tag is a tag type in the HTML, and is used for defining a title of the document.

By adopting the scheme, since the title of a document gives an overview of it, a user can get a preliminary idea of the document's content from the title tag; accurately identifying the title information therefore improves the user's browsing efficiency.

A second aspect of the present invention provides a webpage main body information extraction system, including:

the webpage source code acquisition module is used for receiving webpage address information and acquiring a webpage source code of a webpage in the Internet according to the webpage address information, wherein the webpage source code comprises at least one node, and the node comprises at least one label;

and the main information extraction module traverses each node in the webpage source code and is used for judging whether the information of the node is the main information or not according to the label in the node, and if so, extracting the information in the node as the main information.

Furthermore, the webpage main body information extraction system also comprises a user pushing module, which is used for collecting characters of main body information in each node in the webpage source code and pushing the characters to the user.

Preferably, the webpage source code obtaining module includes:

receiving webpage address information;

receiving a source code acquisition program;

and acquiring the webpage source code of the webpage by using the source code acquisition program.

Further, the tag may be a P-tag, and the subject information extraction module further includes:

the node detection module is used for receiving the total label number, the character string word number, the P label number and the character string word number of the P label in the node and obtaining a detection entropy value according to the total label number, the character string word number, the P label number and the character string word number of the P label in the node;

and the detection entropy judgment module is used for judging whether the information of the node is main information or not according to the detection entropy.

Further, the subject information extraction module further includes:

the node scoring module is used for receiving the detection entropy and obtaining a node score according to the detection entropy and the total label number in the node;

and the grading judgment module is used for receiving the grading threshold value, comparing the node grading with the grading threshold value and judging whether the information of the node is the main information or not according to the judgment result.

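Further, with the node set as a node A, the total number of tags in the node A as Albs, the number of characters in the node A as Astrs, the number of P tags in the node A as Aplbs, the number of characters within the P tags in the node A as Apstrs, and the detection entropy value as sbut, the calculation of the detection entropy value is according to the formula:

$$\mathit{sbut} = \frac{\mathit{Astrs} - \mathit{Apstrs}}{\mathit{Albs} - \mathit{Aplbs}}$$

further, setting the node score as scut, the node score is calculated according to the formula:

$$\mathit{scut} = \mathit{sbut} \cdot \log_{10}\left(\mathit{Albs} \cdot \log_{e} \mathit{sbut}_2\right)$$

where sbut₂ is the detection entropy value of the other tags in the node.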

Furthermore, the webpage main body information extraction system also comprises a text preprocessing module, wherein the text preprocessing module traverses all labels and lines in the webpage source code and is used for filtering useless labels and lines in the webpage source code through a DOM analyzer.

Furthermore, the webpage main body information extraction system also comprises a release time extraction module, and the release time extraction module is used for processing the webpage source codes through a regular expression and extracting the release time of the webpage.

Furthermore, the webpage main body information extraction system further comprises a title extraction module, wherein the title extraction module is used for traversing all the tags in the webpage source codes, extracting the tags with the title types, and determining the information in the title tags as the title information.

A third aspect of the present invention provides a web page main body information extraction apparatus, including a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor implements the web page main body information extraction method when executing the program.

A fourth aspect of the present invention provides a storage medium including one or more programs executable by a processor to perform the above-described web page body information extraction method.

In conclusion, the invention has the following beneficial effects:

1. the webpage main body information extraction method comprises the steps of firstly finding a webpage in the Internet through webpage address information, extracting a webpage source code, identifying a main body part in the webpage through processing the webpage source code, and directly browsing the information of the main body part when a user browses the webpage, so that on one hand, the network resource occupation of useless information of other non-main body parts is reduced, and the utilization rate of the Internet resource is improved; on the other hand, the efficiency of obtaining information by the user is improved;

2. according to the webpage main information extraction method, the main information in the webpage is extracted, and the main information in the webpage is directly pushed to a user, so that the user information acquisition efficiency is improved;

3. the webpage main information extraction method of the invention obtains the detection entropy value according to the total number of tags, the number of characters, the number of P tags, and the number of characters within the P tags in the node, wherein the smaller the detection entropy value is, the higher the proportion of the text part in the node is, and the greater the possibility that the information in the node is the main information;

4. according to the webpage main information extraction method, the node score is obtained according to the detection entropy value obtained through calculation and the total label number in the node, whether the information of the node is main information or not is judged through comparison of the node score and the score threshold value, and the accuracy of main information identification is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of an embodiment of a method for extracting webpage main body information according to the present invention;

FIG. 2 is a flowchart illustrating another embodiment of a method for extracting webpage main body information according to the present invention;

FIG. 3 is a flowchart of an embodiment of the main information extraction step of the present invention;

FIG. 4 is a flowchart illustrating a main information extracting step according to another embodiment of the present invention;

FIG. 5 is a flowchart illustrating a method for extracting webpage main body information according to a third embodiment of the present invention;

FIG. 6 is a flowchart illustrating a fourth embodiment of a method for extracting webpage main body information according to the present invention;

FIG. 7 is a flowchart illustrating steps of a preferred embodiment of a method for extracting webpage body information according to the present invention;

FIG. 8 is a diagram illustrating an embodiment of a system for extracting webpage body information according to the present invention;

FIG. 9 is a diagram illustrating another embodiment of a system for extracting webpage body information according to the present invention;

FIG. 10 is a schematic diagram of a module refinement of the webpage main body information extraction system according to the present invention.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

As shown in fig. 1 and 7, a first aspect of the present invention provides a method for extracting webpage main body information, including the following steps:

S100, acquiring a webpage source code, receiving webpage address information, and acquiring the webpage source code of the webpage in the Internet according to the webpage address information, wherein the webpage source code comprises at least one node which comprises at least one label;

In the specific implementation process, the webpage address information is a web address, for example https://blog.csdn.net/weixin_43582101/article/details/108078003 or http://www.xinhuanet.com/2020-12/14/c_1126858623.htm.

S200, extracting main information, traversing each node in the webpage source code, judging whether the information of the node is the main information or not according to the label in the node, and if so, extracting the information in the node as the main information.

In a specific implementation process, each node in the webpage source code is traversed, and if the information in the node is not the main information, the information in the node does not need to be extracted.

As shown in fig. 2 and 7, in a preferred embodiment of the present invention, the method for extracting webpage main body information further includes S300, collecting characters of the main body information in each node in the webpage source code, and pushing the collected characters to the user.

In a specific implementation process, the webpage source code comprises a node, a label and characters, the main body information is the source code, and the pushing to the user is to push the characters in the main body information to the user.

In a specific implementation process, extracting the text in the main body information may be completed by an Xpath method or a regular expression.

In a specific implementation process, in S300, the user pushing may directly push the main information in the webpage to the user while the user opens the webpage.

In a specific implementation process, the web page may be a web page in websites such as Tencent news, today's headlines, Xinhua networks, people's networks, WeChat articles, microblog articles, blog forums, and the like.

In a specific implementation process, the webpage is a webpage from which main body information is to be extracted, and the webpage address information is an address of the webpage in the internet.

In a specific implementation process, the step of S100, acquiring a webpage source code includes:

receiving webpage address information;

receiving a source code acquisition program;

and acquiring the webpage source code of the webpage by using the source code acquisition program.

In a specific implementation process, the webpage source code can be acquired by firstly analyzing and then manually operating.

In a specific implementation, the source code capture program may use a python language to capture the source code of the web page.

In a preferred embodiment of the invention, the source code acquisition program requests the webpage source code using the requests library in Python.
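As a minimal sketch of step S100, the function below downloads a page and decodes it to text. The description names the requests library; this sketch swaps in the standard-library urllib.request so that it runs without third-party packages. The name fetch_source, the User-Agent header, and the timeout are illustrative assumptions, not part of the described method.

```python
import urllib.request


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Return the decoded source of the page at `url` (hypothetical helper)."""
    # Some sites reject requests without a browser-like User-Agent (assumption).
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header, else UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

With requests installed, the body of the function could be replaced by a call to requests.get(url, timeout=timeout) followed by reading the response's text attribute.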

In a specific implementation process, the webpage is an HTML page, and the webpage source code is an HTML page source code.

In a specific implementation process, the HTML DOM of the HTML page contains at least one node, and each node contains at least one tag. A node in the webpage source code can be a head node or a step node, and the data in a node can be main body information data, useless advertisement data, and the like. The tag can be a P tag or an a tag in HTML: the P tag is a paragraph tag, which starts its paragraph on a new line, behaves as a box, and can be defined independently; the a tag defines a hyperlink for linking from one page to another.

In the specific implementation process, HTML (HyperText Markup Language) is a markup language comprising a series of tags, through which the document format on the network can be unified so that scattered Internet resources are connected into a logical whole. HTML DOM is an abbreviation of HTML Document Object Model, a document object model specifically adapted to HTML/XHTML. A person familiar with software development can understand the HTML DOM as an API for a webpage: it treats each element in the webpage as an object, so that the elements can be acquired or edited by a programming language.

In a preferred embodiment of the present invention, the step of acquiring the webpage source code further includes saving the acquired webpage source code as a document.

In a specific implementation process, the collected webpage source codes can be saved as an XML document or an HTML document.

As shown in fig. 3 and 7, in a specific implementation process, the tag may be a paragraph tag, the paragraph tag is a P tag, and the step of S200 extracting the main body information further includes:

S210, node detection, namely receiving the total number of tags, the number of characters, the number of P tags, and the number of characters within the P tags in the node, and obtaining a detection entropy value from these four quantities;

In the specific implementation process, the character counting can be carried out through JavaScript.
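The four quantities received in S210 can be gathered in one pass over a node's markup. The sketch below is written in Python rather than the JavaScript mentioned above and uses the standard-library html.parser. The class name NodeStats and the counting rule (every opening tag counts once, and only non-whitespace characters count as text) are assumptions, since the description does not spell out how the counts are taken.

```python
from html.parser import HTMLParser


class NodeStats(HTMLParser):
    """Count tags and text characters in one HTML fragment (illustrative)."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = 0      # Albs: total number of opening tags
        self.chars = 0     # Astrs: non-whitespace text characters
        self.p_tags = 0    # Aplbs: number of opening <p> tags
        self.p_chars = 0   # Apstrs: text characters inside <p> tags
        self._p_depth = 0  # > 0 while the parser is inside a <p> element

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "p":
            self.p_tags += 1
            self._p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._p_depth > 0:
            self._p_depth -= 1

    def handle_data(self, data):
        count = len("".join(data.split()))  # ignore whitespace (assumption)
        self.chars += count
        if self._p_depth > 0:
            self.p_chars += count


def node_stats(fragment: str) -> NodeStats:
    stats = NodeStats()
    stats.feed(fragment)
    stats.close()
    return stats


stats = node_stats('<div><p>Hello world</p><a href="#">ad</a><p>Body</p></div>')
print(stats.tags, stats.chars, stats.p_tags, stats.p_chars)
```

The sketch assumes every p element is explicitly closed; a fragment that omits the closing tags would need an HTML-aware tree builder instead.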

And S220, judging a detection entropy value, and judging whether the information of the node is main information or not according to the detection entropy value.

In a specific implementation process, the step of S220, detecting an entropy value and determining further includes:

and receiving an entropy threshold, and judging whether the detection entropy is smaller than the entropy threshold, wherein if yes, the information of the node is main information.

As shown in fig. 4 and 7, in a preferred embodiment of the present invention, the step S200 of extracting the subject information further includes:

S230, node scoring, namely receiving the detection entropy and obtaining the node score according to the detection entropy and the total number of tags in the node;

S240, scoring judgment, namely receiving a scoring threshold, comparing the node score with the scoring threshold, and judging whether the information of the node is main information according to the comparison result.

In a specific implementation process, the scoring and determining step further includes: and receiving the scoring threshold value, and judging whether the node score is greater than the scoring threshold value, wherein if yes, the information of the node is main information.

In the specific implementation process, the node is set as a node A, the total number of labels in the node A is Albs, the number of character strings in the node A is Astrs, the number of P labels in the node A is Aplbs, the number of character strings in the P labels in the node A is Apstrs, and the detection entropy value is sbut;

the calculation of the detection entropy value is according to the formula:

$$\mathit{sbut} = \frac{\mathit{Astrs} - \mathit{Apstrs}}{\mathit{Albs} - \mathit{Aplbs}}$$

In the specific implementation process, for the webpage whose address information is https://blog.csdn.net/weixin_43582101/article/details/108078003, one node of the webpage contains 112 tags in total, 27 of which are P tags, and 2544 characters in total, 2250 of which are inside the P tags; then

$$\mathit{sbut} = \frac{2544 - 2250}{112 - 27} \approx 3.4588$$

In a specific implementation process, the entropy threshold may be 5, and 3.4588 is less than 5, and then the information in the node is the subject information.
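The worked example can be checked with a few lines of Python. This is a sketch only: the function names are hypothetical, the formula is the one implied by the worked example, sbut = (Astrs - Apstrs) / (Albs - Aplbs), and the handling of a node that contains nothing but P tags is an assumption the description does not address.

```python
def detection_entropy(albs: int, astrs: int, aplbs: int, apstrs: int) -> float:
    """sbut = (Astrs - Apstrs) / (Albs - Aplbs)."""
    non_p_tags = albs - aplbs
    if non_p_tags == 0:
        # Assumption: a node made only of P tags has no non-paragraph
        # text per tag, so its entropy is treated as 0.
        return 0.0
    return (astrs - apstrs) / non_p_tags


def is_main_by_entropy(sbut: float, entropy_threshold: float = 5.0) -> bool:
    """S220: the node is main body information when sbut is below the threshold."""
    return sbut < entropy_threshold


sbut = detection_entropy(albs=112, astrs=2544, aplbs=27, apstrs=2250)
print(round(sbut, 4), is_main_by_entropy(sbut))
```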

In a specific implementation process, the node score is set as scut and is calculated according to the formula:

$$\mathit{scut} = \mathit{sbut} \cdot \log_{10}\left(\mathit{Albs} \cdot \log_{e} \mathit{sbut}_2\right)$$

In the specific implementation process, sbut₂ is the detection entropy value of the other tags in the node, and the other tags can be any one of picture tags, table tags, or punctuation tags.

In the specific implementation process, if the other tags are picture tags, the picture tags are set as I tags,

the AIstrs are the number of character string words of the P label in the node A, and the AIlbs is the number of I labels in the node A.

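In the specific implementation process, the calculated value of sbut₂ may be 2.86; with sbut = 3.4588 and Albs = 112, substituting into

$$\mathit{scut} = \mathit{sbut} \cdot \log_{10}\left(\mathit{Albs} \cdot \log_{e} \mathit{sbut}_2\right)$$

gives

$$\mathit{scut} \approx 7.16$$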

In a specific implementation process, the scoring threshold may be 5, and 7.16 > 5, then the information of the node is the subject information.
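The scoring step can be sketched in Python as follows. The formula is read here as scut = sbut × log10(Albs × ln sbut₂); that grouping is an assumption about the printed formula, chosen because it reproduces the stated result of about 7.16 for sbut = 3.4588, Albs = 112, and sbut₂ = 2.86. The function names are hypothetical.

```python
import math


def node_score(sbut: float, albs: int, sbut2: float) -> float:
    """scut = sbut * log10(Albs * ln(sbut2)), under the assumed grouping."""
    # math.log and math.log10 raise ValueError when their argument is not
    # positive, so this only works for sbut2 > 1 (which makes ln(sbut2) > 0).
    return sbut * math.log10(albs * math.log(sbut2))


def is_main_by_score(scut: float, score_threshold: float = 5.0) -> bool:
    """S240: the node is main body information when scut exceeds the threshold."""
    return scut > score_threshold


scut = node_score(sbut=3.4588, albs=112, sbut2=2.86)
print(round(scut, 2), is_main_by_score(scut))
```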

In a specific implementation process, if other tags in the node, such as a picture tag, have a large proportion, the information in the node may also be the main information.

In the specific implementation process, the tags in the webpage source code can be extracted through XPath. XPath is a language for finding information in XML documents; it can be used to traverse the elements and attributes of an XML document and thereby extract content from the HTML source code.
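A short illustration of XPath-style tag extraction, using Python's standard-library xml.etree.ElementTree. This module supports only a subset of XPath and requires well-formed XML, so the sample markup, the id values, and the example.com link below are made up for the sketch; real HTML pages usually need an HTML-aware parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
XHTML = """<html><body>
  <div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <div id="ad"><a href="http://example.com/ad">Buy now</a></div>
</body></html>"""

root = ET.fromstring(XHTML)

# All <p> elements that are direct children of the div with id="content".
paragraphs = [p.text for p in root.findall(".//div[@id='content']/p")]

# The href attribute of every <a> element anywhere in the document.
links = [a.get("href") for a in root.findall(".//a")]

print(paragraphs)
print(links)
```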

As shown in fig. 5 and 7, in a preferred embodiment of the present invention, the method for extracting webpage main body information further includes S110, text preprocessing, where the text preprocessing includes traversing all tags and lines in the webpage source code, and filtering useless tags and lines in the webpage source code through a DOM parser.

In a specific implementation process, the useless tags can be advertisement tags, related push tags and the like; the useless lines may be comments in the web page.
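The preprocessing step S110 can be sketched with Python's standard-library html.parser standing in for the DOM parser. The choice of script and style as the tags to drop is an assumption made for illustration; the description names advertisement and related-push tags, which have no standard tag name and would need site-specific rules. The names SourceCleaner and clean_source are hypothetical.

```python
from html.parser import HTMLParser


class SourceCleaner(HTMLParser):
    """Re-emit HTML while dropping comments and the listed tags."""

    def __init__(self, drop_tags=("script", "style")):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.drop_tags = set(drop_tags)  # hypothetical "useless tag" list
        self.out = []
        self._skip_depth = 0             # > 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.drop_tags:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if tag not in self.drop_tags and self._skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.drop_tags:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self._skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_depth == 0:
            self.out.append(f"&#{name};")

    # handle_comment is deliberately not overridden: the base class
    # ignores comments, so they never reach the output.


def clean_source(source: str) -> str:
    cleaner = SourceCleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.out)
```

A dropped tag that is never closed would swallow the rest of the page in this sketch, so it suits well-formed input only.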

As shown in fig. 6 and 7, in a preferred embodiment of the present invention, the method for extracting webpage main body information further includes S400, and extracting release time, where the step of extracting release time includes processing the webpage source code through a regular expression to extract the release time of the webpage.

In the specific implementation process, the release time extracted for the webpage whose address information is https://blog.csdn.net/weixin_43582101/article/details/108078003 is August 18, 2020.

In one implementation, a regular expression is a logical formula for operating on character strings, which consist of ordinary characters (e.g., the letters a to z) and special characters (called "metacharacters"). Predefined specific characters and their combinations form a "rule string" that expresses a filtering logic for character strings. In other words, a regular expression is a text pattern that describes one or more strings to be matched when searching text.
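As an illustration of release time extraction with a regular expression, the sketch below returns the first valid calendar date found in the page source. The pattern is an assumption (the description does not give the expression actually used); it accepts year-month-day dates separated by "-", "/", "." or the Chinese date characters, and the sample strings are made up.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern: 4-digit year, then month and day with common separators.
DATE_PATTERN = re.compile(r"(\d{4})[-/年.](\d{1,2})[-/月.](\d{1,2})")


def extract_release_date(source: str) -> Optional[date]:
    """Return the first valid date found in `source`, or None."""
    for match in DATE_PATTERN.finditer(source):
        year, month, day = (int(part) for part in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            continue  # e.g. month 13: not a real date, keep searching
    return None


print(extract_release_date('<span class="time">2020-08-18 10:22:31</span>'))
```

Taking the first match is a simplification; a page can contain several dates, and choosing among them would need extra rules.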

As shown in fig. 6 and 7, in a preferred embodiment of the present invention, the method for extracting webpage main body information further includes S500, and title extraction, where the title extraction includes traversing all tags in the webpage source code, extracting tags whose tags are of a title type, and determining information in the title tags as title information.

In a specific implementation process, the title extracted for the webpage whose webpage address information is https://blog.csdn.net/weixin_43582101/article/details/108078003 is "ARM assembly basic knowledge".

In a specific implementation process, the title tag is a tag type in HTML and is used for defining a title of a document.
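The title extraction step can be sketched as follows, assuming that "tags of the title type" refers to the HTML `title` element. The class name `TitleExtractor` and the function name `extract_title` are hypothetical and are not taken from the embodiment.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of the first <title> element in the source."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(source: str) -> str:
    parser = TitleExtractor()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts).strip()
```

If the source contains no `title` element, the sketch returns an empty string.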

In the specific implementation process, the extracted title information and the release time are both pushed to the user.

In a specific implementation process, the method can be used to rapidly extract information from a large number of webpages. A traditional public opinion crawler usually requires development engineers to crawl data from hundreds or thousands of news sites, and, when implemented in the traditional way, each site needs to be configured with a large number of HTML page parsing rules.

In a specific implementation process, the method can parse the news data of all such sites automatically, which saves a large amount of time and labor cost.

As shown in fig. 8, a second aspect of the present invention provides a webpage main body information extraction system, including:

a web page source code obtaining module 100, configured to receive web page address information, and obtain a web page source code of a web page in the internet according to the web page address information, where the web page source code includes at least one node, and the node includes at least one tag;

and a main body information extraction module 200, configured to traverse each node in the webpage source code, judge whether the information of the node is main body information according to the tag in the node, and if so, extract the information in the node as main body information.

As shown in fig. 9, in a preferred embodiment of the present invention, the webpage main body information extracting system further includes a user pushing module 300, configured to collect characters of main body information in each node in the webpage source code, and push the collected characters to a user.

In a specific implementation process, the webpage source code obtaining module 100 includes:

receiving webpage address information;

receiving a source code acquisition program;
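A minimal sketch of what the webpage source code obtaining module 100 might do is given below, assuming the Python standard library is used for the request. The function name `fetch_source`, the `User-Agent` value and the UTF-8 fallback are illustrative assumptions; the embodiment does not specify the source code acquisition program.

```python
from urllib.request import Request, urlopen


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Fetch the webpage source code for the given webpage address."""
    # Assumed header: some sites reject requests without a browser-like agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be handed to the preprocessing and main body extraction steps.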

As shown in fig. 10, in a specific implementation process, the tag may be a P-tag, and the main body information extraction module 200 further includes:

a node detection module 210, configured to receive the total tag number, the character string word number, the P tag number, and the character string word number of the P tag in the node, and obtain a detection entropy value according to the total tag number, the character string word number, the P tag number, and the character string word number of the P tag in the node;

and a detection entropy determining module 220, configured to determine whether the information of the node is main information according to the detection entropy.
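The four quantities received by the node detection module 210 can be gathered as in the sketch below. It assumes a well-formed node that `xml.etree.ElementTree` can parse, counts only descendant tags (not the node's own tag), and counts non-whitespace characters as the "character string word number"; all three are assumptions. Because the detection entropy formula is not reproduced in this text, it is passed in as the hypothetical parameter `entropy_fn` rather than implemented. The comparison against the entropy threshold follows claim 1.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable


@dataclass
class NodeStats:
    albs: int    # total tag number in the node
    astrs: int   # character string word number in the node
    aplbs: int   # P tag number in the node
    apstrs: int  # character string word number of the P tags in the node


def char_count(element: ET.Element) -> int:
    """Count non-whitespace characters in the element and its descendants."""
    return sum(len("".join(text.split())) for text in element.itertext())


def collect_stats(node: ET.Element) -> NodeStats:
    descendants = list(node.iter())[1:]  # iter() yields the node itself first
    p_tags = [element for element in descendants if element.tag == "p"]
    return NodeStats(
        albs=len(descendants),
        astrs=char_count(node),
        aplbs=len(p_tags),
        apstrs=sum(char_count(p) for p in p_tags),
    )


def is_main_body(
    node: ET.Element,
    entropy_fn: Callable[[NodeStats], float],  # stand-in for the formula
    entropy_threshold: float,
) -> bool:
    """Judge the node as main body when its entropy is below the threshold."""
    sbut = entropy_fn(collect_stats(node))
    return sbut < entropy_threshold


node = ET.fromstring("<div><p>Hello world</p><span>ad</span><p>Bye</p></div>")
stats = collect_stats(node)  # albs=3, astrs=15, aplbs=2, apstrs=13
```

Keeping the entropy function as a parameter lets the counting logic be checked independently of the formula.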

In a specific implementation process, the main body information extraction module 200 further includes:

a node scoring module 230, configured to receive the detection entropy, and obtain a node score according to the detection entropy and a total number of tags in the node;

and a scoring judgment module 240, configured to receive a scoring threshold, compare the node score with the scoring threshold, and judge whether the information of the node is main body information according to the comparison result.

The detection entropy value is calculated according to the formula:

In a specific implementation process, the node score is set as scut;

In a preferred embodiment of the present invention, the webpage main body information extraction system further includes a text preprocessing module 110, configured to traverse all tags and lines in the webpage source code and filter useless tags and lines in the webpage source code through a DOM parser.

In a preferred embodiment of the present invention, the webpage main body information extraction system further includes a release time extraction module 400, configured to process the webpage source code through a regular expression and extract the release time of the webpage.

In a preferred embodiment of the present invention, the webpage main body information extraction system further includes a title extraction module 500, where the title extraction module is configured to traverse all tags in the webpage source code, extract tags of a title type, and determine information in the title tags as title information.

It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the protection scope of the claims of the present invention.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

It should be understood that the technical problems can be solved by combining the features of the embodiments and of the claims.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A webpage main body information extraction method is characterized by comprising the following steps:

extracting main information, traversing each node in the webpage source code, judging whether the information of the node is the main information according to the label in the node, and if so, extracting the information in the node as the main information;

the label is a P label, and the step of extracting the body information further includes:

judging the detection entropy, receiving an entropy threshold, judging whether the detection entropy is smaller than the entropy threshold, if so, the information of the node is main information;

setting the node as a node A, wherein the total label number in the node A is Albs, the character string word number in the node A is Astrs, the P label number in the node A is Aplbs, the character string word number of the P label in the node A is Apstrs, and the detection entropy value is sbut;

the calculation of the detection entropy value is according to the formula:

2. The web page main body information extraction method according to claim 1, characterized in that: the webpage main body information extraction method also comprises user pushing, wherein the user pushing is used for collecting characters of main body information in each node in the webpage source code and pushing the characters to the user.

3. The web page main body information extraction method according to claim 2, characterized in that: the step of extracting the subject information further comprises:

4. The web page main body information extraction method according to claim 3, characterized in that: setting the node score as scut;

5. The web page main body information extraction method according to claim 4, characterized in that: the webpage main body information extraction method further comprises text preprocessing, wherein the text preprocessing comprises the steps of traversing all tags and lines in the webpage source code, and filtering useless tags and lines in the webpage source code through a DOM parser.

6. The web page main body information extraction method according to claim 1 or 5, characterized in that: the webpage main body information extraction method further comprises title extraction, wherein the title extraction step comprises traversing all the tags in the webpage source code, extracting the tags of the title type, and determining the information in the title tags as the title information.

7. A web page main body information extraction apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the web page main body information extraction method according to any one of claims 1 to 6 when executing the program.

8. A storage medium comprising one or more programs executable by a processor to perform the web page body information extraction method according to any one of claims 1 to 6.