CN112528205B - Webpage main body information extraction method and device and storage medium - Google Patents

Webpage main body information extraction method and device and storage medium

Info

Publication number
CN112528205B
Authority
CN
China
Prior art keywords
node
webpage
information
main body
source code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011531289.1A
Other languages
Chinese (zh)
Other versions
CN112528205A (en
Inventor
李玺
冯凯
王元卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Original Assignee
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences filed Critical Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority to CN202011531289.1A priority Critical patent/CN112528205B/en
Publication of CN112528205A publication Critical patent/CN112528205A/en
Application granted granted Critical
Publication of CN112528205B publication Critical patent/CN112528205B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a webpage main body information extraction method, a device and a storage medium. The method comprises the following steps: webpage source code acquisition, in which webpage address information is received and the webpage source code of the webpage is acquired from the Internet according to the webpage address information, wherein the webpage source code comprises at least one node and the node comprises at least one tag; and main body information extraction, in which each node in the webpage source code is traversed, whether the information of the node is main body information is judged according to the tags in the node, and if so, the information in the node is extracted as main body information. The webpage is first located on the Internet through the webpage address information, and the main body part of the webpage is identified by processing the webpage source code, so that a user browsing the webpage can browse the information of the main body part directly. On the one hand, this reduces the network resources occupied by useless information in the non-main-body part and improves the utilization of Internet resources; on the other hand, it improves the efficiency with which the user obtains information.

Description

Webpage main body information extraction method and device and storage medium
Technical Field
The invention relates to the technical field of computer data mining, in particular to a method and a device for extracting webpage main body information and a storage medium.
Background
At the beginning of this century, people obtained outside information mainly through media such as newspapers, radio and television. With the progress of science and technology, the ways of obtaining information have diversified, and information can now be obtained by browsing webpages on the Internet with electronic devices such as mobile phones or computers.
However, current webpages contain a large amount of useless data in addition to the main body information. For example, a news webpage contains advertisements in addition to the news itself. This useless data not only occupies considerable Internet resources but also reduces the efficiency with which the user obtains information. How to extract the main body information of a webpage, improve the utilization of Internet resources, and improve the efficiency with which the user obtains information is therefore a problem that urgently needs to be solved in the prior art.
Therefore, there is a need in the art for a method, an apparatus and a storage medium for extracting webpage main body information.
Accordingly, the present invention is directed to such a system.
Disclosure of Invention
The invention aims to provide a webpage main body information extraction method for extracting the main body information of a webpage, improving the utilization of Internet resources, and improving the efficiency with which a user obtains information.
The invention provides a webpage main body information extraction method, which comprises the following steps:
webpage source code acquisition: receiving webpage address information, and acquiring the webpage source code of the webpage from the Internet according to the webpage address information, wherein the webpage source code comprises at least one node and the node comprises at least one tag;
and main body information extraction: traversing each node in the webpage source code, judging whether the information of the node is main body information according to the tags in the node, and if so, extracting the information in the node as main body information.
By adopting the scheme, the webpage is located on the Internet through the webpage address information, the webpage source code is extracted, and the main body part of the webpage is identified by processing the webpage source code, so that a user browsing the webpage can browse the information of the main body part directly. On the one hand, this reduces the network resources occupied by useless information in the non-main-body parts and improves the utilization of Internet resources; on the other hand, it improves the efficiency with which the user obtains information.
Furthermore, the webpage main body information extraction method also comprises user pushing, in which the text of the main body information in each node of the webpage source code is collected and pushed to the user.
By adopting the scheme, the main body information of the webpage is extracted and pushed directly to the user, which improves the efficiency with which the user obtains information.
Further, the webpage is a webpage from which the main body information is to be extracted, and the webpage address information is an address of the webpage in the internet.
Preferably, the step of acquiring the webpage source code includes:
receiving webpage address information;
receiving a source code acquisition program;
and acquiring the webpage source codes of the webpage by using the source code acquisition program.
By adopting the scheme, the webpage source codes are collected by using the source code collection program, so that the collection efficiency is improved.
Furthermore, the webpage source code can also be acquired manually after the webpage has first been analyzed.
Further, the source code acquisition program may be written in Python.
Preferably, the source code acquisition program requests the webpage source code using the requests library in Python.
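The acquisition step can be sketched in a few lines of Python. The sketch below is an illustration only: it uses the standard-library urllib in place of requests so that it has no external dependency, and the helper name fetch_source, the User-Agent header and the timeout value are assumptions that do not come from the patent.

```python
from urllib.request import Request, urlopen


def fetch_source(url, opener=urlopen, timeout=10):
    """Return the decoded source of the page at `url` (hypothetical helper)."""
    # Assumption: a browser-like User-Agent, since some sites reject the default one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With the requests library installed, a comparable call would be requests.get(url, timeout=10).text. The opener parameter exists only so that the function can be exercised without network access.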
Further, the webpage is an HTML page, and the webpage source code is an HTML page source code.
Furthermore, the HTML DOM of the HTML page contains at least one node, and the node contains at least one tag. A node in the webpage source code can be a head node or a step node, and the data in a node can be main body information data, useless advertisement data, and the like. The tag can be a p tag or an a tag in HTML: the p tag is a paragraph tag, which starts a paragraph on its own line, can be used as a box, and can be defined independently; the a tag defines a hyperlink for linking from one page to another.
Furthermore, HTML (HyperText Markup Language) is a markup language comprising a series of tags, through which the format of documents on the network can be unified so that scattered Internet resources are connected into a logical whole. HTML DOM is an abbreviation of HTML Document Object Model, a document object model designed specifically for HTML/XHTML. A person familiar with software development can understand the HTML DOM as an API for a webpage: it treats each element in the webpage as an object, so that the elements can be acquired or edited through a programming language.
Preferably, the step of acquiring the webpage source code further includes saving the acquired webpage source code as a document.
Further, the collected webpage source code can be saved as an XML document or an HTML document.
Further, the tag may be a p tag, and the step of extracting the main body information further includes:
node detection: receiving the total number of tags, the number of text characters, the number of p tags and the number of text characters inside the p tags in the node, and obtaining a detection entropy value from these four quantities;
and detection entropy judgment: judging whether the information of the node is main body information according to the detection entropy value.
By adopting the scheme, the detection entropy value is obtained from the total number of tags, the number of text characters, the number of p tags and the number of text characters inside the p tags in the node. The smaller the detection entropy value, the higher the proportion of text in the node, and the more likely the information in the node is main body information.
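The four quantities used by node detection can be gathered in a single pass over a node's markup. The following is a minimal sketch using Python's standard html.parser; the names NodeCounter and count_node are hypothetical, and two choices are assumptions rather than part of the patent: whitespace is not counted as text, and every p tag is assumed to be explicitly closed.

```python
from html.parser import HTMLParser


class NodeCounter(HTMLParser):
    """Counts tags and text characters in one node's markup (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = 0      # total number of tags
        self.chars = 0     # total number of text characters
        self.p_tags = 0    # number of p tags
        self.p_chars = 0   # text characters inside p tags
        self._p_depth = 0  # greater than 0 while inside a <p> element

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "p":
            self.p_tags += 1
            self._p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._p_depth > 0:
            self._p_depth -= 1

    def handle_data(self, data):
        # Assumption: whitespace does not count as text.
        n = len("".join(data.split()))
        self.chars += n
        if self._p_depth > 0:
            self.p_chars += n


def count_node(markup):
    """Return (tags, chars, p_tags, p_chars) for the given markup."""
    counter = NodeCounter()
    counter.feed(markup)
    counter.close()
    return counter.tags, counter.chars, counter.p_tags, counter.p_chars
```

For example, count_node("<div><p>abc</p><a href='#'>de</a><p>fg hi</p></div>") yields 4 tags, 9 characters, 2 p tags and 7 characters inside p tags.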
Preferably, the step of extracting the main body information further includes:
node scoring: receiving the detection entropy value, and obtaining a node score from the detection entropy value and the total number of tags in the node;
and score judgment: receiving a score threshold, comparing the node score with the score threshold, and judging whether the information of the node is main body information according to the comparison result.
By adopting the scheme, the node score is obtained from the calculated detection entropy value and the total number of tags in the node, and whether the information of the node is main body information is judged by comparing the node score with the score threshold, which improves the accuracy of main body information identification.
Further, let the node be node A, the total number of tags in node A be Albs, the number of text characters in node A be Astrs, the number of p tags in node A be Aplbs, the number of text characters inside the p tags in node A be Apstrs, and the detection entropy value be sbut;
the detection entropy value is calculated according to the formula:
sbut = (Astrs - Apstrs) / (Albs - Aplbs)
Further, let the node score be scut;
the node score is calculated according to the formula: scut = sbut log10(Albs) * loge(sbut^2)
Further, the tags in the webpage source code can be extracted through XPath, a language for locating information in XML documents. XPath can be used to traverse the elements and attributes of an XML document and thereby extract content from the HTML source code.
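As an illustration of XPath-based extraction, the sketch below uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed XML. The sample markup and the id values are invented for the example; a full XPath engine and a tolerant HTML parser would be needed for real pages.

```python
import xml.etree.ElementTree as ET

# Assumed sample: a well-formed XHTML fragment, as when the source is saved as XML.
SAMPLE = """<html><body>
  <div id="nav"><a href="/home">Home</a></div>
  <div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# XPath: every p element directly under the div whose id attribute is "content".
paragraphs = [p.text for p in root.findall(".//div[@id='content']/p")]

# XPath: the href attribute of every a element in the document.
links = [a.get("href") for a in root.findall(".//a")]
```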
Furthermore, the webpage main body information extraction method further comprises text preprocessing, in which all tags and lines in the webpage source code are traversed and useless tags and lines are filtered out by a DOM parser.
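The patent does not say which tags count as useless, so the following sketch of the preprocessing step rests on an assumption: script, style, noscript and iframe elements, HTML comments and whitespace-only lines are treated as useless. It uses the event-based html.parser from the Python standard library rather than a full DOM parser, and the names UselessTagFilter and strip_useless are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these tags never carry main body text; the patent does not list them.
USELESS_TAGS = {"script", "style", "noscript", "iframe"}


class UselessTagFilter(HTMLParser):
    """Re-emits markup while dropping useless elements, comments and blank lines."""

    def __init__(self):
        super().__init__()
        self.kept = []
        self._skip_depth = 0  # greater than 0 while inside a useless element

    def handle_starttag(self, tag, attrs):
        if tag in USELESS_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self.kept.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag not in USELESS_TAGS and self._skip_depth == 0:
            self.kept.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in USELESS_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.kept.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-only text (blank lines) is dropped; comments are ignored by default.
        if self._skip_depth == 0 and data.strip():
            self.kept.append(escape(data, quote=False))


def strip_useless(markup):
    parser = UselessTagFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.kept)
```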
Further, the webpage main body information extraction method further comprises release time extraction, in which the webpage source code is processed with a regular expression to extract the release time of the webpage.
By adopting the scheme, the release time of the webpage is accurately extracted.
Further, a regular expression is a logical formula for operating on character strings. A "regular character string" is composed of predefined specific characters, including ordinary characters (e.g., the letters a to z) and special characters (called "metacharacters"), and combinations thereof, and expresses a filtering logic for character strings. In other words, a regular expression is a text pattern that describes one or more character strings to be matched when searching text.
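A release time extractor along these lines might look as follows. The pattern is an assumption, since the patent does not disclose its regular expression: it accepts a four-digit year, month and day separated by "-", "/" or the Chinese date characters, optionally followed by a clock time. The helper name extract_publish_time is hypothetical.

```python
import re

# Assumed pattern: YYYY-MM-DD (or "/" or 年/月/日 separators), optionally followed by HH:MM[:SS].
PUBLISH_TIME = re.compile(
    r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?"
    r"(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)


def extract_publish_time(source):
    """Return the first date-like string in `source`, or None (hypothetical helper)."""
    match = PUBLISH_TIME.search(source)
    return match.group(0) if match else None
```

Taking the first match is a simplification: a page may contain other dates before the release time, so a practical extractor would likely restrict the search to the region near the title or to known time-bearing elements.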
Furthermore, the webpage main body information extraction method further comprises title extraction, in which all tags in the webpage source code are traversed, the tags of the title type are extracted, and the information in the title tags is determined to be the title information.
Further, the title tag is a tag type in HTML used to define the title of the document.
By adopting the scheme, since the title gives an overview of the document, the user can get a preliminary idea of the document content from the title tag; accurately identifying the title information improves the user's browsing efficiency.
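Title extraction can be sketched with the same standard-library parser: walk the tags, and keep the text found inside the first title tag. The names TitleExtractor and extract_title are hypothetical, and using only the first title tag is an assumption.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Captures the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()


def extract_title(source):
    parser = TitleExtractor()
    parser.feed(source)
    parser.close()
    return parser.title
```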
A second aspect of the present invention provides a webpage main body information extraction system, including:
the webpage source code acquisition module, used for receiving webpage address information and acquiring the webpage source code of the webpage from the Internet according to the webpage address information, wherein the webpage source code comprises at least one node and the node comprises at least one tag;
and the main body information extraction module, used for traversing each node in the webpage source code, judging whether the information of the node is main body information according to the tags in the node, and if so, extracting the information in the node as main body information.
Furthermore, the webpage main body information extraction system also comprises a user pushing module, used for collecting the text of the main body information in each node of the webpage source code and pushing it to the user.
Preferably, the webpage source code obtaining module includes:
receiving webpage address information;
receiving a source code acquisition program;
and acquiring the webpage source codes of the webpage by using the source code acquisition program.
Further, the tag may be a p tag, and the main body information extraction module further includes:
the node detection module, used for receiving the total number of tags, the number of text characters, the number of p tags and the number of text characters inside the p tags in the node, and obtaining a detection entropy value from these four quantities;
and the detection entropy judgment module, used for judging whether the information of the node is main body information according to the detection entropy value.
Further, the main body information extraction module further includes:
the node scoring module, used for receiving the detection entropy value and obtaining a node score from the detection entropy value and the total number of tags in the node;
and the score judgment module, used for receiving the score threshold, comparing the node score with the score threshold, and judging whether the information of the node is main body information according to the comparison result.
Further, let the node be node A, the total number of tags in node A be Albs, the number of text characters in node A be Astrs, the number of p tags in node A be Aplbs, the number of text characters inside the p tags in node A be Apstrs, and the detection entropy value be sbut;
the detection entropy value is calculated according to the formula:
sbut = (Astrs - Apstrs) / (Albs - Aplbs)
Further, let the node score be scut;
the node score is calculated according to the formula: scut = sbut log10(Albs) * loge(sbut^2)
Furthermore, the webpage main body information extraction system also comprises a text preprocessing module, used for traversing all tags and lines in the webpage source code and filtering out useless tags and lines through a DOM parser.
Furthermore, the webpage main body information extraction system also comprises a release time extraction module, used for processing the webpage source code with a regular expression and extracting the release time of the webpage.
Furthermore, the webpage main body information extraction system further comprises a title extraction module, used for traversing all tags in the webpage source code, extracting the tags of the title type, and determining the information in the title tags to be the title information.
A third aspect of the present invention provides a web page main body information extraction apparatus, including a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor implements the web page main body information extraction method when executing the program.
A fourth aspect of the present invention provides a storage medium including one or more programs executable by a processor to perform the above-described web page body information extraction method.
In conclusion, the invention has the following beneficial effects:
1. The webpage main body information extraction method first locates the webpage on the Internet through the webpage address information, extracts the webpage source code, and identifies the main body part of the webpage by processing the webpage source code, so that a user browsing the webpage can browse the information of the main body part directly. On the one hand, this reduces the network resources occupied by useless information in the non-main-body parts and improves the utilization of Internet resources; on the other hand, it improves the efficiency with which the user obtains information;
2. The webpage main body information extraction method extracts the main body information of the webpage and pushes it directly to the user, which improves the efficiency with which the user obtains information;
3. The webpage main body information extraction method obtains the detection entropy value from the total number of tags, the number of text characters, the number of p tags and the number of text characters inside the p tags in the node, where the smaller the detection entropy value, the higher the proportion of text in the node and the more likely the information in the node is main body information;
4. The webpage main body information extraction method obtains the node score from the calculated detection entropy value and the total number of tags in the node, and judges whether the information of the node is main body information by comparing the node score with the score threshold, which improves the accuracy of main body information identification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of an embodiment of a method for extracting webpage main body information according to the present invention;
FIG. 2 is a flowchart illustrating another embodiment of a method for extracting webpage main body information according to the present invention;
FIG. 3 is a flowchart of an embodiment of the main information extraction step of the present invention;
FIG. 4 is a flowchart illustrating a main information extracting step according to another embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for extracting webpage main body information according to a third embodiment of the present invention;
FIG. 6 is a flowchart illustrating a fourth embodiment of a method for extracting webpage main body information according to the present invention;
FIG. 7 is a flowchart illustrating steps of a preferred embodiment of a method for extracting webpage body information according to the present invention;
FIG. 8 is a diagram illustrating an embodiment of a system for extracting webpage body information according to the present invention;
FIG. 9 is a diagram illustrating another embodiment of a system for extracting webpage body information according to the present invention;
FIG. 10 is a schematic diagram of a module refinement of the webpage main body information extraction system according to the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
As shown in fig. 1 and 7, a first aspect of the present invention provides a method for extracting webpage main body information, including the following steps:
S100, webpage source code acquisition: receiving webpage address information, and acquiring the webpage source code of the webpage from the Internet according to the webpage address information, wherein the webpage source code comprises at least one node and the node comprises at least one tag;
In a specific implementation process, the webpage address information is a web address, for example https://blog.csdn.net/weixin_43582101/article/details/108078003 or http://www.xinhuanet.com/2020-12/14/c_1126858623.htm.
S200, main body information extraction: traversing each node in the webpage source code, judging whether the information of the node is main body information according to the tags in the node, and if so, extracting the information in the node as main body information.
In a specific implementation process, each node in the webpage source code is traversed, and if the information in the node is not the main information, the information in the node does not need to be extracted.
By adopting the scheme, the webpage is found in the Internet through the webpage address information, the webpage source code is extracted, the main part in the webpage is identified through processing the webpage source code, and when a user browses the webpage, the information of the main part can be directly browsed, so that on one hand, the occupation of network resources by useless information of other non-main parts is reduced, and the utilization rate of the Internet resources is improved; and on the other hand, the efficiency of obtaining information by the user is improved.
As shown in fig. 2 and 7, in a preferred embodiment of the present invention, the method for extracting webpage main body information further includes S300, user pushing: collecting the text of the main body information in each node of the webpage source code, and pushing the collected text to the user.
In a specific implementation process, the webpage source code comprises nodes, tags and text; the main body information is itself source code, and pushing to the user means pushing the text in the main body information to the user.
In a specific implementation process, the text in the main body information may be extracted by XPath or by a regular expression.
In a specific implementation process, in S300, the main body information of the webpage may be pushed to the user directly at the moment the user opens the webpage.
In a specific implementation process, the webpage may be a webpage from websites such as Tencent News, Toutiao, Xinhuanet, People's Daily Online, WeChat articles, Weibo articles, blogs and forums.
By adopting the scheme, the main information in the webpage is extracted, and the main information in the webpage is directly pushed to the user, so that the user information acquisition efficiency is improved.
In a specific implementation process, the webpage is a webpage from which main body information is to be extracted, and the webpage address information is an address of the webpage in the internet.
In a specific implementation process, the step of S100, acquiring a webpage source code includes:
receiving webpage address information;
receiving a source code acquisition program;
and acquiring the webpage source codes of the webpage by using the source code acquisition program.
By adopting the scheme, the webpage source codes are collected by using the source code collection program, so that the collection efficiency is improved.
In a specific implementation process, the webpage source code can also be acquired manually after the webpage has first been analyzed.
In a specific implementation process, the source code acquisition program may be written in Python.
In a preferred embodiment of the invention, the source code acquisition program requests the webpage source code using the requests library in Python.
In a specific implementation process, the webpage is an HTML page, and the webpage source code is an HTML page source code.
In a specific implementation process, the HTML DOM of the HTML page contains at least one node, and the node contains at least one tag. A node in the webpage source code can be a head node or a step node, and the data in a node can be main body information data, useless advertisement data, and the like. The tag can be a p tag or an a tag in HTML: the p tag is a paragraph tag, which starts a paragraph on its own line, can be used as a box, and can be defined independently; the a tag defines a hyperlink for linking from one page to another.
In a specific implementation process, HTML (HyperText Markup Language) is a markup language comprising a series of tags, through which the format of documents on the network can be unified so that scattered Internet resources are connected into a logical whole. HTML DOM is an abbreviation of HTML Document Object Model, a document object model designed specifically for HTML/XHTML. A person familiar with software development can understand the HTML DOM as an API for a webpage: it treats each element in the webpage as an object, so that the elements can be acquired or edited through a programming language.
In a preferred embodiment of the present invention, the step of acquiring the webpage source code further includes saving the acquired webpage source code as a document.
In a specific implementation process, the collected webpage source codes can be saved as an XML document or an HTML document.
As shown in fig. 3 and 7, in a specific implementation process, the tag may be a paragraph tag, namely the p tag, and the step S200 of extracting the main body information further includes:
S210, node detection: receiving the total number of tags, the number of text characters, the number of p tags and the number of text characters inside the p tags in the node, and obtaining a detection entropy value from these four quantities;
In a specific implementation process, the text characters can be counted with JavaScript.
S220, detection entropy judgment: judging whether the information of the node is main body information according to the detection entropy value.
In a specific implementation process, the step S220 of detection entropy judgment further includes:
receiving an entropy threshold, and judging whether the detection entropy value is smaller than the entropy threshold; if so, the information of the node is main body information.
By adopting the scheme, the detection entropy value is obtained from the total number of tags, the number of text characters, the number of p tags and the number of text characters inside the p tags in the node. The smaller the detection entropy value, the higher the proportion of text in the node, and the more likely the information in the node is main body information.
As shown in fig. 4 and 7, in a preferred embodiment of the present invention, the step S200 of extracting the main body information further includes:
S230, node scoring: receiving the detection entropy value, and obtaining a node score from the detection entropy value and the total number of tags in the node;
S240, score judgment: receiving a score threshold, comparing the node score with the score threshold, and judging whether the information of the node is main body information according to the comparison result.
In a specific implementation process, the score judgment step further includes: receiving the score threshold, and judging whether the node score is greater than the score threshold; if so, the information of the node is main body information.
By adopting the scheme, the node score is obtained from the calculated detection entropy value and the total number of tags in the node, and whether the information of the node is main body information is judged by comparing the node score with the score threshold, which improves the accuracy of main body information identification.
In a specific implementation, let the node be node A, where Albs is the total number of tags in node A, Astrs is the number of characters in node A, Aplbs is the number of P tags in node A, Apstrs is the number of characters inside the P tags of node A, and sbut is the detection entropy value;
the detection entropy value is calculated according to the formula:
$$\mathrm{sbut} = \frac{\mathrm{Astrs} - \mathrm{Apstrs}}{\mathrm{Albs} - \mathrm{Aplbs}}$$
In a specific implementation, for the webpage whose address is https://blog.csdn.net/weixin_43582101/article/details/108078003, one node of the webpage contains 112 tags in total, 27 of which are P tags, and 2544 characters, 2250 of which are inside the P tags, so that
sbut = (2544 - 2250)/(112 - 27) ≈ 3.4588.
In a specific implementation, the entropy threshold may be 5; since 3.4588 < 5, the information in the node is main body information.
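The entropy check can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: it reads the formula as sbut = (Astrs - Apstrs)/(Albs - Aplbs), which is the form the worked numbers follow, and the function names and the handling of a node that consists only of P tags are assumptions not found in the source.

```python
import math


def detection_entropy(albs: int, astrs: int, aplbs: int, apstrs: int) -> float:
    """Return sbut = (Astrs - Apstrs) / (Albs - Aplbs) for one node."""
    non_p_tags = albs - aplbs
    if non_p_tags == 0:
        # Assumption (not in the source): a node made only of P tags gets
        # 0.0 when all of its text is inside them, otherwise infinity.
        return 0.0 if astrs == apstrs else math.inf
    return (astrs - apstrs) / non_p_tags


def is_main_by_entropy(sbut: float, entropy_threshold: float = 5.0) -> bool:
    """A node is treated as main body information when sbut < threshold."""
    return sbut < entropy_threshold


# Counts from the worked example: 112 tags, 2544 characters,
# 27 P tags, 2250 characters inside the P tags.
sbut = detection_entropy(albs=112, astrs=2544, aplbs=27, apstrs=2250)
print(round(sbut, 4), is_main_by_entropy(sbut))  # 3.4588 True
```

The default threshold of 5 is the example value given in the text; it would normally be passed in as a parameter.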
In a specific implementation, let the node score be scut;
the node score is calculated according to the formula:
$$\mathrm{scut} = \mathrm{sbut} \cdot \log_{10}\left(\mathrm{Albs} \cdot \log_{e} \mathrm{sbut}_2\right)$$
In a specific implementation, sbut_2 is the detection entropy value of the other tags in the node, where the other tags can be any one of picture tags, table tags, or punctuation tags.
In a specific implementation, if the other tags are picture tags, denoted I tags, then
Figure BDA0002852181750000101
where AIstrs is the number of characters of the P tags in node A, and AIlbs is the number of I tags in node A.
In a specific implementation, the calculated sbut_2 may be 2.86; with sbut = 3.4588 and Albs = 112, substituting into scut = sbut · log10(Albs · ln sbut_2) gives
$$\mathrm{scut} = 3.4588 \times \log_{10}\left(112 \times \ln 2.86\right) \approx 7.16;$$
In a specific implementation, the score threshold may be 5; since 7.16 > 5, the information of the node is main body information.
In a specific implementation, if other tags in the node, such as picture tags, account for a large proportion of the node, the information in the node may still be main body information.
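The scoring step can be checked numerically with a short Python sketch. It assumes the score formula groups as scut = sbut · log10(Albs · ln sbut_2), which is the grouping that reproduces the worked value of about 7.16 from sbut = 3.4588, Albs = 112, and sbut_2 = 2.86. The function names and the error raised for inputs outside the logarithm's domain are illustrative assumptions.

```python
import math


def node_score(sbut: float, albs: int, sbut2: float) -> float:
    """Return scut = sbut * log10(Albs * ln(sbut2))."""
    inner = albs * math.log(sbut2)  # Albs * ln(sbut_2)
    if inner <= 0:
        # Assumption (not in the source): reject inputs for which the
        # outer logarithm is undefined, i.e. sbut2 <= 1 or albs <= 0.
        raise ValueError("Albs * ln(sbut2) must be positive")
    return sbut * math.log10(inner)


def is_main_by_score(scut: float, score_threshold: float = 5.0) -> bool:
    """A node is treated as main body information when scut > threshold."""
    return scut > score_threshold


scut = node_score(sbut=3.4588, albs=112, sbut2=2.86)
print(round(scut, 2), is_main_by_score(scut))  # 7.16 True
```

Reading the formula instead as a plain product sbut · log10(Albs) · ln(sbut_2) would give roughly 7.45 for the same inputs, which does not match the value stated in the text.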
In a specific implementation, the tags in the webpage source code can be extracted with XPath. XPath is a language for locating information in an XML document; it can be used to traverse the elements and attributes of an XML document and to extract content from HTML source code.
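As a rough illustration of XPath-based counting, the sketch below gathers the four counts used by the detection entropy for one node. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so real-world HTML would first need a tolerant parser. The helper names and the choice to count non-whitespace characters are assumptions, not details from the source.

```python
import xml.etree.ElementTree as ET


def text_length(element: ET.Element) -> int:
    """Count non-whitespace characters in an element and its descendants."""
    return sum(len("".join(chunk.split())) for chunk in element.itertext())


def node_counts(node: ET.Element) -> tuple[int, int, int, int]:
    """Return (Albs, Astrs, Aplbs, Apstrs) for one node."""
    all_tags = node.findall(".//*")  # every descendant element
    p_tags = node.findall(".//p")    # every descendant <p> element
    albs = len(all_tags)
    astrs = text_length(node)
    aplbs = len(p_tags)
    apstrs = sum(text_length(p) for p in p_tags)
    return albs, astrs, aplbs, apstrs


sample = (
    "<div><h1>Title</h1><p>Hello world</p>"
    "<p>More text</p><a href='#'>link</a></div>"
)
root = ET.fromstring(sample)
print(node_counts(root))  # (4, 27, 2, 18)
```

The resulting tuple can be fed directly into the detection entropy formula for the node.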
As shown in Figs. 5 and 7, in a preferred embodiment of the present invention, the webpage main body information extraction method further includes S110, text preprocessing, which includes traversing all tags and lines in the webpage source code and filtering out useless tags and lines through a DOM parser.
In a specific implementation, the useless tags can be advertisement tags, related-recommendation tags, and the like; the useless lines may be comments in the webpage.
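A preprocessing pass of this kind might look like the following sketch. Note that it uses Python's event-based `html.parser` rather than a DOM parser, purely to stay within the standard library. The set of "useless" tags is a hypothetical stand-in for advertisement and recommendation blocks, and the class and function names are illustrative.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical list of "useless" tags; a real deployment would tune this.
USELESS_TAGS = {"script", "style", "iframe", "aside"}
# Void elements never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SourceCleaner(HTMLParser):
    """Re-emit the source while dropping useless elements and comments."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.kept: list[str] = []
        self.skip_depth = 0  # > 0 while inside a useless element

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in USELESS_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        self.kept.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept verbatim when not skipped.
        if not self.skip_depth and tag not in USELESS_TAGS:
            self.kept.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.kept.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.kept.append(escape(data, quote=False))

    # handle_comment is deliberately not overridden: the base class
    # ignores comments, so they are dropped from the output.


def clean_source(source: str) -> str:
    cleaner = SourceCleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.kept)
```

The depth counter assumes that non-void elements inside a skipped block are properly closed; badly nested markup would need a more tolerant strategy.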
As shown in Figs. 6 and 7, in a preferred embodiment of the present invention, the webpage main body information extraction method further includes S400, release time extraction, which includes processing the webpage source code with a regular expression to extract the release time of the webpage.
In a specific implementation, the release time extracted for the webpage whose address is https://blog.csdn.net/weixin_43582101/article/details/108078003 is August 18, 2020.
By adopting the scheme, the publishing time of the webpage is accurately extracted.
In one implementation, a regular expression is a logical formula for operating on character strings, which consist of ordinary characters (e.g., the letters a to z) and special characters (called "metacharacters"). Predefined specific characters and combinations of them form a "rule string" that expresses a filtering logic for character strings. A regular expression is thus a text pattern describing one or more strings to be matched when searching text.
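The text does not give the regular expression itself, so the following is only a sketch of what release time extraction could look like. The pattern, which accepts dates such as 2020-08-18, 2020/08/18, or 2020年8月18日, and the function name are assumptions chosen for illustration.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern: a 4-digit year, then month and day, separated by
# "-", "/", "." or the Chinese markers for year and month.
DATE_PATTERN = re.compile(
    r"(?<!\d)(\d{4})\s*[-/.年]\s*(\d{1,2})\s*[-/.月]\s*(\d{1,2})(?!\d)"
)


def extract_release_time(source: str) -> Optional[date]:
    """Return the first valid calendar date found in the source, if any."""
    for match in DATE_PATTERN.finditer(source):
        year, month, day = (int(group) for group in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            continue  # e.g. month 13 or day 32: keep searching
    return None


print(extract_release_time('<span class="time">2020-08-18 10:21:33</span>'))
# 2020-08-18
```

Taking the first valid date is a simplification; a page may contain several dates, and a production version would probably prefer dates found near the title or in metadata.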
As shown in Figs. 6 and 7, in a preferred embodiment of the present invention, the webpage main body information extraction method further includes S500, title extraction, which includes traversing all tags in the webpage source code, extracting the tags of the title type, and determining the information in the title tags as the title information.
In a specific implementation, the title extracted for the webpage whose address is https://blog.csdn.net/weixin_43582101/article/details/108078003 is "ARM assembly basic knowledge".
In a specific implementation, the title tag is an HTML tag type used to define the title of a document.
With this scheme, the title gives an overview of the document, so that a user can get a preliminary idea of its content from the title tag; identifying the title information accurately improves the efficiency with which the user browses information.
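Title extraction can be sketched with Python's standard `html.parser`. The example below assumes that "tags of the title type" means the HTML `<title>` element and returns the text of the first one; the class and function names are illustrative.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(source: str) -> Optional[str]:
    parser = TitleParser()
    parser.feed(source)
    parser.close()
    # Collapse runs of whitespace so the title is a single clean line.
    title = " ".join("".join(parser.parts).split())
    return title or None
```

For instance, `extract_title("<html><head><title> ARM assembly basics </title></head></html>")` returns `"ARM assembly basics"`, and a page without a title element yields `None`.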
In the specific implementation process, the extracted title information and the release time are both pushed to the user.
In a specific implementation, the method can be used to extract information from a large number of webpages quickly. A traditional public-opinion crawler usually requires development engineers to crawl data from hundreds or thousands of news sites, and when implemented in the traditional way, a large number of HTML page parsing rules must be configured for each site.
In a specific implementation, the method can intelligently parse news data from all sites, saving a great deal of time and labor.
As shown in fig. 8, a second aspect of the present invention provides a webpage main body information extraction system, including:
a web page source code obtaining module 100, configured to receive web page address information, and obtain a web page source code of a web page in the internet according to the web page address information, where the web page source code includes at least one node, and the node includes at least one tag;
and a main body information extraction module 200, configured to traverse each node in the webpage source code, judge from the tags in the node whether the information of the node is main body information, and if so, extract the information in the node as main body information.
As shown in fig. 9, in a preferred embodiment of the present invention, the webpage main body information extracting system further includes a user pushing module 300, configured to collect characters of main body information in each node in the webpage source code, and push the collected characters to a user.
In a specific implementation, the webpage source code obtaining module 100 is configured for:
receiving webpage address information;
receiving a source code acquisition program;
and acquiring the webpage source code of the webpage by using the source code acquisition program.
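The three operations of the obtaining module can be sketched as a function that takes the address and a pluggable fetcher. Here the "source code acquisition program" is modelled as a callable, with a default based on `urllib.request`; the optional saving of the source as an HTML document follows the embodiment described earlier. All names are hypothetical.

```python
from pathlib import Path
from typing import Callable, Optional, Union
from urllib.request import urlopen


def default_fetch(url: str) -> str:
    """Download a page and decode it, falling back to UTF-8."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def acquire_source(
    url: str,
    fetch: Callable[[str], str] = default_fetch,
    save_to: Optional[Union[str, Path]] = None,
) -> str:
    """Obtain the webpage source code and optionally save it as a document."""
    source = fetch(url)  # the "source code acquisition program"
    if save_to is not None:
        Path(save_to).write_text(source, encoding="utf-8")
    return source
```

Passing the fetcher as a parameter keeps the function testable without network access and allows a headless browser or a cached copy to be substituted for pages that need script rendering.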
As shown in Fig. 10, in a specific implementation, the tag may be a P tag, and the main body information extraction module 200 further includes:
a node detection module 210, configured to receive the total number of tags, the number of characters, the number of P tags, and the number of characters inside the P tags in the node, and obtain a detection entropy value from these four counts;
and a detection entropy judgment module 220, configured to judge, according to the detection entropy value, whether the information of the node is main body information.
In a specific implementation, the main body information extraction module 200 further includes:
a node scoring module 230, configured to receive the detection entropy value and obtain a node score from the detection entropy value and the total number of tags in the node;
and a score judgment module 240, configured to receive the score threshold, compare the node score with the score threshold, and judge from the comparison result whether the information of the node is main body information.
In a specific implementation, let the node be node A, where Albs is the total number of tags in node A, Astrs is the number of characters in node A, Aplbs is the number of P tags in node A, Apstrs is the number of characters inside the P tags of node A, and sbut is the detection entropy value;
the detection entropy value is calculated according to the formula:
$$\mathrm{sbut} = \frac{\mathrm{Astrs} - \mathrm{Apstrs}}{\mathrm{Albs} - \mathrm{Aplbs}}$$
In a specific implementation, let the node score be scut;
the node score is calculated according to the formula:
$$\mathrm{scut} = \mathrm{sbut} \cdot \log_{10}\left(\mathrm{Albs} \cdot \log_{e} \mathrm{sbut}_2\right)$$
In a preferred embodiment of the present invention, the webpage main body information extraction system further includes a text preprocessing module 110, configured to traverse all tags and lines in the webpage source code and filter out useless tags and lines through a DOM parser.
In a preferred embodiment of the present invention, the webpage main body information extraction system further includes a release time extraction module 400, configured to process the webpage source code with a regular expression and extract the release time of the webpage.
In a preferred embodiment of the present invention, the webpage main body information extraction system further includes a title extraction module 500, configured to traverse all tags in the webpage source code, extract the tags of the title type, and determine the information in the title tags as the title information.
A third aspect of the present invention provides a web page main body information extraction apparatus, including a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor implements the web page main body information extraction method when executing the program.
A fourth aspect of the present invention provides a storage medium including one or more programs executable by a processor to perform the above-described web page body information extraction method.
It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the protection scope of the claims of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
It should be understood that the technical problems can be solved by combining the features of the embodiments set out in the claims.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A webpage main body information extraction method, characterized by comprising the following steps:
acquiring a webpage source code: receiving webpage address information, and acquiring the webpage source code of the webpage on the Internet according to the webpage address information, wherein the webpage source code comprises at least one node, and the node comprises at least one tag;
extracting main body information: traversing each node in the webpage source code, judging from the tags in the node whether the information of the node is main body information, and if so, extracting the information in the node as main body information;
the tag is a P tag, and the step of extracting the main body information further includes:
node detection: receiving the total number of tags, the number of characters, the number of P tags, and the number of characters inside the P tags in the node, and obtaining a detection entropy value from the total number of tags, the number of characters, the number of P tags, and the number of characters inside the P tags in the node;
detection entropy judgment: receiving an entropy threshold and judging whether the detection entropy value is smaller than the entropy threshold; if so, the information of the node is main body information;
letting the node be node A, wherein Albs is the total number of tags in node A, Astrs is the number of characters in node A, Aplbs is the number of P tags in node A, Apstrs is the number of characters inside the P tags of node A, and sbut is the detection entropy value;
the detection entropy value is calculated according to the formula:
$$\mathrm{sbut} = \frac{\mathrm{Astrs} - \mathrm{Apstrs}}{\mathrm{Albs} - \mathrm{Aplbs}}$$
2. The webpage main body information extraction method according to claim 1, characterized in that: the webpage main body information extraction method further comprises user pushing, wherein the user pushing is used for collecting the characters of the main body information in each node in the webpage source code and pushing them to the user.
3. The webpage main body information extraction method according to claim 2, characterized in that: the step of extracting the main body information further comprises:
node scoring: receiving the detection entropy value and obtaining a node score from the detection entropy value and the total number of tags in the node;
and score judgment: receiving a score threshold, comparing the node score with the score threshold, and judging from the comparison result whether the information of the node is main body information.
4. The webpage main body information extraction method according to claim 3, characterized in that: the node score is denoted scut;
the node score is calculated according to the formula:
$$\mathrm{scut} = \mathrm{sbut} \cdot \log_{10}\left(\mathrm{Albs} \cdot \log_{e} \mathrm{sbut}_2\right)$$
5. The webpage main body information extraction method according to claim 4, characterized in that: the webpage main body information extraction method further comprises text preprocessing, wherein the text preprocessing comprises traversing all tags and lines in the webpage source code, and filtering out useless tags and lines in the webpage source code through a DOM parser.
6. The webpage main body information extraction method according to claim 1 or 5, characterized in that: the webpage main body information extraction method further comprises title extraction, wherein the title extraction step comprises traversing all tags in the webpage source code, extracting the tags of the title type, and determining the information in the title tags as the title information.
7. A web page main body information extraction apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the web page main body information extraction method according to any one of claims 1 to 6 when executing the program.
8. A storage medium comprising one or more programs executable by a processor to perform the web page body information extraction method according to any one of claims 1 to 6.
CN202011531289.1A 2020-12-22 2020-12-22 Webpage main body information extraction method and device and storage medium Active CN112528205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011531289.1A CN112528205B (en) 2020-12-22 2020-12-22 Webpage main body information extraction method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011531289.1A CN112528205B (en) 2020-12-22 2020-12-22 Webpage main body information extraction method and device and storage medium

Publications (2)

Publication Number Publication Date
CN112528205A CN112528205A (en) 2021-03-19
CN112528205B true CN112528205B (en) 2021-10-29

Family

ID=75002369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011531289.1A Active CN112528205B (en) 2020-12-22 2020-12-22 Webpage main body information extraction method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112528205B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332028A (en) * 2011-10-15 2012-01-25 西安交通大学 Webpage-oriented unhealthy Web content identifying method
CN102426600A (en) * 2011-11-08 2012-04-25 军工思波信息科技产业有限公司 Intranet information acquisition method based on meta-search
CN105653668A (en) * 2015-12-29 2016-06-08 武汉理工大学 Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN106528583A (en) * 2015-11-14 2017-03-22 孙燕群 Method for extracting and comparing web page main body
CN107391675A (en) * 2017-07-21 2017-11-24 百度在线网络技术(北京)有限公司 Method and apparatus for generating structure information
CN109543126A (en) * 2018-11-19 2019-03-29 四川长虹电器股份有限公司 Web page text information extracting method based on block text accounting
CN109885743A (en) * 2019-01-04 2019-06-14 上海七印信息科技有限公司 A kind of webpage data information extracting method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819584B (en) * 2010-03-18 2011-11-09 上海引跑信息科技有限公司 Light weight intelligent webpage content analysis method
CN102768663A (en) * 2011-05-05 2012-11-07 腾讯科技(深圳)有限公司 Method and device for extracting webpage title and information processing system
CN102929871A (en) * 2011-08-08 2013-02-13 腾讯科技(深圳)有限公司 Webpage browsing method and device and mobile terminal
US20130173610A1 (en) * 2011-12-29 2013-07-04 Microsoft Corporation Extracting Search-Focused Key N-Grams and/or Phrases for Relevance Rankings in Searches
US20160055243A1 (en) * 2014-08-22 2016-02-25 Ut Battelle, Llc Web crawler for acquiring content
CN110457579B (en) * 2019-07-30 2022-03-22 四川大学 Webpage denoising method and system based on cooperative work of template and classifier
CN110532563B (en) * 2019-09-02 2023-06-20 苏州美能华智能科技有限公司 Method and device for detecting key paragraphs in text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Using Entropy Label for Network Slice Identification in MPLS;B.Decraene, Ed. et al.;《http://www.watersprings.org/pub/id/draft-decraene-mpls-slid-encoded-entropy-label-id-00.html》;20201216;1-5 *
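A Web page topic information extraction method based on information entropy; He Zhiping et al.; Computer Engineering and Applications; 20070430; Vol. 43, No. 4; 164-166 *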

Also Published As

Publication number Publication date
CN112528205A (en) 2021-03-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant