CN115391711B - Webpage text information extraction method, device, equipment and medium - Google Patents

Info

Publication number
CN115391711B
Authority
CN
China
Prior art keywords
webpage
text
dom
text information
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211330508.9A
Other languages
Chinese (zh)
Other versions
CN115391711A (en)
Inventor
孙太凤
郭行飞
刘永丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongxin Kuanwei Media Technology Co ltd
Original Assignee
Zhongxin Kuanwei Media Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongxin Kuanwei Media Technology Co ltd
Priority to CN202211330508.9A
Publication of CN115391711A
Application granted
Publication of CN115391711B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The application provides a webpage text information extraction method, device, equipment and medium, applied to the technical field of information extraction. The webpage text information extraction method comprises: acquiring a DOM tree corresponding to a webpage to be processed; determining, according to the text complexity of the text information corresponding to each DOM node in the DOM tree, a webpage text area where the webpage text information in the webpage to be processed is located; and extracting the webpage text information from the webpage text area. Because the webpage text information is identified and extracted by means of the text complexity of the text information, it can be identified and extracted from the webpage accurately and efficiently. The method does not target websites with a specific webpage structure, and therefore has strong universality, better robustness and higher accuracy.

Description

Webpage text information extraction method, device, equipment and medium
Technical Field
The application relates to the technical field of information extraction, in particular to a method, a device, equipment and a medium for extracting webpage text information.
Background
With the development of internet technology, the number of internet sites and page information has increased explosively, and the internet has also become a main way for people to obtain information. However, due to business concerns, these sites, while providing valuable information to people, may also contain other information, such as advertisements or links to other web sites. Such content can not only degrade the reading experience of people, but also reduce search engine efficiency.
In the process of extracting webpage information, existing schemes cannot handle complex and variable webpage structures and are only suitable for webpages with a specific structure; they have low universality, low extraction efficiency and accuracy, and poor robustness.
Therefore, a new scheme for extracting web page information is required.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a method, an apparatus, a device, and a medium for extracting webpage text information, which do not target websites with a specific webpage structure, are applicable both to single-record-type webpages such as news websites and to multi-record-type webpages such as forum pages (for example, Baidu Tieba), and have better robustness and higher accuracy.
The embodiment of the specification provides the following technical scheme:
an embodiment of the present specification provides a method for extracting webpage text information, including:
acquiring a DOM tree corresponding to a webpage to be processed, wherein DOM nodes in the DOM tree comprise text information corresponding to the DOM nodes;
determining a webpage text area where webpage text information in a webpage to be processed is located according to text complexity of text information corresponding to each DOM node in the DOM tree;
and extracting webpage text information from the webpage text area.
Preferably, the obtaining of the DOM tree corresponding to the webpage to be processed includes:
analyzing and rendering a webpage document and a resource corresponding to a webpage to be processed through a browser or a browser kernel;
after the webpage to be processed is analyzed and rendered, acquiring DOM structure information and text information corresponding to the webpage to be processed;
and constructing a DOM tree corresponding to the webpage to be processed according to the DOM structure information and the text information.
Preferably, before analyzing and rendering the webpage document and the resource corresponding to the webpage to be processed, the method for extracting the webpage text information further includes:
receiving a webpage address corresponding to a webpage to be processed through a browser or a browser kernel;
and acquiring the webpage document and the resource corresponding to the target link according to the target link corresponding to the webpage address.
Preferably, after acquiring the DOM tree corresponding to the webpage to be processed, the method for extracting the text information of the webpage further includes: deleting DOM nodes irrelevant to the text information of the webpage and all descendant nodes thereof from the DOM tree;
and combining DOM nodes belonging to the same paragraph label into a DOM node corresponding to the paragraph label according to the label of the DOM node.
Preferably, determining a webpage body area where the webpage body information in the webpage to be processed is located according to the text complexity of the text information corresponding to each DOM node in the DOM tree, includes:
for each DOM node in the DOM tree, if the leaf node of the DOM node contains text content, extracting the text content;
performing word segmentation processing, part-of-speech tagging processing and dependency syntactic analysis processing on the text content to obtain a processing result;
obtaining text complexity corresponding to the DOM node according to the ratio of the connecting words, the height of the syntax tree and the number of the clauses corresponding to the processing result;
sequentially traversing all DOM nodes, and if the text complexity of the DOM nodes is greater than a complexity threshold value, adding the corresponding DOM nodes into the candidate text set;
and determining a webpage text area where the webpage text information in the webpage to be processed is located in the candidate text set.
Preferably, determining a webpage text area where the webpage text information in the webpage to be processed is located in the candidate text set includes:
grouping DOM nodes in the candidate text set;
and determining a webpage text area where the webpage text information in the webpage to be processed is located according to the summation result of the text complexity corresponding to the multiple DOM nodes in each group.
Preferably, grouping DOM nodes in the candidate text collection includes:
acquiring the distance between the left side of each DOM node in the candidate text set and the page boundary;
dividing DOM nodes with the same distance into the same group.
Preferably, determining a web page body area where web page body information in the web page to be processed is located according to a summation result of text complexities corresponding to the multiple DOM nodes in each group, includes:
and marking the webpage area corresponding to the lowest common ancestor node of all DOM nodes in the group corresponding to the highest summation result as a webpage text area.
An embodiment of the present specification further provides a device for extracting text information of a web page, including:
an acquisition module: acquiring a DOM tree corresponding to a webpage to be processed, wherein DOM nodes in the DOM tree comprise text information corresponding to the DOM nodes;
a processing module: determining a webpage text area where webpage text information in a webpage to be processed is located according to the text complexity of the text information corresponding to each DOM node in the DOM tree;
an extraction module: and extracting webpage text information from the webpage text area.
Preferably, the acquisition module comprises:
an analysis module: analyzing and rendering a webpage document and a resource corresponding to a webpage to be processed through a browser or a browser kernel;
a construction module: after the webpage to be processed is analyzed and rendered, acquiring DOM structure information and text information corresponding to the webpage to be processed; and constructing a DOM tree corresponding to the webpage to be processed according to the DOM structure information and the text information.
Preferably, before analyzing and rendering the web document and the resource corresponding to the web page to be processed, the web page text information extracting apparatus further includes:
a receiving module: receiving a webpage address corresponding to a webpage to be processed through a browser or a browser kernel; and acquiring the webpage document and the resource corresponding to the target link according to the target link corresponding to the webpage address.
Preferably, after acquiring the DOM tree corresponding to the webpage to be processed, the webpage text information extracting apparatus further includes:
a deletion module: deleting DOM nodes irrelevant to the text information of the webpage and all descendant nodes thereof from the DOM tree;
a merging module: and combining DOM nodes belonging to the same paragraph label into a DOM node corresponding to the paragraph label according to the label of the DOM node.
Preferably, the processing module comprises:
a first processing submodule: for each DOM node in the DOM tree, if leaf nodes of the DOM nodes contain text contents, extracting the text contents; performing word segmentation processing, part-of-speech tagging processing and dependency syntactic analysis processing on the text content to obtain a processing result;
a second processing submodule: obtaining the text complexity corresponding to the DOM node according to the connecting word proportion, the syntax tree height and the number of clauses corresponding to the processing result; sequentially traversing all DOM nodes, and if the text complexity of the DOM nodes is greater than a complexity threshold value, adding the corresponding DOM nodes into the candidate text set;
a third processing submodule: and determining a webpage text area where the webpage text information in the webpage to be processed is located in the candidate text set.
Preferably, the third processing sub-module comprises:
a grouping unit: grouping DOM nodes in the candidate text set;
a summing unit: and determining a webpage text area where the webpage text information in the webpage to be processed is located according to the sum result of the text complexity corresponding to the multiple DOM nodes in each group.
Preferably, the grouping unit includes:
a grouping subunit: acquiring the distance between the left side of each DOM node in the candidate text set and the page boundary; dividing DOM nodes with the same distance into the same group.
Preferably, the summing unit includes:
a summation subunit: and marking the webpage area corresponding to the lowest common ancestor node of all DOM nodes in the group corresponding to the highest summation result as a webpage text area.
An embodiment of the present specification further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the webpage text information extraction method.
The embodiment of the specification further provides a computer storage medium, which stores computer-executable instructions, and the computer-executable instructions are executed by a processor to execute the webpage text information extraction method.
Compared with the prior art, the beneficial effects achievable by at least one technical solution adopted in the embodiments of this specification include at least the following: each DOM node is traversed in turn using the text information corresponding to that DOM node, and the webpage text area where the webpage text information is located is determined according to the text complexity of the text information.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for extracting text information of a web page according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of an apparatus for extracting text information of a web page provided in an embodiment of this specification.
Detailed Description
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It should be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present application, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number and aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present application, and the drawings only show the components related to the present application rather than the number, shape and size of the components in actual implementation, and the type, amount and ratio of the components in actual implementation may be changed arbitrarily, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details.
At present, common methods for extracting web page information include manual extraction methods, vision-based extraction methods, statistics-based extraction methods, and extraction methods based on machine learning and deep learning. Manual extraction is highly accurate, but it requires heavy human involvement, is time-consuming, and is costly to maintain, so it is unsuitable for extracting information from a massive number of target websites. Vision-based extraction analyzes the page from the perspective of how people visually browse it, which improves extraction accuracy; however, as the web has developed, the structure of World Wide Web (WEB) pages has become more complex, the information they carry has become more varied, and the boundaries between blocks have become blurred, so extraction has become progressively harder. Extraction based on statistical features scales well but is limited in universality. Methods based on machine learning and deep learning can discover more deeply hidden webpage features on top of statistical features and thus achieve better extraction results, but the training data must be labeled in advance and the model is trained and used for prediction on webpages of specific types and scenarios, so these methods also suffer from limited universality.
In view of this, through in-depth study of various websites, their webpage structures and page information, and through study and improvement of various extraction schemes, the inventors found that when existing schemes face webpages with increasingly complex structures, they cannot accurately and effectively identify and extract the webpage text information, which leads to low universality, poor accuracy, and low efficiency in practical applications.
Therefore, if the area where the webpage text information is located can be identified according to the characteristics of the webpage text information in the webpage, and then the webpage text information is extracted, the universality, the accuracy, the efficiency and the like of the webpage text information extraction scheme can be improved substantially.
Based on this, the embodiment of the present specification provides a processing scheme for extracting text information of a web page: each DOM node on the DOM tree corresponds to an area on the webpage, the text complexity of the area where the webpage text information is located is generally high, the area where the webpage text information is located can be extracted according to the text complexity, and then the webpage text information is extracted, so that the webpage text information can be accurately and quickly extracted.
The technical solutions provided by the embodiments of the present application are described below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present specification provides a method for extracting text information of a web page, including:
Step 1: acquiring a DOM tree corresponding to the webpage to be processed, wherein DOM nodes in the DOM tree comprise text information corresponding to the DOM nodes.
According to the World Wide Web Consortium (W3C) standard, the body text of a web page is generally located within the <body> tag, so the DOM tree can be constructed with the <body> tag as the root node. The <body> tag represents the main part of the web page, and the content within it is visible to the user.
The text information is the main part of the HTML webpage and can comprise data such as tags, text, links and media in the webpage document; it can be obtained directly from the HTML document through natural language processing. Tags refer to HTML semantic tags, which usually appear in pairs, such as <div> and <table>; text refers to the text (Content Text) embedded in a node of the DOM tree; links refers to the number of links in the text (Link Num); media refers to pictures, multimedia, and the like.
It should be noted that each web page can be parsed into a DOM tree, all tags are nodes, and characters and pictures are leaf nodes. Here, when acquiring the DOM tree, the DOM tree may be externally input DOM tree data information, or may be a DOM tree corresponding to a web page obtained by locally processing the web page.
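The tag-node and leaf-text structure described above can be sketched with Python's standard `html.parser` module. This is only a minimal illustration under stated assumptions: it parses static HTML rather than rendering the page in a browser kernel, and `Node`, `BodyTreeBuilder` and `build_dom_tree` are hypothetical names introduced here, not part of the described method.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A simplified DOM node: a tag name, child nodes, and directly owned text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text that sits directly inside this element


class BodyTreeBuilder(HTMLParser):
    """Builds a Node tree rooted at <body>; content outside <body> is ignored."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag == "body" and self.root is None:
            self.root = Node("body")
            self.stack = [self.root]
        elif self.stack:
            node = Node(tag, parent=self.stack[-1])
            self.stack[-1].children.append(node)
            if tag not in VOID_TAGS:
                self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach non-blank text to the innermost open element.
        if self.stack and data.strip():
            current = self.stack[-1]
            current.text = (current.text + " " + data.strip()).strip()


def build_dom_tree(html):
    """Return the <body>-rooted Node tree for an HTML string (None if no <body>)."""
    builder = BodyTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A call such as `build_dom_tree(page_source)` returns the `<body>` node, whose `children` and `text` attributes can then be traversed. A real implementation following the embodiment would obtain the tree from a browser or browser kernel so that rendered layout information is also available.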
Step 2: determining a webpage text area where the webpage text information in the webpage to be processed is located according to the text complexity of the text information corresponding to each DOM node in the DOM tree.
Each DOM node on the DOM tree corresponds to an area on a webpage, text information corresponds to each area, the text complexity of each DOM node can be determined according to the text information, and the area belonging to the body information of the webpage, namely the body area of the webpage, can be accurately determined according to the text complexity.
Step 3: extracting the webpage text information from the webpage text area.
Specifically, the text information in the text area of the web page determined by the steps is extracted to obtain the text information of the web page.
The webpage text information extraction method provided by the embodiment of the specification improves the universality of the extraction scheme in the face of different types of webpages, also improves the webpage text area identification efficiency, and can ensure that the extracted nodes belong to text area nodes with higher accuracy.
In some optional embodiments, obtaining a DOM tree corresponding to a to-be-processed web page includes: analyzing and rendering a webpage document and a resource corresponding to a webpage to be processed through a browser or a browser kernel; after the webpage to be processed is analyzed and rendered, acquiring DOM structure information and text information corresponding to the webpage to be processed; and constructing a DOM tree corresponding to the webpage to be processed according to the DOM structure information and the text information.
Specifically, the browser or browser kernel parses the HTML code of the webpage to be processed according to its syntax structure, and provides the script runtime environment and the visual presentation of the webpage. After the webpage document and resources corresponding to the webpage to be processed have been parsed and rendered, the DOM structure information and text information are acquired and the DOM tree is constructed from them, so that the DOM nodes in the DOM tree carry both DOM structure information and text information.
It should be noted that obtaining the DOM tree with text information directly through a browser or browser kernel makes it convenient to construct a DOM tree for real webpages, so that the scheme adapts better to different types of webpages.
In an embodiment of this specification, before parsing and rendering a webpage document and a resource corresponding to a webpage to be processed, the method for extracting webpage text information further includes: receiving a webpage address corresponding to a webpage to be processed through a browser or a browser kernel; and acquiring the webpage document and the resource corresponding to the target link according to the target link corresponding to the webpage address.
Specifically, the webpage resource corresponding to the specified URL can be downloaded directly based on the browser or the browser kernel, so that the DOM tree can be generated quickly at the local end.
Further, after acquiring the DOM tree corresponding to the webpage to be processed, the method for extracting the text information of the webpage further comprises the following steps: deleting DOM nodes irrelevant to the text information of the webpage and all descendant nodes thereof from the DOM tree; and merging the DOM nodes belonging to the same paragraph label into the DOM node corresponding to the paragraph label according to the label of the DOM node.
Specifically, according to the specific meaning of the HTML tag, the DOM node and all descendant nodes thereof that are irrelevant to the body information of the web page are deleted from the DOM tree, such as: footer (footer), style (style), page navigation (nav), sidebar area (aside), script (script), scalable vector graphics (svg), meta (meta), and the like.
In addition, the webpage text information is generally placed in paragraph (<p>) tags, and a <p> tag often contains hyperlink (<a>) tags, inline (<span>) tags, emphasis (<strong>) tags and the like, whose contents are also webpage text information. For the convenience of subsequent processing, each <a>, <strong> and <span> tag is merged with its parent node into one unit; that is, the DOM nodes of the <a>, <strong> and <span> tags are merged into the DOM node of the <p> tag that is their parent.
The DOM tree has exactly one root node together with multiple child nodes and leaf nodes, and every child node has a parent node, so the node corresponding to a <p> tag may share a parent node with other nodes. In this embodiment of the specification, however, only the child nodes under the node corresponding to a <p> tag are merged into that node; no other nodes are merged.
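The pruning and merging steps above might be sketched as follows. This is an illustrative sketch only: the dictionary-based node layout and the helper names `make_node`, `prune`, `collect_text` and `merge_paragraphs` are assumptions made for the example, and the tag sets simply mirror the tags listed in the text.

```python
# Tags treated as irrelevant to the body text (taken from the list in the text).
NOISE_TAGS = {"footer", "style", "nav", "aside", "script", "svg", "meta"}
# Inline tags that are folded into their parent <p> node.
INLINE_TAGS = {"a", "strong", "span"}


def make_node(tag, text="", children=None):
    """Hypothetical minimal node: tag name, own text, list of child nodes."""
    return {"tag": tag, "text": text, "children": children or []}


def prune(node):
    """Drop noise nodes together with all of their descendants (in place)."""
    node["children"] = [prune(c) for c in node["children"]
                        if c["tag"] not in NOISE_TAGS]
    return node


def collect_text(node):
    """Concatenate the text of a node and of all its descendants."""
    parts = [node["text"]] + [collect_text(c) for c in node["children"]]
    return " ".join(p for p in parts if p)


def merge_paragraphs(node):
    """Fold <a>, <strong> and <span> children of each <p> into the <p> node.

    Only children of <p> nodes are merged; inline tags elsewhere are left alone.
    """
    if node["tag"] == "p":
        kept = []
        for child in node["children"]:
            if child["tag"] in INLINE_TAGS:
                merged = (node["text"], collect_text(child))
                node["text"] = " ".join(t for t in merged if t)
            else:
                kept.append(child)
        node["children"] = kept
    for child in node["children"]:
        merge_paragraphs(child)
    return node
```

Because this simplified node keeps its own text separately from its children, merged inline text is appended after the paragraph's own text; a fuller implementation would preserve the original reading order.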
Generally, compared with noise content such as navigation and advertisements, webpage text information tends to use complex sentence patterns and a degree of rhetorical modification, so it generally has higher text complexity. In the embodiments of this specification, a natural language processing platform is used to process the text information of each DOM node in the webpage to be processed, and the position of the webpage text information, which may be the webpage text area, is then determined according to the text complexity scores of the DOM nodes.
In some optional embodiments, determining a webpage body area where webpage body information in a webpage to be processed is located according to text complexity of text information corresponding to each DOM node in a DOM tree includes: for each DOM node in the DOM tree, if the leaf node of the DOM node contains text content, extracting the text content; performing word segmentation processing, part-of-speech tagging processing and dependency syntactic analysis processing on the text content to obtain a processing result; obtaining text complexity corresponding to the DOM node according to the ratio of the connecting words, the height of the syntax tree and the number of the clauses corresponding to the processing result; sequentially traversing all DOM nodes, and if the text complexity of the DOM nodes is greater than a complexity threshold value, adding the corresponding DOM nodes into the candidate text set; and determining a webpage text area where the webpage text information in the webpage to be processed is located in the candidate text set.
The embodiments of this specification do not limit the complexity threshold; it may be set according to the specific situation, for example to 0.
The method for determining the webpage text area where the webpage text information in the webpage to be processed is located in the candidate text set comprises the following steps: grouping DOM nodes in the candidate text set; and determining a webpage text area where the webpage text information in the webpage to be processed is located according to the summation result of the text complexity corresponding to the multiple DOM nodes in each group.
Specifically, grouping DOM nodes in a candidate body text set includes: acquiring the distance between the left side of each DOM node in the candidate text set and the page boundary; dividing DOM nodes with the same distance into the same group.
Further, determining a webpage body area where webpage body information in the webpage to be processed is located according to a summation result of the text complexity corresponding to the multiple DOM nodes in each group, including: and marking the webpage area corresponding to the lowest common ancestor node of all DOM nodes in the group corresponding to the highest summation result as a webpage text area.
Specifically, a candidate text set N is first defined. All DOM nodes in the DOM tree are then traversed in sequence, and if the leaf nodes of a DOM node contain text content, that text content is extracted. The extracted text content is then subjected to word segmentation, part-of-speech tagging, and dependency syntax analysis using a natural language processing library (for example, NLTK, the Natural Language Toolkit), and the text complexity of the DOM node is defined in terms of the connective word proportion, the syntax tree height, the number of clauses, and the like.
The specific steps are as follows: a) the connective word proportion (conj_rate) is the proportion of tokens in the word segmentation result whose part of speech is a pronoun, a conjunction, or the like, relative to the total number of tokens; b) the syntax tree height (parsing_height) is the height of the syntax tree obtained from the dependency syntax analysis; c) the number of clauses (punc_count) is the number of punctuation marks in the text content of the DOM node, where the punctuation marks include commas, semicolons, periods, ellipses, question marks, exclamation marks, and the like. The text complexity (score) corresponding to the DOM node is then calculated by formula (1):
score = conj_rate * parsing_height * punc_count    (1)
If the text complexity score corresponding to a DOM node is greater than 0, the DOM node is added to the candidate text set N.
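As an illustration of formula (1), the following minimal Python sketch computes the three factors from text that has already been segmented, tagged, and parsed. The `Token` structure, the Penn Treebank style tag set in `CONNECTIVE_TAGS`, and the punctuation set are assumptions made for this example; the specification only names pronouns, conjunctions "and the like" and does not fix a tag set or a parser.

```python
from dataclasses import dataclass
from typing import List

# Assumed tag set (Penn Treebank style) for "pronouns, conjunctions and the like".
CONNECTIVE_TAGS = {"CC", "PRP", "PRP$"}
# Assumed punctuation set; an ellipsis typed as "..." would count as three periods.
PUNCTUATION = set(",;.?!…，；。？！")


@dataclass
class Token:
    text: str
    pos: str   # part-of-speech tag produced by some tagger
    head: int  # index of the syntactic head within the sentence, -1 for the root


def conj_rate(tokens: List[Token]) -> float:
    """Share of tokens whose part of speech is in CONNECTIVE_TAGS."""
    if not tokens:
        return 0.0
    return sum(t.pos in CONNECTIVE_TAGS for t in tokens) / len(tokens)


def parsing_height(tokens: List[Token]) -> int:
    """Number of nodes on the longest root-to-leaf path of the dependency tree."""
    def depth(i: int) -> int:
        d = 1
        while tokens[i].head != -1:  # assumes a well-formed tree without cycles
            i = tokens[i].head
            d += 1
        return d
    return max((depth(i) for i in range(len(tokens))), default=0)


def punc_count(text: str) -> int:
    """Number of punctuation marks in the raw text of the DOM node."""
    return sum(ch in PUNCTUATION for ch in text)


def text_complexity(text: str, tokens: List[Token]) -> float:
    """score = conj_rate * parsing_height * punc_count, as in formula (1)."""
    return conj_rate(tokens) * parsing_height(tokens) * punc_count(text)


# Hand-annotated example sentence (the annotations are illustrative).
text = "He left, and she stayed."
tokens = [
    Token("He", "PRP", 1), Token("left", "VBD", -1), Token(",", ",", 1),
    Token("and", "CC", 5), Token("she", "PRP", 5), Token("stayed", "VBD", 1),
    Token(".", ".", 1),
]
score = text_complexity(text, tokens)  # 3/7 * 3 * 2, about 2.57
candidate = score > 0                  # threshold of 0, as in the description
```

A short navigation label such as "Home" has no connective words and no punctuation, so its score is 0 and it is not added to the candidate set, which is the filtering effect the multiplicative form appears designed to produce.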
After that, the DOM nodes in the candidate text set N are grouped. The distance (left_loc) between the left edge of each DOM node in the candidate text set N and the page boundary is first obtained, and DOM nodes with the same left_loc are divided into the same group. The text complexity scores of all DOM nodes within each group are then summed, and the sum of the text complexity of each group is denoted score_sum.
Finally, the webpage area corresponding to the lowest common ancestor node of all DOM nodes in the group with the highest score_sum is marked as the webpage text area, and the text information in this area is the webpage text information.
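The grouping and selection steps can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `Candidate` structure, the representation of a node by its root-to-node `path`, and the sample values are all assumptions introduced for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    path: Tuple[str, ...]  # hypothetical identifiers from the root down to this node
    left_loc: int          # distance between the node's left edge and the page boundary
    score: float           # text complexity from formula (1)


def group_by_left_loc(cands: List[Candidate]) -> Dict[int, List[Candidate]]:
    """Put candidates with the same left_loc into the same group."""
    groups: Dict[int, List[Candidate]] = defaultdict(list)
    for c in cands:
        groups[c.left_loc].append(c)
    return dict(groups)


def lowest_common_ancestor(paths: List[Tuple[str, ...]]) -> Tuple[str, ...]:
    """Longest common prefix of the root-to-node paths."""
    prefix = list(paths[0])
    for p in paths[1:]:
        n = 0
        while n < min(len(prefix), len(p)) and prefix[n] == p[n]:
            n += 1
        prefix = prefix[:n]
    return tuple(prefix)


def find_text_region(cands: List[Candidate]) -> Optional[Tuple[str, ...]]:
    """Return the path of the LCA of the group with the highest score_sum."""
    if not cands:
        return None
    groups = group_by_left_loc(cands)
    best = max(groups.values(), key=lambda g: sum(c.score for c in g))
    return lowest_common_ancestor([c.path for c in best])


cands = [
    Candidate(("html", "body", "div#main", "p[1]"), 120, 2.5),
    Candidate(("html", "body", "div#main", "p[2]"), 120, 4.0),
    Candidate(("html", "body", "div#side", "p[1]"), 900, 3.0),
]
region = find_text_region(cands)  # ('html', 'body', 'div#main')
```

Here the group at left_loc 120 has score_sum 6.5 against 3.0 for the sidebar group, so the region is the common ancestor of the two main-column paragraphs. Representing nodes by paths keeps the lowest common ancestor a simple prefix computation; a real DOM would more likely walk parent pointers.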
Based on the same inventive concept, embodiments of the present specification provide a web page text information extraction apparatus, an electronic device, and a computer storage medium corresponding to the web page text information extraction method of any one of the foregoing embodiments.
As shown in FIG. 2, an embodiment of the present specification further provides a webpage text information extraction apparatus 100, including:
an acquisition module 101, configured to acquire a DOM tree corresponding to the webpage to be processed, wherein the DOM nodes in the DOM tree contain the text information corresponding to the DOM nodes.
The acquisition module 101 includes: a parsing module, configured to parse and render, through a browser or a browser kernel, the webpage document and resources corresponding to the webpage to be processed; and a construction module, configured to acquire, after the webpage to be processed has been parsed and rendered, the DOM structure information and text information corresponding to the webpage to be processed, and to construct the DOM tree corresponding to the webpage to be processed according to the DOM structure information and the text information.
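For illustration only, the sketch below builds a simple DOM-like tree from static HTML using Python's standard `html.parser`. This is a deliberate substitute for the browser or browser kernel named in the specification: it does not execute scripts or render the page, so it cannot supply layout information such as the left-edge distance used later. The `Node` and `DomBuilder` names are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


@dataclass(eq=False)
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)


class DomBuilder(HTMLParser):
    """Collects DOM structure information and text information into a tree."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent

    def handle_data(self, data):
        if data.strip():
            self._current.text += data.strip()


def build_dom(html: str) -> Node:
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = ("<html><body><div><p>Hello, world.</p><br>"
        "<p>Second paragraph.</p></div></body></html>")
dom = build_dom(page)
div = dom.children[0].children[0].children[0]
print([child.tag for child in div.children])  # ['p', 'br', 'p']
```

In a setting closer to the specification, the same tree would be read from a rendered page so that each node could also carry its on-screen position.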
Specifically, the webpage text information extraction apparatus 100 further includes a receiving module that operates before the webpage document and resources corresponding to the webpage to be processed are parsed and rendered. The receiving module is configured to receive, through the browser or the browser kernel, the webpage address corresponding to the webpage to be processed, and to acquire the webpage document and resources corresponding to the target link according to the target link corresponding to the webpage address.
Further, the webpage text information extraction apparatus 100 includes modules that operate after the DOM tree corresponding to the webpage to be processed has been acquired: a deletion module, configured to delete from the DOM tree the DOM nodes irrelevant to the webpage text information, together with all of their descendant nodes; and a merging module, configured to merge the DOM nodes belonging to the same paragraph tag into the DOM node corresponding to that paragraph tag, according to the tags of the DOM nodes.
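The deletion and merging steps might look like the following Python sketch. The specification does not list which tags count as irrelevant or as paragraph tags, so `IRRELEVANT_TAGS` and `PARAGRAPH_TAGS` are assumptions chosen for the example, as is the simplified `Node` structure.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed tag lists; the specification does not enumerate them.
IRRELEVANT_TAGS = {"script", "style", "nav", "footer", "aside"}
PARAGRAPH_TAGS = {"p"}


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def prune(node: Node) -> None:
    """Delete irrelevant nodes together with all of their descendants, in place."""
    node.children = [c for c in node.children if c.tag not in IRRELEVANT_TAGS]
    for child in node.children:
        prune(child)


def collect_text(node: Node) -> str:
    """Join a node's own text with the text of its descendants.

    Simplification: a node's own text is assumed to precede its children's text.
    """
    parts = [node.text] + [collect_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def merge_paragraphs(node: Node) -> None:
    """Fold everything under a paragraph tag into the paragraph node itself."""
    if node.tag in PARAGRAPH_TAGS:
        node.text = collect_text(node)
        node.children = []
        return
    for child in node.children:
        merge_paragraphs(child)


tree = Node("body", children=[
    Node("nav", children=[Node("a", "Home")]),
    Node("p", "See the", [Node("a", "full report"), Node("b", "today.")]),
    Node("script", "var x = 1;"),
])
prune(tree)
merge_paragraphs(tree)
print(tree.children[0].text)  # See the full report today.
```

Merging inline children into their paragraph means the complexity score is later computed over whole sentences rather than over fragments split by links or emphasis tags.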
a processing module 102, configured to determine the webpage text area where the webpage text information in the webpage to be processed is located, according to the text complexity of the text information corresponding to each DOM node in the DOM tree.
The processing module 102 includes: a first processing submodule, configured to, for each DOM node in the DOM tree, extract the text content if the leaf nodes of the DOM node contain text content, and to perform word segmentation, part-of-speech tagging, and dependency syntax analysis on the text content to obtain a processing result; a second processing submodule, configured to obtain the text complexity corresponding to the DOM node according to the connective word proportion, the syntax tree height, and the number of clauses corresponding to the processing result, and to traverse all DOM nodes in sequence, adding each DOM node whose text complexity is greater than the complexity threshold to the candidate text set; and a third processing submodule, configured to determine, within the candidate text set, the webpage text area where the webpage text information in the webpage to be processed is located.
Specifically, the third processing submodule includes: a grouping unit, configured to group the DOM nodes in the candidate text set; and a summing unit, configured to determine the webpage text area where the webpage text information in the webpage to be processed is located, according to the summation result of the text complexity corresponding to the multiple DOM nodes in each group.
The grouping unit includes a grouping subunit, configured to acquire the distance between the left edge of each DOM node in the candidate text set and the page boundary, and to divide DOM nodes with the same distance into the same group.
Further, the summing unit includes a summation subunit, configured to mark the webpage area corresponding to the lowest common ancestor node of all DOM nodes in the group with the highest summation result as the webpage text area.
an extraction module 103, configured to extract the webpage text information from the webpage text area.
An embodiment of the present specification further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the webpage text information extraction method.
The embodiment of the specification further provides a computer storage medium, which stores computer-executable instructions, and the computer-executable instructions are executed by a processor to execute the webpage text information extraction method.
It should be noted that the computer storage medium may include, but is not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the data processing of the present application may also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to perform several steps of the method according to any one of the foregoing embodiments.
The program code for carrying out the operations of the present application may be written in any combination of one or more programming languages. The program code may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
It should be noted that the above example is only one combination application scheme formed by combining various embodiments in this specification, and the combination may also be in other forms, and is not described here.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus, device, and medium embodiments correspond to the method, their description is relatively brief, and for relevant points reference may be made to the corresponding description of the method embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. A webpage text information extraction method is characterized by comprising the following steps:
acquiring a DOM tree corresponding to a webpage to be processed, wherein DOM nodes in the DOM tree contain text information corresponding to the DOM nodes;
determining a webpage text area where webpage text information in the webpage to be processed is located according to the text complexity of the text information corresponding to each DOM node in the DOM tree;
extracting the webpage text information from the webpage text area;
wherein the determining a webpage text area where webpage text information in the webpage to be processed is located according to the text complexity of the text information corresponding to each DOM node in the DOM tree includes:
for each DOM node in the DOM tree, if the leaf node of the DOM node contains text content, extracting the text content;
performing word segmentation processing, part-of-speech tagging processing and dependency syntactic analysis processing on the text content to obtain a processing result;
obtaining the text complexity corresponding to the DOM node according to the connecting word proportion, the syntax tree height and the number of clauses corresponding to the processing result;
sequentially traversing all DOM nodes, and if the text complexity of the DOM nodes is greater than a complexity threshold value, adding the corresponding DOM nodes into a candidate text set;
determining the webpage text area where the webpage text information in the webpage to be processed is located in the candidate text set;
wherein the determining the webpage text area where the webpage text information in the webpage to be processed is located in the candidate text set includes:
acquiring the distance between the left side of each DOM node in the candidate text set and the page boundary;
dividing the DOM nodes with the same distance into the same group;
and determining the webpage text area where the webpage text information in the webpage to be processed is located according to the summation result of the text complexity corresponding to the multiple DOM nodes in each group.
2. The method for extracting webpage text information according to claim 1, wherein after the obtaining the DOM tree corresponding to the webpage to be processed, the method for extracting webpage text information further comprises:
deleting the DOM node irrelevant to the webpage text information and all descendant nodes thereof from the DOM tree;
and merging the DOM nodes belonging to the same paragraph tag into the DOM node corresponding to the paragraph tag according to the tags of the DOM nodes.
3. The method for extracting webpage text information according to claim 1, wherein the determining the webpage text area where the webpage text information is located in the webpage to be processed according to the summation result of the text complexity corresponding to the DOM nodes in each of the groups comprises:
and marking the webpage area corresponding to the lowest common ancestor node of all the DOM nodes in the group corresponding to the highest summation result as the webpage text area.
4. A web page text information extraction device is characterized by comprising:
an acquisition module: acquiring a DOM tree corresponding to a webpage to be processed, wherein DOM nodes in the DOM tree contain text information corresponding to the DOM nodes;
a processing module: determining a webpage text area where webpage text information in the webpage to be processed is located according to the text complexity of the text information corresponding to each DOM node in the DOM tree;
an extraction module: extracting the webpage text information from the webpage text area;
wherein the processing module comprises:
for each DOM node in the DOM tree, if the leaf node of the DOM node contains text content, extracting the text content;
performing word segmentation processing, part-of-speech tagging processing and dependency syntactic analysis processing on the text content to obtain a processing result;
obtaining the text complexity corresponding to the DOM node according to the connecting word proportion, the syntax tree height and the number of clauses corresponding to the processing result;
sequentially traversing all DOM nodes, and if the text complexity of the DOM nodes is greater than a complexity threshold value, adding the corresponding DOM nodes into a candidate text set;
determining the webpage text area where the webpage text information in the webpage to be processed is located in the candidate text set;
determining the webpage text area where the webpage text information in the webpage to be processed is located in the candidate text set, wherein the determining comprises the following steps:
acquiring the distance between the left side of each DOM node in the candidate text set and the page boundary;
dividing the DOM nodes with the same distance into the same group;
and determining the webpage text area where the webpage text information is located in the webpage to be processed according to the summation result of the text complexity corresponding to the DOM nodes in each group.
5. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the webpage text information extraction method according to any one of claims 1 to 3.
6. A computer storage medium storing computer-executable instructions that, when executed by a processor, perform the web page text information extraction method of any one of claims 1-3.
CN202211330508.9A 2022-10-28 2022-10-28 Webpage text information extraction method, device, equipment and medium Active CN115391711B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211330508.9A CN115391711B (en) 2022-10-28 2022-10-28 Webpage text information extraction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211330508.9A CN115391711B (en) 2022-10-28 2022-10-28 Webpage text information extraction method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115391711A CN115391711A (en) 2022-11-25
CN115391711B true CN115391711B (en) 2023-03-24

Family

ID=84115031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211330508.9A Active CN115391711B (en) 2022-10-28 2022-10-28 Webpage text information extraction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115391711B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116362223B (en) * 2023-03-07 2023-12-15 北京粉笔蓝天科技有限公司 Automatic identification method and device for web page article titles and texts

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150135061A1 (en) * 2013-11-08 2015-05-14 Qualcomm Incorporated Systems and methods for parallel traversal of document object model tree
CN106528509B (en) * 2016-11-11 2020-04-03 政和科技股份有限公司 Webpage information extraction method and device
CN107894974A (en) * 2017-11-02 2018-04-10 华南农业大学 Webpage context extraction method based on tag path and text punctuate than Fusion Features
CN108897749A (en) * 2018-04-19 2018-11-27 中国科学院计算技术研究所 Method for abstracting web page information and system based on syntax tree and text block density
CN114818680A (en) * 2022-06-02 2022-07-29 天翼数字生活科技有限公司 Method and device for identifying webpage text and related equipment

Also Published As

Publication number Publication date
CN115391711A (en) 2022-11-25

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant