CN115238078A - Webpage information extraction method, device, equipment and storage medium - Google Patents

Webpage information extraction method, device, equipment and storage medium

Info

Publication number
CN115238078A
Authority
CN
China
Prior art keywords
entity
node
target
title
node set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210959612.8A
Other languages
Chinese (zh)
Inventor
周立运
Other inventors have requested that their names not be disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Cube Beijing Pharmaceutical Technology Co ltd
Original Assignee
Digital Cube Beijing Pharmaceutical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Cube Beijing Pharmaceutical Technology Co ltd
Priority to CN202210959612.8A
Publication of CN115238078A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method, a device, equipment and a storage medium for extracting webpage information, and belongs to the technical field of the Internet. The method comprises the following steps: analyzing a webpage to be processed to obtain a target tree object and text information corresponding to nodes in the target tree object; respectively processing the text information corresponding to the nodes in the target tree object based on a target title classifier and a target entity classifier, and determining a title node set and an entity node set from the target tree object according to the processing result; performing entity extraction on the text information corresponding to the entity node set based on a target entity recognition model to obtain entity information of the webpage to be processed; and determining the content position information of the webpage to be processed according to the title node set and the entity node set. With this technical scheme, information can be extracted from webpages with flexible structures.

Description

Webpage information extraction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting web page information.
Background
In today's internet era, web pages have become an important information carrier through which most people learn about the world. At the same time, because web pages usually contain some amount of noisy text, there is a growing demand for mining useful information from them.
In industrial data mining, the main task is entity extraction for data analysis in a professional field. A webpage content extraction task is usually added as well: the extracted webpage content must contain the extracted entities, so that irrelevant text blocks in the webpage are filtered out and the local context of each extracted entity can be displayed. However, most existing webpage content extraction schemes apply only to webpages with a fixed structure, and a separate extraction strategy has to be configured for the data from each site, so they lack generality.
Disclosure of Invention
The invention provides a webpage information extraction method, a device, equipment and a storage medium, which are used for realizing information extraction of a webpage with a flexible structure.
According to an aspect of the present invention, there is provided a method for extracting web page information, the method including:
analyzing a webpage to be processed to obtain a target tree object and text information corresponding to nodes in the target tree object;
respectively processing text information corresponding to nodes in the target tree object based on a target title classifier and a target entity classifier, and determining a title node set and an entity node set from the target tree object according to a processing result;
performing entity extraction on text information corresponding to the entity node set based on a target entity recognition model to obtain entity information of the webpage to be processed;
determining the content position information of the webpage to be processed according to the title node set and the entity node set.
According to another aspect of the present invention, there is provided a web page information extraction apparatus, including:
a webpage information analysis module, configured to analyze a webpage to be processed to obtain a target tree object and text information corresponding to nodes in the target tree object;
a node set determining module, configured to process the text information corresponding to the nodes in the target tree object based on a target title classifier and a target entity classifier respectively, and determine a title node set and an entity node set from the target tree object according to the processing result;
an entity information determining module, configured to perform entity extraction on the text information corresponding to the entity node set based on a target entity recognition model to obtain entity information of the webpage to be processed;
and a content position information determining module, configured to determine the content position information of the webpage to be processed according to the title node set and the entity node set.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to execute the web page information extraction method according to any embodiment of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the method for extracting web page information according to any embodiment of the present invention when the computer instructions are executed.
According to the technical scheme, a target tree object and text information corresponding to nodes in the target tree object are obtained by analyzing a webpage to be processed, then the text information corresponding to the nodes in the target tree object is processed respectively based on a target title classifier and a target entity classifier, a title node set and an entity node set are determined from the target tree object according to a processing result, then entity extraction is carried out on the text information corresponding to the entity node set based on a target entity recognition model to obtain entity information of the webpage to be processed, and finally content position information of the webpage to be processed is determined according to the title node set and the entity node set. According to the technical scheme, the title classifier and the entity classifier are introduced, namely the title nodes and the entity nodes are filtered by the double classifiers, and the two auxiliary classifiers can be used for extracting information of webpages with different structures, so that the universality is enhanced; meanwhile, the accuracy of extracting the webpage information can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for extracting web page information according to an embodiment of the present invention;
Fig. 2A is a flowchart of a method for extracting web page information according to a second embodiment of the present invention;
Fig. 2B is a schematic diagram of candidate content node determination according to the second embodiment of the present invention;
Fig. 3 is a flowchart of a method for extracting web page information according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a web page information extraction apparatus according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device implementing the method for extracting web page information according to the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "sample," "object," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In addition, in the technical scheme of the invention, the collection, storage, use, processing, transmission, provision and disclosure of the webpage to be processed and the like all comply with the relevant laws and regulations and do not violate public order and good morals.
Example one
Fig. 1 is a flowchart of a method for extracting web page information according to an embodiment of the present invention. The embodiment is applicable to the situation of how to extract the webpage information, and the method can be executed by a webpage information extracting device, which can be implemented in the form of hardware and/or software, and the webpage information extracting device can be integrated in an electronic device bearing a webpage information extracting function, such as a server. As shown in fig. 1, the method for extracting web page information of this embodiment may include:
S110, analyzing the webpage to be processed to obtain a target tree object and text information corresponding to the nodes in the target tree object.
In this embodiment, the webpage to be processed refers to a webpage document from which information needs to be extracted, for example an HTML (HyperText Markup Language) document. It should be noted that the HTML document is the concrete data representation of the webpage, and the webpage is the HTML document as rendered by the browser. Besides the effective webpage content, the HTML document contains HTML tags such as div, p and table, as well as CSS and JavaScript code, and therefore a large number of irrelevant characters. HTML has a standard tree structure: the HTML nodes are organized hierarchically to form a document tree, i.e., a Document Object Model (DOM).
The target tree object is a document tree structure, namely a DOM tree, corresponding to the webpage to be processed.
The text information corresponding to the nodes in the target tree object refers to the text information in the webpage corresponding to each node in the target tree object.
Specifically, the webpage to be processed may be parsed with an HTML parsing tool (e.g., lxml) to obtain the target tree object corresponding to the webpage to be processed and the text information corresponding to the nodes in the target tree object.
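The parsing step can be sketched as follows. This is only an illustrative sketch: the source names lxml, whereas the example uses Python's standard-library xml.etree.ElementTree as a stand-in, which assumes well-formed markup (real HTML generally needs a tolerant parser such as lxml). The function name parse_page and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_page(markup: str):
    """Parse well-formed markup into a tree and collect each node's own text."""
    root = ET.fromstring(markup)
    node_texts = {}
    for node in root.iter():  # document order
        # Keep only the text directly inside the node, not its descendants' text.
        text = (node.text or "").strip()
        if text:
            node_texts[node] = text
    return root, node_texts

# Hypothetical sample page: one titled paragraph and one unrelated block.
SAMPLE = (
    "<html><body>"
    "<div><h2>3. Main subject information</h2>"
    "<p>Hospital liquid oxygen supply service</p></div>"
    "<div><p>Copyright notice</p></div>"
    "</body></html>"
)

root, node_texts = parse_page(SAMPLE)
texts_by_tag = [(node.tag, text) for node, text in node_texts.items()]
```

Here root plays the role of the target tree object and node_texts holds the text information per node.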
S120, respectively processing the text information corresponding to the nodes in the target tree object based on the target title classifier and the target entity classifier, and determining a title node set and an entity node set from the target tree object according to the processing result.
In this embodiment, the target title classifier is a classifier for determining whether a node in the target tree object is a title node. The target entity classifier is a classifier for determining whether a node in the target tree object is an entity node.
A title node is a node of the title class. It should be noted that a title in the present invention can be understood as a title-like paragraph with a relatively obvious marker whose text follows a certain regularity; it may typically be a section opening, a header, and the like. An entity node is a node of the entity class.
Specifically, for each node in the target tree object, the text information corresponding to the node is input into the target title classifier and the target entity classifier, and the classifiers process the text information to obtain the category to which it belongs. Whether the node is a title node or an entity node is then determined from that category: if the text information corresponding to the node belongs to the title category, the node is a title node; if it belongs to the entity category, the node is an entity node. In this way, the title node set and the entity node set can be determined from the target tree object.
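The filtering of nodes by two classifiers can be illustrated with the sketch below. The source does not specify the classifier models, so the two functions here are toy rule-based stand-ins (a numbering-prefix rule and a keyword rule), and all names, keywords and node paths are assumptions made for illustration only.

```python
import re

def title_classifier(text: str) -> bool:
    """Toy stand-in: text that starts with a numbering prefix counts as a title."""
    return re.match(r"^\s*(\d+|[IVX]+)[.,)]\s+\S", text) is not None

def entity_classifier(text: str) -> bool:
    """Toy stand-in: text containing a domain keyword counts as entity-bearing."""
    keywords = ("service", "system")  # hypothetical domain keywords
    return any(k in text.lower() for k in keywords)

def split_nodes(node_texts, is_title, is_entity):
    """Run both classifiers over every node and return the two node sets."""
    title_nodes = [n for n, t in node_texts.items() if is_title(t)]
    entity_nodes = [n for n, t in node_texts.items() if is_entity(t)]
    return title_nodes, entity_nodes

# Nodes are identified here by hypothetical path strings.
node_texts = {
    "/html/body/div[1]/h2": "3. Main subject information",
    "/html/body/div[1]/p": "Hospital liquid oxygen supply service",
    "/html/body/div[2]/p": "Copyright notice",
}
title_nodes, entity_nodes = split_nodes(node_texts, title_classifier, entity_classifier)
```

Nodes that neither classifier accepts, such as the copyright notice, fall into neither set and are ignored by the later steps.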
It should be noted that the entities are whole or partial texts of the minimum constituent elements of the HTML document, such as words or phrases in professional fields. The title is a title of a paragraph, a table, a chapter, or the like to which the entity belongs.
S130, performing entity extraction on the text information corresponding to the entity node set based on the target entity recognition model to obtain entity information of the webpage to be processed.
In this embodiment, the target entity recognition model refers to a model for entity recognition. The entity information refers to entity content contained in the webpage to be processed.
Specifically, the text information corresponding to each node in the entity node set may be input into the target entity recognition model, and entity extraction may be performed to obtain entity information of the to-be-processed web page.
S140, determining the content position information of the webpage to be processed according to the title node set and the entity node set.
In this embodiment, the content position information refers to the information of the node to which the entity information belongs or corresponds, and may be represented by path selection information, such as XPath path selection information. For example, suppose a paragraph to be extracted contains the target entity "hospital liquid oxygen supply service" and is preceded by a representative title such as "three, main subject information"; the node information corresponding to that paragraph is the content position information, which can also be understood as the approximate position of the entity information.
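As an illustration of representing a node by XPath path selection information, the sketch below builds a simple positional XPath for a node of a standard-library ElementTree tree. The helper xpath_of and the sample markup are hypothetical; with lxml, a comparable path could instead be obtained from the tree's own path facility.

```python
import xml.etree.ElementTree as ET

def xpath_of(root, target):
    """Build a positional XPath such as /html/body/div[1] for a node."""
    # ElementTree nodes do not store their parent, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        step = node.tag
        if parent is not None:
            siblings = [c for c in parent if c.tag == node.tag]
            if len(siblings) > 1:  # add an index only when it is needed
                step = f"{node.tag}[{siblings.index(node) + 1}]"
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))

root = ET.fromstring(
    "<html><body>"
    "<div><h2>3. Main subject information</h2>"
    "<p>Hospital liquid oxygen supply service</p></div>"
    "<div><p>Footer</p></div>"
    "</body></html>"
)
content_node = root[0][0]   # the first div, holding the title and the entity
footer_p = root[0][1][0]    # the p inside the second div
content_path = xpath_of(root, content_node)
footer_path = xpath_of(root, footer_p)
```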
Optionally, the content position information of the webpage to be processed may be determined according to the title node set and the entity node set based on a certain rule. For example, it may be determined from the title node set and the entity node set based on a position determination model.
It should be noted that, in the present invention, the execution sequence of S130 and S140 is not specifically limited, and may be executed simultaneously, or S130 is executed first and then S140 is executed, or S140 is executed first and then S130 is executed.
According to the technical scheme, a target tree object and text information corresponding to nodes in the target tree object are obtained by analyzing a webpage to be processed, then the text information corresponding to the nodes in the target tree object is processed respectively based on a target title classifier and a target entity classifier, a title node set and an entity node set are determined from the target tree object according to a processing result, then entity extraction is carried out on the text information corresponding to the entity node set based on a target entity recognition model to obtain entity information of the webpage to be processed, and finally content position information of the webpage to be processed is determined according to the title node set and the entity node set. According to the technical scheme, the title classifier and the entity classifier are introduced, namely the title nodes and the entity nodes are filtered by the double classifiers, and the two auxiliary classifiers can be used for extracting information of webpages with different structures, so that the universality is enhanced; meanwhile, the accuracy of extracting the webpage information can be improved.
Example two
Fig. 2A is a flowchart of a web page information extraction method according to a second embodiment of the present invention. On the basis of the above embodiment, this embodiment further optimizes "determining content location information of the to-be-processed web page according to the title node set and the entity node set" and provides an optional implementation. As shown in Fig. 2A, the method for extracting web page information of this embodiment may include:
S210, analyzing the webpage to be processed to obtain the target tree object and the text information corresponding to the nodes in the target tree object.
S220, respectively processing the text information corresponding to the nodes in the target tree object based on the target title classifier and the target entity classifier, and determining a title node set and an entity node set from the target tree object according to the processing result.
S230, performing entity extraction on the text information corresponding to the entity node set based on the target entity recognition model to obtain entity information of the webpage to be processed.
S240, carrying out node matching on the title node set and the entity node set to obtain a title entity node pair set.
In this embodiment, a title-entity node pair refers to a node pair formed by the node of an entity and the node of the title corresponding to that entity. The title-entity node pair set refers to the set of title-entity node pairs in the target tree object.
Optionally, node matching may be performed on the title node set and the entity node set based on a node matching rule to obtain the title-entity node pair set. For example, node matching may be performed according to the distance between the nodes in the title node set and the nodes in the entity node set. Specifically, for each entity node in the entity node set, the distance between the entity node and each title node in the title node set is determined, and the title node with the minimum distance is selected as the title node corresponding to the entity node; the entity node and its corresponding title node form one title-entity node pair. For example, if node a is an entity node, node b is a title node, and node p is the lowest common ancestor of node a and node b, the distance between node a and node b may be determined from the depths of the nodes in the target tree object as dist(a, b) = [h(a) - h(p)] + [h(b) - h(p)] = h(a) + h(b) - 2h(p), where h(n) denotes the depth of node n and dist(a, b) denotes the distance between node a and node b. In particular, if node a is a descendant of node b, the lowest common ancestor of node a and node b is node b itself, i.e., p = b, so dist(a, b) = h(a) - h(b).
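The distance-based matching can be sketched as below. The tree is modeled as a plain child-to-parent dictionary with hypothetical node names; depth, lowest_common_ancestor, dist and nearest_title are illustrative helpers, not names taken from the source.

```python
def depth(node, parent_of):
    """h(n): number of edges from the root down to the node."""
    d = 0
    while parent_of.get(node) is not None:
        node = parent_of[node]
        d += 1
    return d

def lowest_common_ancestor(a, b, parent_of):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parent_of.get(node)
    node = b
    while node not in ancestors:
        node = parent_of[node]
    return node

def dist(a, b, parent_of):
    """dist(a, b) = h(a) + h(b) - 2h(p), with p the lowest common ancestor."""
    p = lowest_common_ancestor(a, b, parent_of)
    return depth(a, parent_of) + depth(b, parent_of) - 2 * depth(p, parent_of)

def nearest_title(entity, titles, parent_of):
    """Pick the title node with the minimum tree distance to the entity node."""
    return min(titles, key=lambda t: dist(entity, t, parent_of))

# Hypothetical tree: html > body > (div1 > (h2, p1), div2 > h3)
parent_of = {
    "html": None, "body": "html",
    "div1": "body", "h2": "div1", "p1": "div1",
    "div2": "body", "h3": "div2",
}
pair = (nearest_title("p1", ["h2", "h3"], parent_of), "p1")
```

In this toy tree the entity node p1 is at distance 2 from h2 and distance 4 from h3, so it is paired with h2.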
S250, determining a target content node set from the target tree object according to the title-entity node pair set and the hierarchical relationship between the nodes in the target tree object.
In this embodiment, a target content node refers to an upper-level node that contains both an entity node and the title node corresponding to that entity node. It can also be understood as the node corresponding to the area to which the entity content and its title belong, for example, the node corresponding to the paragraph, or to the table, in which a certain entity content and its title are located.
Optionally, a candidate content node set is determined from the target tree object according to the title-entity node pair set and the hierarchical relationship between the nodes in the target tree object, and the candidate content node set is deduplicated to obtain the target content node set.
Specifically, for each title-entity node pair, the lowest common ancestor of the pair is determined from the target tree object according to the hierarchical relationship between the nodes in the target tree object, and that lowest common ancestor is taken as the candidate content node of the pair. For example, Fig. 2B shows a specific form of the target tree object of a webpage to be processed. For the entity node (entry node) and the title node (subtitle node) shown in Fig. 2B, the lowest common ancestor is tbody, so the tbody node is the candidate content node of this pair.
Further, there may be multiple entity nodes under the same title node. In this case, multiple title-entity node pairs share the same title node, so the same candidate content node may be obtained more than once. Therefore, the candidate content node set may be deduplicated, i.e., repeated candidate content nodes are removed, to obtain the target content node set.
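A minimal sketch of the candidate content node determination and deduplication follows. The tree is again a hypothetical child-to-parent dictionary loosely modeled on a table with a tbody node; the row and cell names are assumptions, not the actual contents of Fig. 2B.

```python
def lowest_common_ancestor(a, b, parent_of):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parent_of.get(node)
    node = b
    while node not in ancestors:
        node = parent_of[node]
    return node

def target_content_nodes(pairs, parent_of):
    """Map each (title, entity) pair to its candidate node, dropping repeats."""
    seen = set()
    result = []
    for title, entity in pairs:
        candidate = lowest_common_ancestor(title, entity, parent_of)
        if candidate not in seen:  # deduplication, order-preserving
            seen.add(candidate)
            result.append(candidate)
    return result

# Hypothetical table: one subtitle row followed by two entity rows.
parent_of = {
    "table": None, "tbody": "table",
    "tr1": "tbody", "tr2": "tbody", "tr3": "tbody",
    "subtitle": "tr1", "entity1": "tr2", "entity2": "tr3",
}
pairs = [("subtitle", "entity1"), ("subtitle", "entity2")]
content_nodes = target_content_nodes(pairs, parent_of)
```

Both pairs share the subtitle node and resolve to the same tbody node, which therefore appears only once in the result.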
S260, determining content position information of the webpage to be processed according to the target content node set.
According to the technical scheme, a target tree object and text information corresponding to nodes in the target tree object are obtained by analyzing a webpage to be processed; the text information corresponding to the nodes in the target tree object is then processed respectively based on a target title classifier and a target entity classifier, and a title node set and an entity node set are determined from the target tree object according to the processing result; entity extraction is then carried out on the text information corresponding to the entity node set based on a target entity recognition model to obtain the entity information of the webpage to be processed; finally, node matching is carried out on the title node set and the entity node set to obtain a title-entity node pair set, a target content node set is determined from the target tree object according to the title-entity node pair set and the hierarchical relationship between the nodes in the target tree object, and the content position information of the webpage to be processed is determined according to the target content node set. In this technical scheme, the target content nodes are determined through the hierarchical relationship between the nodes, so that the content position information, i.e., the area where the entity is located, is determined more accurately.
Example three
Fig. 3 is a flowchart of a method for extracting web page information according to a third embodiment of the present invention. Based on the above embodiments, this embodiment further elaborates the determination manners of the "target title classifier", "target entity classifier", and "target entity recognition model", and provides an alternative implementation. As shown in fig. 3, the method for extracting web page information of this embodiment may include:
S310, analyzing the webpage to be processed to obtain a target tree object and text information corresponding to the nodes in the target tree object.
S320, respectively processing the text information corresponding to the nodes in the target tree object based on the target title classifier and the target entity classifier, and determining a title node set and an entity node set from the target tree object according to the processing result.
In this embodiment, the target title classifier and the target entity classifier may be obtained as follows: training a classifier to be trained according to the sample classification data set to obtain a target title classifier and a target entity classifier; the sample classification data set comprises text information corresponding to at least one node and label information corresponding to at least one node.
The label information is the label of the text information corresponding to a node, that is, the label to which the node belongs, and may be a title label or an entity label. For example, if a paragraph to be extracted describes an item, contains the target entity "hospital liquid oxygen supply service", and is preceded by a representative title such as "three, main subject information", then the words "main subject information" can be labeled with a "title" label and the words "hospital liquid oxygen supply service" with an "entity" label. For another example, if the content to be extracted is a table, one column of the table header is "subject content", and the data in that column, "IVC independent air supply system", is the target entity, then "subject content" and "IVC independent air supply system" in the table are labeled with a "title" label and an "entity" label, respectively.
It should be noted that each sample in the sample classification data set consists of the text information corresponding to a node together with the title label and the entity label corresponding to that text information, i.e., a triple (x(t), y_l, y_e), where x(t) denotes the text of node t, y_l denotes the title label, and y_e denotes the entity label.
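One possible in-memory representation of the triple (x(t), y_l, y_e) is sketched below. The Sample class, its field names and the split_for_training helper are assumptions for illustration; the source does not prescribe a storage format or a particular classifier.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Sample:
    text: str        # x(t): text of node t
    is_title: bool   # y_l: title label
    is_entity: bool  # y_e: entity label

def split_for_training(samples: List[Sample]) -> Tuple[list, list]:
    """Derive one (text, label) training set per classifier from the triples."""
    title_data = [(s.text, s.is_title) for s in samples]
    entity_data = [(s.text, s.is_entity) for s in samples]
    return title_data, entity_data

samples = [
    Sample("three, main subject information", True, False),
    Sample("hospital liquid oxygen supply service", False, True),
    Sample("subject content", True, False),
    Sample("IVC independent air supply system", False, True),
]
title_data, entity_data = split_for_training(samples)
```

Each derived list can then be fed to whichever classifier is to be trained for the corresponding task.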
Specifically, classifiers to be trained can be trained respectively according to the sample classification data to obtain a target title classifier and a target entity classifier.
S330, performing entity extraction on the text information corresponding to the entity node set based on the target entity recognition model to obtain entity information of the webpage to be processed.
In this embodiment, the sample data in the sample classification data set is additionally annotated with a dedicated labeling tool, so that the labeling result includes the start and end character positions of the labeled area and the path selection information relative to the DOM tree. For example, the sample classification data may be labeled with label-studio, an open-source data labeling system that supports various data formats. It can render an imported HTML document to be labeled, lets the user highlight a target area by dragging the mouse to select characters, and exports the labeling result with metadata giving the start and end character positions of the labeled area and the XPath path selection information relative to the DOM tree.
Specifically, data with tag information as an entity tag is extracted from a sample classification dataset to construct a sample identification dataset, namely (x (e), r (e)), wherein e represents a node of the entity tag, and x (e) and r (e) respectively represent an original text of the node e (i.e. text information corresponding to the node) and a text of a labeling result (i.e. text information including the labeling result (such as start-stop character position of a labeling area and XPath path selection information relative to a DOM tree)). And then, training the entity recognition model to be trained by adopting the sample recognition data set to obtain a target entity recognition model.
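Constructing the sample recognition data set of pairs (x(e), r(e)) can be sketched as follows. The record layout (dictionaries with "text", "label" and "annotation" keys), the Annotation class and the exclusive end offset are all assumptions made for this illustration; they do not reproduce the actual export format of any labeling tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    xpath: str  # path selection information relative to the DOM tree
    start: int  # start character position of the labeled area
    end: int    # end character position (assumed exclusive here)

def build_recognition_dataset(labeled_nodes):
    """Keep only entity-labeled nodes and pair each text x(e) with r(e)."""
    dataset = []
    for node in labeled_nodes:
        if node["label"] != "entity":
            continue
        dataset.append((node["text"], node["annotation"]))
    return dataset

text = "Supplier: IVC independent air supply system"
labeled_nodes = [
    {"text": "subject content", "label": "title", "annotation": None},
    {
        "text": text,
        "label": "entity",
        "annotation": Annotation(
            xpath="/html/body/table/tbody/tr[2]/td[2]",  # hypothetical path
            start=text.index("IVC"),
            end=len(text),
        ),
    },
]
dataset = build_recognition_dataset(labeled_nodes)
x_e, r_e = dataset[0]
span = x_e[r_e.start:r_e.end]  # the labeled entity text inside the node text
```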
S340, determining the content position information of the webpage to be processed according to the title node set and the entity node set.
According to the technical scheme, a target tree object and text information corresponding to nodes in the target tree object are obtained by analyzing a webpage to be processed, then the text information corresponding to the nodes in the target tree object is processed respectively based on a target title classifier and a target entity classifier, a title node set and an entity node set are determined from the target tree object according to a processing result, then entity extraction is carried out on the text information corresponding to the entity node set based on a target entity recognition model to obtain entity information of the webpage to be processed, and finally content position information of the webpage to be processed is determined according to the title node set and the entity node set. According to the technical scheme, the title classifier and the entity classifier are introduced, namely the title nodes and the entity nodes are filtered by the double classifiers, and the two auxiliary classifiers can be used for extracting information of webpages with different structures, so that the universality is enhanced; meanwhile, the accuracy of extracting the webpage information can be improved.
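The first two steps of this scheme (parsing the page into a tree object with per-node text, then filtering nodes with the two classifiers) can be sketched as follows. This is only an illustration: it uses Python's standard `html.parser`, assumes reasonably well-formed markup, and takes the two classifiers as plain callables; `Node`, `TreeBuilder` and `select_node_sets` are hypothetical names.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly contained in this node


class TreeBuilder(HTMLParser):
    """Builds a simplified tree object; assumes explicit closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text = (self.current.text + " " + text).strip()


def iter_nodes(node):
    """Yield nodes in document order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def select_node_sets(root, is_title, is_entity):
    """Apply both classifiers to every node that has text of its own."""
    title_nodes, entity_nodes = [], []
    for node in iter_nodes(root):
        if not node.text:
            continue
        if is_title(node.text):
            title_nodes.append(node)
        if is_entity(node.text):
            entity_nodes.append(node)
    return title_nodes, entity_nodes


page = ("<html><body><div><h2>Trial update</h2><p>Sponsor: Acme Pharma</p></div>"
        "<p>Contact us</p></body></html>")
builder = TreeBuilder()
builder.feed(page)
builder.close()
# Trivial rule-based stand-ins for the trained title and entity classifiers.
titles, entities = select_node_sets(
    builder.root,
    is_title=lambda text: text.startswith("Trial"),
    is_entity=lambda text: "Sponsor:" in text,
)
```

Because the classifiers see only node text, the same filtering code applies regardless of how a given site nests its markup, which is the generality argument made above.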
It should be noted here that the invention applies where the entity extraction task and the content extraction task of a web page are performed together, and where the extracted content must have a representative title feature and contain the extracted entity. Protection is sought for the invention under these conditions.
Example four
Fig. 4 is a schematic structural diagram of a web page information extraction apparatus according to a fourth embodiment of the present invention. The embodiment is applicable to the situation of how to extract the webpage information, the webpage information extracting device can be implemented in the form of hardware and/or software, and the webpage information extracting device can be integrated in an electronic device bearing a webpage information extracting function, such as a server. As shown in fig. 4, the web page information extraction apparatus of the present embodiment may include:
the webpage information analyzing module 410 is configured to analyze a webpage to be processed to obtain a target tree object and text information corresponding to a node in the target tree object;
a node set determining module 420, configured to process text information corresponding to a node in a target tree object based on a target title classifier and a target entity classifier, and determine a title node set and an entity node set from the target tree object according to a processing result;
the entity information determining module 430 is configured to perform entity extraction on text information corresponding to the entity node set based on the target entity identification model to obtain entity information of the to-be-processed web page;
and the content position information determining module 440 is configured to determine content position information of the to-be-processed web page according to the title node set and the entity node set.
According to the technical scheme, a target tree object and text information corresponding to nodes in the target tree object are obtained by analyzing a webpage to be processed, then the text information corresponding to the nodes in the target tree object is processed respectively based on a target title classifier and a target entity classifier, a title node set and an entity node set are determined from the target tree object according to a processing result, then entity extraction is carried out on the text information corresponding to the entity node set based on a target entity recognition model to obtain the entity information of the webpage to be processed, and finally content position information of the webpage to be processed is determined according to the title node set and the entity node set. According to the technical scheme, the title classifier and the entity classifier are introduced, namely the title nodes and the entity nodes are filtered by the double classifiers, and the two auxiliary classifiers can be used for extracting information of webpages with different structures, so that the universality is enhanced; meanwhile, the accuracy of extracting the webpage information can be improved.
Optionally, the content location information determining module 440 includes:
the node pair determining unit is used for carrying out node matching on the title node set and the entity node set to obtain a title entity node pair set;
a content node set determining unit, configured to determine a target content node set from the target tree object according to the hierarchical relationship between the title entity node pair set and the nodes in the target tree object;
and the content position information determining unit is used for determining the content position information of the webpage to be processed according to the target content node set.
Optionally, the node pair determining unit is specifically configured to:
and performing node matching on the title node set and the entity node set according to the distance between the nodes in the title node set and the nodes in the entity node set to obtain the title entity node pair set.
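A minimal sketch of distance-based matching follows. The distance measure is not defined in this passage, so the sketch assumes each node is represented by its index in document order and pairs every entity node with the title node whose index is closest; both the measure and the function name are assumptions.

```python
def match_title_entity_pairs(title_positions, entity_positions):
    """Pair each entity node with its nearest title node by document-order index."""
    pairs = []
    for entity in entity_positions:
        if not title_positions:
            break  # no title nodes, so nothing can be paired
        nearest = min(title_positions, key=lambda title: abs(title - entity))
        pairs.append((nearest, entity))
    return pairs


# Titles at positions 2 and 10; entities at positions 3, 4 and 12.
pairs = match_title_entity_pairs([2, 10], [3, 4, 12])
```

A tree-based distance (for example, the number of edges between two nodes) could be substituted for the index difference without changing the structure of the loop.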
Optionally, the content node set determining unit is specifically configured to:
determining a candidate content node set from the target tree object according to the hierarchical relationship between the title entity node pair set and the nodes in the target tree object;
and carrying out duplicate removal processing on the candidate content node set to obtain a target content node set.
Optionally, the content node set determining unit is further specifically configured to:
for each title entity node pair, determining the minimum common ancestor node of the title entity node pair from the target tree object according to the hierarchical relationship between the nodes in the target tree object;
and taking the minimum common ancestor node as a candidate content node of the title entity node pair.
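The candidate selection and deduplication described above can be sketched as follows, assuming each node keeps a reference to its parent. Deduplication is interpreted here as dropping repeated ancestor nodes while preserving order, which is an assumption; the class and function names are hypothetical.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def ancestors(node):
    """Return the path from the node up to the root, the node itself included."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def lowest_common_ancestor(a, b):
    """First node on b's upward path that also lies on a's upward path."""
    seen = {id(n) for n in ancestors(a)}
    for node in ancestors(b):
        if id(node) in seen:
            return node
    return None  # the nodes belong to different trees


def target_content_nodes(pairs):
    """Candidate content node per pair, then order-preserving deduplication."""
    unique, seen = [], set()
    for title, entity in pairs:
        candidate = lowest_common_ancestor(title, entity)
        if candidate is not None and id(candidate) not in seen:
            seen.add(id(candidate))
            unique.append(candidate)
    return unique


# Toy tree: body contains div1 (h2_a, p_a1, p_a2) and div2 (h2_b, p_b).
body = Node("body")
div1, div2 = Node("div1", body), Node("div2", body)
h2_a, p_a1, p_a2 = Node("h2_a", div1), Node("p_a1", div1), Node("p_a2", div1)
h2_b, p_b = Node("h2_b", div2), Node("p_b", div2)
content = target_content_nodes([(h2_a, p_a1), (h2_a, p_a2), (h2_b, p_b)])
```

Two pairs that share a title resolve to the same ancestor here, which shows why a deduplication pass is needed after candidate selection.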
Optionally, the apparatus further comprises:
the classifier determining module is used for training a classifier to be trained according to the sample classification data set to obtain a target title classifier and a target entity classifier; the sample classification dataset comprises text information corresponding to at least one node and label information corresponding to at least one node.
Optionally, the apparatus further comprises an entity identification model determining module, configured to:
extracting, from the sample classification data set, the data whose label information is an entity label, and constructing a sample recognition data set;
and training the entity recognition model to be trained by adopting the sample recognition data set to obtain the target entity recognition model.
The webpage information extraction device provided by the embodiment of the invention can execute the webpage information extraction method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 5 is a schematic structural diagram of an electronic device 10 that implements the web page information extraction method and can be used to implement the embodiments of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 11 performs the various methods and processes described above, such as a web page information extraction method.
In some embodiments, the web page information extraction method may be implemented as a computer program that is tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the above-described web page information extraction method may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the web page information extraction method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired result of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for extracting web page information is characterized by comprising the following steps:
analyzing a webpage to be processed to obtain a target tree object and text information corresponding to nodes in the target tree object;
respectively processing text information corresponding to nodes in the target tree object based on a target title classifier and a target entity classifier, and determining a title node set and an entity node set from the target tree object according to a processing result;
performing entity extraction on text information corresponding to the entity node set based on the target entity identification model to obtain entity information of the webpage to be processed;
and determining the content position information of the webpage to be processed according to the title node set and the entity node set.
2. The method of claim 1, wherein the determining content location information of the to-be-processed web page according to the title node set and the entity node set comprises:
performing node matching on the title node set and the entity node set to obtain a title entity node pair set;
determining a target content node set from the target tree object according to the title entity node pair set and the hierarchical relationship between the nodes in the target tree object;
and determining the content position information of the webpage to be processed according to the target content node set.
3. The method of claim 2, wherein the performing node matching on the title node set and the entity node set to obtain a title entity node pair set comprises:
and performing node matching on the title node set and the entity node set according to the distance between the nodes in the title node set and the nodes in the entity node set to obtain the title entity node pair set.
4. The method of claim 2, wherein determining the target set of content nodes from the target tree object based on the set of title entity node pairs and a hierarchical relationship between nodes in the target tree object comprises:
determining a candidate content node set from the target tree object according to the title entity node pair set and the hierarchical relationship between nodes in the target tree object;
and performing deduplication processing on the candidate content node set to obtain a target content node set.
5. The method of claim 4, wherein determining the set of candidate content nodes from the target tree object according to the set of title entity node pairs and the hierarchical relationship between the nodes in the target tree object comprises:
for each title entity node pair, determining the minimum common ancestor node of the title entity node pair from the target tree object according to the hierarchical relationship between the nodes in the target tree object;
and taking the minimum common ancestor node as a candidate content node of the title entity node pair.
6. The method of claim 1, further comprising:
training a classifier to be trained according to the sample classification data set to obtain a target title classifier and a target entity classifier; the sample classification dataset comprises text information corresponding to at least one node and label information corresponding to the at least one node.
7. The method of claim 1, further comprising:
extracting, from the sample classification dataset, the data whose label information is an entity label, and constructing a sample recognition dataset;
and training the entity recognition model to be trained by adopting the sample recognition data set to obtain a target entity recognition model.
8. A web page information extraction device, characterized by comprising:
the webpage information analysis module is used for analyzing a webpage to be processed to obtain a target tree object and text information corresponding to nodes in the target tree object;
the node set determining module is used for respectively processing text information corresponding to the nodes in the target tree object based on the target title classifier and the target entity classifier, and determining a title node set and an entity node set from the target tree object according to a processing result;
the entity information determining module is used for performing entity extraction on the text information corresponding to the entity node set based on the target entity identification model to obtain the entity information of the webpage to be processed;
and the content position information determining module is used for determining the content position information of the webpage to be processed according to the title node set and the entity node set.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the web page information extraction method of any one of claims 1-7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions for causing a processor to implement the web page information extraction method according to any one of claims 1 to 7 when executed.
CN202210959612.8A 2022-08-10 2022-08-10 Webpage information extraction method, device, equipment and storage medium Pending CN115238078A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210959612.8A CN115238078A (en) 2022-08-10 2022-08-10 Webpage information extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210959612.8A CN115238078A (en) 2022-08-10 2022-08-10 Webpage information extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115238078A true CN115238078A (en) 2022-10-25

Family

ID=83678750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210959612.8A Pending CN115238078A (en) 2022-08-10 2022-08-10 Webpage information extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115238078A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115757823A (en) * 2022-11-10 2023-03-07 魔方医药科技(苏州)有限公司 Data processing method and device, electronic equipment and storage medium
CN115757823B (en) * 2022-11-10 2024-03-05 魔方医药科技(苏州)有限公司 Data processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US20150067476A1 (en) Title and body extraction from web page
CN110413787B (en) Text clustering method, device, terminal and storage medium
CN110020312B (en) Method and device for extracting webpage text
CN112650910B (en) Method, device, equipment and storage medium for determining website update information
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN112989235A (en) Knowledge base-based internal link construction method, device, equipment and storage medium
CN114692628A (en) Sample generation method, model training method, text extraction method and text extraction device
CN113935339A (en) Translation method, translation device, electronic equipment and storage medium
CN114092948B (en) Bill identification method, device, equipment and storage medium
CN114218951B (en) Entity recognition model training method, entity recognition method and device
CN113836316B (en) Processing method, training method, device, equipment and medium for ternary group data
CN115238078A (en) Webpage information extraction method, device, equipment and storage medium
CN104572874B (en) A kind of abstracting method and device of webpage information
CN113408660A (en) Book clustering method, device, equipment and storage medium
CN115600592A (en) Method, device, equipment and medium for extracting key information of text content
CN115331247A (en) Document structure identification method and device, electronic equipment and readable storage medium
CN114860867A (en) Training document information extraction model, and document information extraction method and device
CN114417862A (en) Text matching method, and training method and device of text matching model
CN113361522A (en) Method and device for determining character sequence and electronic equipment
CN113221566A (en) Entity relationship extraction method and device, electronic equipment and storage medium
CN115481240A (en) Data asset quality detection method and detection device
CN113239149A (en) Entity processing method, entity processing device, electronic equipment and storage medium
CN113407890B (en) Information extraction method, device, electronic equipment and medium
CN114492409B (en) Method and device for evaluating file content, electronic equipment and program product
CN114401419B (en) Video-based content generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination