CN113343140B - Method for automatically extracting webpage text content based on neo4j graphic database - Google Patents

Publication number
CN113343140B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010138403.8A
Other languages
Chinese (zh)
Other versions
CN113343140A (en)
Inventor
刘亮
李萧洋
郑荣锋
李孟铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-03-03
Filing date
2020-03-03
Publication date
2022-12-13
Application filed by Sichuan University
Priority to CN202010138403.8A
Publication of CN113343140A
Application granted
Publication of CN113343140B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically extracting webpage text content based on a neo4j graph database. The method comprises the following steps: step S101, acquiring the HTML source code of webpages from an open source channel by using a simulated browser request technique, to serve as a training set; step S102, extracting HTML tags and converting the HTML source code into a tree structure; step S103, traversing all nodes in the tree and extracting triples representing the relationships between nodes; step S104, converting the relation triples into a graph by using a neo4j graph database; step S105, removing redundant nodes in the graph through node compression and branch compression; step S106, extracting multi-dimensional features and training a text node classification model through machine learning; and step S107, extracting the text node of a webpage by using the classification model and restoring the complete webpage text content from the child nodes of the text node in order. The invention provides a simple and easy-to-use implementation method for accurately and efficiently extracting the text content of a webpage.

Description

Method for automatically extracting webpage text content based on neo4j graphic database
Technical Field
The invention relates to the field of computer applications and webpage content extraction, and in particular to a method for automatically extracting webpage text content based on a neo4j graph database.
Background
With the development of network technology, the internet provides an information-sharing platform that crosses the boundaries of space and time, and the text content of webpages is an important source from which people quickly acquire information online. The applications of webpage text content extraction are increasingly broad: ordinary users obtain the information they want directly from webpage text content through search engines, and other work based on webpage processing, such as text mining, artificial intelligence, and search engines, all depends on efficient and accurate acquisition of webpage text content.
Because most existing websites use specific templates or styles unrelated to the text content to improve readability, the text content of a webpage is often mixed with webpage noise such as advertisement links and navigation templates. Extracting the text content quickly and accurately from this noise reduces the burden on the general public of acquiring information and helps other applications that provide services based on webpage content to work more efficiently. Therefore, how to automatically extract text content from webpages of complex design and apply it to a specific field is a problem that those skilled in the art urgently need to solve.
Existing methods for extracting the body content of a webpage can be classified into three categories according to the basis used for extraction: algorithms based on text information, algorithms based on visual information, and algorithms based on the Document Object Model (DOM).
The main idea of the text-information-based extraction algorithm is as follows: if a webpage is divided into several areas, the text density of the body area is much higher than that of the other areas; furthermore, if the entire webpage is converted to text, the lines containing the body content are generally close to one another and contain a large number of punctuation marks. Although this method is simple to implement, text near the body content, such as copyright statements, may also be identified as body text, so the method has certain limitations in practical application.
The main idea of the visual-information-based extraction algorithm is as follows: when browsing a webpage, a user treats a semantic block as a single object and often uses visual information, such as font size, font color, background, and table lists, to distinguish semantic blocks. The visual information is combined with the DOM, the webpage is divided into several blocks, and the proportion of text nodes to leaf nodes in each block is calculated to judge whether the block is a body-text block. However, this method needs to acquire the visual factors of the page, so the amount of computation is large; furthermore, if the visual factors of the page are controlled by separate files such as CSS, extraction efficiency may be low.
The main idea of the DOM-based extraction algorithm is as follows: a DOM tree is generated by extracting the tags in the HTML source code, the webpage is divided into several blocks, and the relationships among the nodes in the tree and the content they store are examined to extract the text content of the webpage. This method is mostly suitable for websites with a clean coding style and consistent layout; since HTML focuses on how to display information rather than on how to divide a webpage into blocks, the method generalizes poorly.
Disclosure of Invention
In view of the defects of the existing schemes, the present application aims to provide an implementation method for automatically extracting the text content of a webpage that is simple to operate, efficient, and accurate.
To solve the above technical problem, the present application provides a method for automatically extracting the text content of a webpage based on a neo4j graph database, which includes the following steps:
Step S101: using HTML source code acquired from an open source channel, an HTML processing technique removes CSS styles and other content unrelated to the article text, generating an HTML text file that contains only HTML tags. Tags such as <div> and <table> in the HTML text are then obtained with an HTML processing technique, and the source HTML is converted into a tree structure according to the hierarchical relationship of the tags.
Step S102: all nodes in the tree are traversed, and triples representing the relationships among the tags are extracted according to the connection relationships between nodes and the order of the child nodes. The relation triple has the structure (src, r, dst), where "src" and "dst" each represent a node in the tree structure and "r" represents the relationship between the two nodes. "src" contains the label and the unique identifier of the parent node in the tree; "dst" contains the label and the unique identifier of the child node, the specific content stored by the child node, and the position of the child node within its set of sibling nodes.
Step S103: the relation triples are converted into a graph structure by using a neo4j graph database, and some redundant end nodes are removed.
Step S104: in the graph structure, the empty nodes directly connected to end nodes are divided into two types according to the number of end nodes connected to them, and node compression or branch compression is performed accordingly to obtain the compressed graph structure.
Step S105: the node quantity feature and the average text length feature are extracted from the compressed graph.
Step S106: the features are combined into a feature vector, a text node classification model is trained using an MLP model, and the nodes in a webpage are classified with this model to extract the text node of the webpage.
Step S107: based on the extracted text node, the contents of its child nodes are restored in order to extract the complete text content of the webpage.
Further, in step S102, the relation triples are generated as follows:
Step S201: for the preprocessed and converted tree structure, all nodes in the tree are traversed in a loop, the left-to-right order of all child nodes under each node is recorded, and for all nodes under the <html> tag, records of the form {parent node, 'connection relation', [child node, x-th child node]} are generated;
Step S202: the records are converted into relation triples and stored in a neo4j graph database. The relation triple has the structure (src, r, dst), where "src" contains the label and the unique identifier of the parent node in the tree, and "dst" contains the label and the unique identifier of the child node, the specific content stored by the child node, and the position of the child node within its set of sibling nodes.
Still further, in step S103, the graph structure is generated as follows:
Step S301: the relation triples are stored in a neo4j graph database, with "src" and "dst" as the vertices of the graph and "r" as the edges, representing the connection relationships between nodes, to generate the graph structure corresponding to the HTML text;
Step S302: in the generated graph structure, all nodes containing body text are end nodes of the graph; end nodes containing only empty text, and the nodes without the context attribute that their removal produces (namely the parent nodes of the original end nodes), are removed in a loop.
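As an illustration of step S301, the following minimal sketch builds a parameterized Cypher statement for one relation triple. The node label `Tag`, the property names `label` and `content`, and the function name `triple_to_cypher` are assumptions made for this example; the patent does not specify the Cypher it uses. Only `unique_id`, `child_sequence`, and the `subtag` relationship come from the description.

```python
def triple_to_cypher(src, r, dst):
    """Return a parameterized Cypher statement and its parameters for one triple."""
    # The relationship type is interpolated into the query text rather than
    # passed as a parameter, so it is validated first.
    if not r.isidentifier():
        raise ValueError(f"unsafe relationship type: {r!r}")
    query = (
        "MERGE (s:Tag {unique_id: $src_id}) "
        "MERGE (d:Tag {unique_id: $dst_id}) "
        f"MERGE (s)-[:{r}]->(d) "
        "SET s.label = $src_label, d.label = $dst_label, "
        "d.content = $content, d.child_sequence = $child_sequence"
    )
    params = {
        "src_id": src["unique_id"],
        "src_label": src["label"],
        "dst_id": dst["unique_id"],
        "dst_label": dst["label"],
        "content": dst["content"],
        "child_sequence": dst["child_sequence"],
    }
    return query, params


query, params = triple_to_cypher(
    {"label": "div", "unique_id": 1},
    "subtag",
    {"label": "p", "unique_id": 2, "content": "Hello", "child_sequence": 1},
)
```

With the official neo4j Python driver, the returned pair could be passed to `session.run(query, params)`; using `MERGE` rather than `CREATE` is a design choice of this sketch so that a node shared by several triples is stored only once.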
Further, in step S104, node compression and branch compression are performed as follows:
Step S401: for an empty node connected to a single end node in the graph structure, the single end node is connected directly to the parent node of the empty node and the empty node is deleted; this is called node compression.
Step S402: for an empty node connected to two or more end nodes in the graph structure, the empty node is connected directly to its grandparent node and its empty parent node is deleted; this is called branch compression.
Further, in step S105, the process of extracting the multidimensional feature includes:
step S501: the number of the non-empty nodes connected with the text nodes is far smaller than that of the non-empty nodes connected with the non-text nodes, and the nodes with more connected non-empty nodes are more likely to be the text nodes;
step S502: because there is only one text node in one web page and the average text length of the nodes containing the recommended content in the web page is far shorter than that of normal text nodes, the nodes with longer average text length are more likely to be the text nodes.
The invention provides a webpage text content extraction method that comprises: preprocessing the HTML source code of a webpage, removing the style templates unrelated to the text content, and generating an HTML text file; extracting the tags in the HTML text and generating the tree structure of the source HTML with an HTML processing technique; traversing all nodes in the tree to extract the relationships among the nodes and the order of the child nodes, and generating relation triples; converting the relation triples into a graph by using a neo4j graph database, and performing node compression and branch compression on some of the nodes in the graph to remove redundant empty nodes; extracting features from the graph with natural language processing techniques, training an MLP classification model, and separating the text node from the large number of other nodes; and restoring the contents of all child nodes of the text node in order to extract the complete text content of the webpage.
Drawings
Fig. 1 is a flowchart of a method for extracting text content of a web page according to an embodiment of the present application.
Fig. 2 is a storage format diagram of a relationship triplet generated in the present application.
Fig. 3 is a schematic diagram of node compression described in the present application.
Fig. 4 is a schematic diagram of branch compression described in the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a method for extracting text content of a web page according to an embodiment of the present application.
The method comprises the following specific steps.
Step S101: the HTML source code of a webpage is acquired from an open source channel by using a simulated browser request technique, the HTML source code is preprocessed to obtain the HTML tags, and the webpage source code is converted into a tree structure.
Template styles such as CSS that are unrelated to the text content are removed with an HTML file processing technique, keeping only the plain HTML text; tags such as <div> and <table> in the HTML text are extracted using Beautiful Soup, and a tree structure is generated according to the hierarchical relationship among the tags.
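The preprocessing and tree construction of step S101 can be sketched as follows. The description uses Beautiful Soup; to stay dependency-free, this sketch substitutes Python's standard-library `html.parser`, so it is a stand-in and not the patent's implementation. The class name `TreeBuilder`, the function `html_to_tree`, and the dictionary keys `tag`, `text`, and `children` are assumptions made for this example.

```python
from html.parser import HTMLParser

SKIPPED = {"style", "script"}  # content unrelated to the body text
VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree whose nodes are {"tag", "text", "children"}."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "text": "", "children": []}
        self.stack = [self.root]
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in VOID:
            return
        node = {"tag": tag, "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID:
            return
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Keep only visible text outside <style> and <script>.
        if not self.skip_depth and data.strip():
            self.stack[-1]["text"] += data.strip()


def html_to_tree(source):
    """Parse HTML source into a tag tree with styles and scripts dropped."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


tree = html_to_tree(
    "<html><head><style>p{color:red}</style></head>"
    "<body><div><p>Hello</p><p>World</p></div></body></html>"
)
```

Here `tree["children"][0]` is the `<html>` node, and the two `<p>` nodes under the `<div>` keep their left-to-right order, which the later steps rely on.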
Step S102: the nodes in the tree are traversed, and the triples are extracted according to the connection relationships among the nodes and the order of the child nodes.
First, from the generated tree structure, all first-level nodes under the original <html> tag are obtained, that is, the child nodes directly connected to the <html> tag in the tree structure. The relationship between each first-level node and the <html> tag is recorded as subtag, and the order of the first-level nodes is recorded.
Next, all child nodes under the first-level nodes are traversed in a loop; the relationship between each node and its child nodes is recorded as subtag, and the order of the child nodes is recorded.
Then the result is converted into relation triples and stored in a neo4j graph database in the format shown in Fig. 2. In Fig. 2, "src" and "dst" each represent a node in the tree structure, and "r" represents the connection relationship between the two nodes. "src" contains the label and the unique identifier of the parent node in the tree; "dst" contains the label and the unique identifier of the child node, the specific content stored by the child node, and the position of the child node within its set of sibling nodes. unique_id serves as the identifier, and child_sequence represents the order of the child node.
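The traversal described in step S102 can be sketched as below. The input is assumed to be a nested dictionary tree with the hypothetical keys `tag`, `text`, and `children`; the key names `label` and `content` are likewise assumptions, while `unique_id`, `child_sequence`, and the `subtag` relationship follow the description.

```python
from itertools import count


def extract_triples(root):
    """Walk the tree and emit one (src, r, dst) triple per parent-child edge."""
    ids = count(1)  # source of unique identifiers
    triples = []
    stack = [(root, next(ids))]
    while stack:
        parent, parent_id = stack.pop()
        src = {"label": parent["tag"], "unique_id": parent_id}
        for order, child in enumerate(parent["children"], start=1):
            child_id = next(ids)
            dst = {
                "label": child["tag"],
                "unique_id": child_id,
                "content": child["text"],
                "child_sequence": order,  # left-to-right position among siblings
            }
            triples.append((src, "subtag", dst))
            stack.append((child, child_id))
    return triples


page = {
    "tag": "html", "text": "", "children": [
        {"tag": "body", "text": "", "children": [
            {"tag": "p", "text": "A", "children": []},
            {"tag": "p", "text": "B", "children": []},
        ]},
    ],
}
triples = extract_triples(page)
```

Each node receives its identifier when it is first seen as a child, so the `unique_id` in a triple's `dst` matches the `unique_id` in the `src` of the triples that describe that node's own children.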
Step S103: the relation triples are converted into a graph through a neo4j graph database, and the graph is simplified.
src and dst in the relation triples are taken as the vertices of the graph and r as the edges, generating the graph structure corresponding to the HTML text.
Owing to the structural characteristics of HTML, all nodes containing text in the generated graph structure are necessarily end nodes of the graph. On this basis, end nodes whose text is empty, and the nodes without the context attribute that their removal produces (i.e., the parent nodes of the original end nodes), are removed in a loop. The main purpose of this step is to reduce redundancy and the complexity of subsequent processing.
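The looped removal of empty end nodes can be sketched on an in-memory tree as follows. This is an illustrative stand-in that works on nested dictionaries (hypothetical keys `tag`, `text`, `children`) rather than on the neo4j graph, and it uses post-order recursion in place of an explicit loop: a parent is examined only after its children have been pruned, so a parent left empty by the removal of its children is removed as well.

```python
def prune(node):
    """Return a copy of node without empty end nodes, or None if node is removable."""
    kept = [c for c in (prune(child) for child in node["children"]) if c is not None]
    if not kept and not node["text"].strip():
        return None  # an end node (possibly newly exposed) with empty text
    return {**node, "children": kept}


tree = {
    "tag": "div", "text": "", "children": [
        {"tag": "span", "text": "", "children": []},
        {"tag": "p", "text": "Hello", "children": []},
        {"tag": "div", "text": "", "children": [
            {"tag": "span", "text": "  ", "children": []},
        ]},
    ],
}
pruned = prune(tree)
```

In this example only the `<p>` node survives under the outer `<div>`: the empty `<span>` is an empty end node, and the inner `<div>` becomes one once its own empty `<span>` is gone.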
Step S104: node compression and branch compression are performed on the graph to further remove redundant nodes.
In an HTML structure, nodes in a parallel structural relationship are attached to the same node, and nodes related to the same subject are attached to the same node or to nearby nodes. All nodes containing text information are therefore associated with one and the same node, either directly or within a certain distance. Based on this idea, the compression step eliminates the data scattering introduced by the HTML tag structure.
(1) The specific steps of node compression are shown in fig. 3.
For an empty node connected to a single end node in the graph structure, the single end node is connected directly to the parent node of the empty node and the empty node is deleted; this is called node compression. Referring to Fig. 3, empty node B is connected to a single end node, and both its parent node and its grandparent node are empty, so node B is a redundant empty node. Node B is deleted, and the end node is connected directly to the parent node of node B.
(2) The specific steps of branch compression are shown in fig. 4.
For an empty node connected to two or more end nodes in the graph structure, the empty node is connected directly to its grandparent node and its empty parent node is deleted; this is called branch compression. Referring to Fig. 4, empty node C is connected to multiple end nodes, and both its parent node and its grandparent node are empty, so the parent node of node C is a redundant empty node. The parent node of node C is deleted, and node C is connected directly to its grandparent node.
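Both compression rules can be sketched on a nested dictionary tree as below. This is a simplified illustration, not the patent's implementation: it applies the rules bottom-up in a single recursive pass, it does not check that the grandparent is empty as in Figs. 3 and 4, and for branch compression it assumes the deleted parent has the empty node as its only child, since the description does not say what happens to other children. The keys `tag`, `text`, `children` and the helper names are hypothetical.

```python
def is_empty(node):
    return not node["text"].strip()


def is_end(node):
    return not node["children"]


def compress(node):
    """Apply node compression and branch compression bottom-up; returns node."""
    compressed = [compress(child) for child in node["children"]]
    new_children = []
    for child in compressed:
        kids = child["children"]
        if is_empty(child) and len(kids) == 1 and is_end(kids[0]):
            # Node compression: the empty child holds a single end node, so
            # the child is deleted and the end node attaches to this node.
            new_children.append(kids[0])
        elif (is_empty(child) and len(kids) == 1 and is_empty(kids[0])
              and sum(is_end(g) for g in kids[0]["children"]) >= 2):
            # Branch compression: the child's only child is an empty node with
            # two or more end nodes, so the child (that node's parent) is
            # deleted and the node attaches to this node, its grandparent.
            new_children.append(kids[0])
        else:
            new_children.append(child)
    node["children"] = new_children
    return node


def leaf(text):
    return {"tag": "p", "text": text, "children": []}


root = {"tag": "body", "text": "", "children": [
    {"tag": "B", "text": "", "children": [leaf("Hello")]},
    {"tag": "P", "text": "", "children": [
        {"tag": "C", "text": "", "children": [leaf("a"), leaf("b")]},
    ]},
]}
compress(root)
```

After the call, the end node "Hello" hangs directly under `body` (node B is gone), and node C with its two end nodes hangs directly under `body` as well (its parent P is gone).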
Step S105: multi-dimensional features are extracted from the nodes in the graph structure using natural language processing techniques, as follows.
(1) The number of non-empty nodes connected to each node is calculated as feature one. In the generated graph, the number of non-empty nodes connected to text nodes should be much smaller than the number of non-empty nodes connected to non-text nodes.
(2) The average text length of each node is calculated as feature two. A webpage has one and only one text node. However, many websites also contain recommended content or rated content related to the subject; although this content does not belong to the text content of the webpage, it also produces a graph structure. Nodes containing such recommended content have the following characteristic: their average text length is much shorter than that of normal text nodes.
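The two features can be computed along the following lines. This sketch assumes that "connected nodes" means the direct children of a node and that text length is measured in characters; neither detail is stated in the description. The function name `node_features` and the keys `text` and `children` are hypothetical.

```python
def node_features(node):
    """Return (feature one, feature two) for one node of the compressed graph."""
    texts = [child["text"].strip() for child in node["children"]]
    non_empty = [t for t in texts if t]
    count = len(non_empty)  # feature one: number of connected non-empty nodes
    # Feature two: average text length over the connected non-empty nodes.
    avg_len = sum(len(t) for t in non_empty) / count if count else 0.0
    return count, avg_len


candidate = {"text": "", "children": [
    {"text": "abcd", "children": []},
    {"text": "", "children": []},
    {"text": "ef", "children": []},
]}
features = node_features(candidate)
```

For the example node, two of the three children are non-empty and their lengths are 4 and 2, so `features` is `(2, 3.0)`.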
Step S106: the HTML source code of webpages is acquired from an open source channel by using a simulated browser request technique, training samples are collected, the multi-dimensional features are extracted, and an MLP classification model is trained through machine learning.
First, the HTML source code of webpages is acquired from an open source channel by using a simulated browser request technique, and a certain number of webpage samples are collected; the multi-dimensional features of the samples are extracted using the method of step S105, and an MLP classification model is trained on the extracted features. The trained MLP classification model divides all nodes in the graph into text nodes and non-text nodes according to the characteristics of text nodes.
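To make the training step concrete, the sketch below trains a very small MLP (one tanh hidden layer, sigmoid output) with plain stochastic gradient descent on two-dimensional feature vectors. Everything specific here is an assumption: the patent does not give the network size, learning rate, or training library, and the four training samples are invented toy values scaled to the range 0 to 1, not data from the patent.

```python
import math
import random


def train_mlp(samples, labels, hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer MLP with plain SGD; returns a predict function."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        z = sum(w * v for w, v in zip(w2, h)) + b2
        return h, 1.0 / (1.0 + math.exp(-z))

    for _ in range(epochs):
        for x, y in zip(samples, labels):
            h, p = forward(x)
            d_out = p - y  # gradient of the cross-entropy loss w.r.t. z
            for j in range(hidden):
                # Backpropagate through tanh using w2[j] before it is updated.
                d_hidden = d_out * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_hidden
                for i in range(n_in):
                    w1[j][i] -= lr * d_hidden * x[i]
            b2 -= lr * d_out

    return lambda x: forward(x)[1]


# Invented toy vectors (feature one, feature two), scaled to [0, 1];
# label 1 marks a text node and label 0 a non-text node.
samples = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = [0, 0, 1, 1]
predict = train_mlp(samples, labels)
```

`predict(x)` returns a score between 0 and 1, and a node would be labeled a text node when the score exceeds 0.5. In practice a library implementation would likely replace this hand-written loop; the sketch only shows the shape of the computation.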
Step S107: for the extracted text node, the contents of its child nodes are restored in the order recorded by child_sequence in the database, yielding the complete text content.
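The ordered restoration can be sketched as follows, using plain dictionaries in place of database queries. The function name `restore_text`, the `edges` and `content` mappings, and the choice to join fragments with newlines are assumptions made for this example.

```python
def restore_text(node_id, edges, content):
    """Concatenate the content under node_id, following child_sequence order.

    edges maps a parent id to a list of (child_sequence, child id) pairs;
    content maps a node id to the text stored in that node.
    """
    parts = [content.get(node_id, "")]
    for _, child_id in sorted(edges.get(node_id, [])):
        parts.append(restore_text(child_id, edges, content))
    return "\n".join(part for part in parts if part)


# Children are deliberately listed out of order to show the sorting.
edges = {1: [(2, 3), (1, 2)], 3: [(1, 4)]}
content = {2: "First", 3: "Second", 4: "Third"}
body_text = restore_text(1, edges, content)
```

Sorting on `child_sequence` before descending restores the original reading order even when the stored edges come back from the database in arbitrary order.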
The above examples are merely illustrative of the technical solutions of the present invention and not restrictive, and a person skilled in the art may modify the technical solutions of the present invention or substitute them with equivalents without departing from the spirit and scope of the present invention, which should be determined by the claims.

Claims (7)

1. A method for automatically extracting webpage text content based on a neo4j graphic database is characterized by comprising the following steps:
step S101, acquiring HTML source codes of a webpage from an open source channel by using a simulated browser request technology, preprocessing the HTML source codes to acquire HTML labels, and converting the webpage source codes into a tree structure;
step S102, traversing all nodes in the tree structure, and extracting a triple representing the relationship between the nodes according to the connection relationship between the nodes and the sequence relationship between the sub-nodes;
step S103, converting the relation triple structure into a graph structure by using a neo4j graph database;
step S104, in the graph structure, dividing the empty nodes directly connected with the end nodes into two types according to the number of the end nodes connected with the empty nodes, and respectively performing node compression and branch compression;
step S105, extracting node quantity characteristics and average text length characteristics from the compressed graph to generate a characteristic vector;
step S106, using the characteristic vector to perform machine learning, training a text node classification model, and classifying nodes in the webpage by using the classification model so as to automatically extract text nodes in the webpage;
and step S107, sequentially restoring the contents in the child nodes according to the sequence of the child nodes of the extracted text nodes, and extracting the complete text contents of the webpage.
2. The method for automatically extracting the text contents of web pages based on the neo4j graph database as claimed in claim 1, wherein said step S102 further comprises the steps of:
step S201, for the preprocessed and converted tree structure, circularly traversing all nodes in the tree structure, for all child nodes under each node, sequentially recording the sequence from left to right, and for all nodes under the <html> label, generating records in the form of {parent node, 'connection relation', [child node, x-th child node]};
step S202, converting the records into a relation triple structure and storing the relation triple structure into a neo4j graph database, wherein the structure of the relation triple structure is as follows: (src, r, dst), wherein "src" and "dst" respectively represent nodes in the tree structure, and "r" represents a connection relationship between two nodes; the label and the unique identifier of the parent node in the tree are included in the "src", and the label and the unique identifier of the child node in the tree, the specific content stored by the child node, and the sequence of the child node in the child node set where the child node is located are included in the "dst".
3. The method for automatically extracting the text contents of web pages based on the neo4j graphic database as claimed in claim 2, wherein said step S103 further comprises the steps of:
step S301, storing the relation triple into a neo4j graph database, wherein the "src" and the "dst" are used as vertexes of a graph, and the "r" is used as an edge of the graph, so as to generate a graph structure corresponding to an HTML text;
step S302, in the generated graph structure, all nodes containing the text of the body exist in the end node of the graph, and the end node only containing the empty text and the nodes without the context attribute generated by the end node are circularly removed.
4. The method for automatically extracting the text contents of web pages based on the neo4j graphic database as claimed in claim 1, wherein said step S104 further comprises the steps of:
step S401, for the empty node connected with the single end node in the graph structure, directly connecting the single end node with the parent node of the empty node, and deleting the empty node, which is called node compression;
step S402, for the empty nodes connected with two or more end nodes in the graph structure, directly connecting the empty nodes with the grandparent nodes of the empty nodes, and deleting the empty parent nodes of the empty nodes, which is called branch compression.
5. The method for automatically extracting the text contents of web pages based on neo4j graphic database as claimed in claim 1, wherein said step S105 further comprises the steps of:
step S501, calculating the number of nodes which are not empty and connected with each node as a first feature;
step S502, calculating the average text length of the connected nodes of each node as a second characteristic;
step S503, combining the first characteristic and the second characteristic to generate a characteristic vector, and training by using a machine learning model to obtain a text node classification model;
step S504, for the web pages of the text content to be extracted, the features such as feature one and feature two are extracted, the nodes in the graph are divided into text nodes and non-text nodes through a text node classification model, and only one text node can be extracted from each web page.
6. The method for automatically extracting the body content of a web page based on the neo4j graphic database as claimed in claim 1, wherein the step S106 further comprises the steps of:
step S601, extracting the multi-dimensional features in the sample by using the method in the step S105, and training a machine learning model by using the extracted features;
step S602, for the webpage of which the text content is to be extracted, dividing all nodes in the webpage into text nodes and non-text nodes according to the characteristics of the text nodes by using a trained classification model.
7. The method for automatically extracting the text contents of web pages based on the neo4j graph database as claimed in claim 2, wherein said step S107 further comprises the steps of:
and recovering the contents in the child nodes according to the child node sequence recorded in the step S202 for all the child nodes connected with the text node, thereby extracting the text contents of the webpage.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010138403.8A CN113343140B (en) 2020-03-03 2020-03-03 Method for automatically extracting webpage text content based on neo4j graphic database

Publications (2)

Publication Number Publication Date
CN113343140A CN113343140A (en) 2021-09-03
CN113343140B true CN113343140B (en) 2022-12-13

Family

ID=77467355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010138403.8A Active CN113343140B (en) 2020-03-03 2020-03-03 Method for automatically extracting webpage text content based on neo4j graphic database

Country Status (1)

Country Link
CN (1) CN113343140B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023209640A1 (en) * 2022-04-29 2023-11-02 Content Square SAS Determining zone types of a webpage

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
CN102915361A (en) * 2012-10-18 2013-02-06 北京理工大学 Webpage text extracting method based on character distribution characteristic
CN103559202A (en) * 2013-10-08 2014-02-05 北京奇虎科技有限公司 Webpage content extracting device and method
CN103853760A (en) * 2012-12-03 2014-06-11 中国移动通信集团公司 Method and device for extracting contents of bodies of web pages
CN104268148A (en) * 2014-08-27 2015-01-07 中国科学院计算技术研究所 Forum page information auto-extraction method and system based on time strings
WO2016036760A1 (en) * 2014-09-03 2016-03-10 Atigeo Corporation Method and system for searching and analyzing large numbers of electronic documents
CN106557565A (en) * 2016-11-22 2017-04-05 福州大学 Text information extraction method based on website structure
CN107391678A (en) * 2017-07-21 2017-11-24 福州大学 Web page content information extraction method based on clustering
CN109740097A (en) * 2018-12-29 2019-05-10 温州大学瓯江学院 Web page text extraction method based on logical link blocks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
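"FiVaTech: Page-Level Web Data Extraction from Template Pages"; Mohammed Kayed et al.; IEEE Transactions on Knowledge and Data Engineering; 2009-04-17; Vol. 22, No. 2; 249-263 *
"Information extraction algorithm for multi-record complex web pages based on visual blocks"; Wang Weihong et al.; Computer Science; 2019-08-12; Vol. 46, No. 10; 63-70 *
"Main-text information extraction method based on web page clustering"; Wang Yizhou et al.; Journal of Chinese Computer Systems; 2018-01-15; Vol. 39, No. 01; 111-115 *
"Massive Web information extraction method based on node attributes and main-text content"; Wang Haiyan et al.; Journal on Communications; 2016-10-25; Vol. 37, No. 10; 9 *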

Also Published As

Publication number Publication date
CN113343140A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
KR100324456B1 (en) Structured document searching display method and apparatus
CN109492077A Knowledge graph-based question answering method and system for the petrochemical field
Sanoja et al. Block-o-matic: A web page segmentation framework
CN107392143A Accurate resume analysis method based on SVM text classification
WO2017080090A1 (en) Extraction and comparison method for text of webpage
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
CN103810251B (en) Method and device for extracting text
WO2008008213A2 (en) Interactively crawling data records on web pages
CN106960058A Web page structure change detection method and system
CN111737623A (en) Webpage information extraction method and related equipment
CN105740355B (en) Webpage context extraction method and device based on aggregation text density
CN112307303A (en) Efficient and accurate network page duplicate removal system based on cloud computing
CN111428503A (en) Method and device for identifying and processing same-name person
CN117312711A (en) Search engine optimization method and system based on AI analysis
CN116244476A (en) Method and system for realizing pre-labeling front-end visualization based on rich text
CN107436931B (en) Webpage text extraction method and device
CN113343140B (en) Method for automatically extracting webpage text content based on neo4j graphic database
CN109271616A Intelligent extraction method based on normative document questions record characteristic value
CN112989811B (en) History book reading auxiliary system based on BiLSTM-CRF and control method thereof
JPH11110384A (en) Method and device for retrieving and displaying structured document
CN112433995A (en) File format conversion method, system, computer equipment and storage medium
US20080015843A1 (en) Linguistic Image Label Incorporating Decision Relevant Perceptual, Semantic, and Relationships Data
US10628632B2 (en) Generating a structured document based on a machine readable document and artificial intelligence-generated annotations
JP2023010805A (en) Method for training document information extraction model and extracting document information, device, electronic apparatus, storage medium and computer program
CN113434797B (en) Webpage information extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant