CN110851679A - Method and system for extracting webpage text based on text node characteristics - Google Patents


Info

Publication number
CN110851679A
Authority
CN
China
Prior art keywords
text
node
webpage
value
html
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910947241.XA
Other languages
Chinese (zh)
Inventor
杨永全
翟世平
魏志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Qingdao National Laboratory for Marine Science and Technology Development Center
Original Assignee
Ocean University of China
Qingdao National Laboratory for Marine Science and Technology Development Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China, Qingdao National Laboratory for Marine Science and Technology Development Center filed Critical Ocean University of China
Priority to CN201910947241.XA priority Critical patent/CN110851679A/en
Publication of CN110851679A publication Critical patent/CN110851679A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a system for extracting webpage text based on text node characteristics, and belongs to the technical field of the internet. The method comprises: acquiring the HTML source code of the webpage to be extracted; filtering the HTML source code, extracting the key element nodes in the HTML DOM tree of the source code, and constructing a list of the key element nodes; obtaining, for each key element node, the probability that its node value is a text node attribute value, and sorting the key element nodes by that probability; and extracting the text elements of the key element nodes in order of that probability to obtain a candidate webpage text, which is then determined to be the webpage text. The extraction process takes into account the important role that the attribute nodes of HTML DOM tree elements play in marking the text node: the key attribute values (id and class) of the webpage nodes are compared with the characteristic attribute values of text nodes, so that the text node value is located accurately, and the text is then extracted accurately with an HTML parser.

Description

Method and system for extracting webpage text based on text node characteristics
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method and a system for extracting a webpage text based on text node features.
Background
Processing the mass of information on the web has given rise to demands such as intelligent web information retrieval, automatic document summarization and public opinion analysis. Meeting these demands requires collecting and analyzing vast numbers of web pages. Typically, a web crawler captures the original pages, which contain, in addition to the text of interest to the user, various kinds of noise such as advertisement links, tag information, navigation links and comments. This noise greatly reduces the efficiency of web retrieval and also slows human reading. Accurately and efficiently extracting article text from semi-structured, highly heterogeneous HTML source files is therefore of great significance for internet-based data mining, information retrieval and related fields.
With the rapid development of the internet, the volume of data carried by the web grows daily, and problems such as information redundancy, diversity of form and difficulty of processing have become increasingly prominent; web information extraction emerged in response. Because a web page contains a large amount of information unrelated to its subject, users are hindered from quickly locating and obtaining the text content. Extracting the text of a page is therefore very important: it saves users considerable time and effort and also supports applications such as data mining. Web information extraction mainly targets unstructured or semi-structured web pages, and most mainstream approaches are based on the HTML structure. In existing research, however, researchers who focus on HTML elements ignore the influence that the semantic information of attribute tags has on the content of those elements, so the text node cannot be located correctly, the text content is difficult to extract, and extraction efficiency is low.
Webpage text extraction technology:
In the field of webpage text extraction, an HTML page can be parsed into a DOM tree: every tag, piece of text and so on in the page becomes a node of the tree, so data extraction becomes an operation on a tree. Because of this structural advantage, information extraction based on the HTML structure has gradually become the mainstream of research, and methods that extract webpage text based on statistical learning and text features work well. One such method, which handles both single-text and multi-text web pages, constructs the page as a tag tree; obtains, through statistical learning, the paths from the root node to the leaf nodes (which must contain text); automatically learns the text features on those paths; finds the paths that share the same text features in order to locate the text region and the subtree trunk; finds similar subtree trunks in the text region according to the learned text features; and finally prunes the content of the text region to obtain the main information of the page. Although this method extracts text information effectively, the paths must be labeled in advance, the learning process is long, and the method is not suitable for blog pages.
String similarity measurement techniques:
String similarity measurement finds the common substrings of two strings and computes their similarity from the lengths of those substrings using a corresponding formula. String similarity is widely applied, for example in plagiarism detection systems, automatic scoring systems, code plagiarism prevention systems, data cleaning, web search and DNA sequence matching. Many string similarity algorithms exist, such as the edit distance algorithm, the longest common substring algorithm, the Heckel algorithm, the greedy string tiling algorithm and the RKR-GST algorithm. Because their principles differ, they yield different similarity values for the same strings and therefore suit different application fields.
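As an illustration of one of the algorithms named above, the following minimal Python sketch computes the edit (Levenshtein) distance between two strings and turns it into a similarity score. The normalisation by the longer string's length and the function names `edit_distance` and `similarity` are assumptions made for this example; the source does not specify a formula.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one row of the DP table in memory."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

For example, `similarity("content", "contents")` is 0.875, because one insertion separates two strings whose longer length is 8.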
Disclosure of Invention
In order to solve the above problems, the invention provides a method for extracting webpage text based on text node characteristics, which comprises the following steps:
acquiring the HTML source code of the webpage to be extracted;
filtering the HTML source code, extracting the key element nodes in the HTML DOM tree of the HTML source code, and constructing a list of the key element nodes;
comparing the node value of each key element node in the list with the node values in the text node library in sequence, obtaining the probability that the node value of each key element node is a text node attribute value, and sorting the key element nodes by that probability;
and extracting the text elements of the key element nodes in order of that probability to obtain a candidate webpage text, judging whether the candidate webpage text exceeds a preset threshold, and determining the candidate webpage text to be the webpage text when it does not exceed the preset threshold.
Optionally, the method of the invention further comprises:
once the candidate webpage text has been determined to be the webpage text, no text elements are extracted from the remaining key element nodes.
Optionally, the text elements are extracted according to the attribute identifier of the webpage text node, using an HTML parser.
Optionally, the node value of each key element node in the list is compared in sequence with the node values in the text node library using the RKR-GST algorithm or the string edit distance algorithm.
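The filtering and list-building steps above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the source does not say what the filtering removes, so skipping `script`, `style` and `noscript` elements is a guess, and the names `KeyNodeCollector` and `collect_key_nodes` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed noise filter; the source does not list which tags are filtered out.
SKIPPED_TAGS = {"script", "style", "noscript"}


class KeyNodeCollector(HTMLParser):
    """Collects the id and class attribute values of the elements in a page."""

    def __init__(self):
        super().__init__()
        self.key_nodes = []  # entries are (tag, attribute name, attribute value)

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            return
        for name, value in attrs:
            if name in ("id", "class") and value:
                # A class attribute may hold several space-separated names.
                for token in value.split():
                    self.key_nodes.append((tag, name, token))


def collect_key_nodes(html: str):
    collector = KeyNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.key_nodes
```

Each collected attribute value is a candidate "node value" that the later steps compare against the text node library.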
The invention also provides a system for extracting webpage text based on text node characteristics, which comprises the following modules:
an information acquisition module, configured to acquire the HTML source code of the webpage to be extracted;
a filtering module, configured to filter the HTML source code, extract the key element nodes in the HTML DOM tree of the HTML source code, and construct a list of the key element nodes;
a sorting module, configured to compare the node value of each key element node in the list with the node values in the text node library in sequence, obtain the probability that the node value of each key element node is a text node attribute value, and sort the key element nodes by that probability;
an extraction module, configured to extract the text elements of the key element nodes in order of that probability to obtain a candidate webpage text, judge whether the candidate webpage text exceeds a preset threshold, and determine the candidate webpage text to be the webpage text when it does not exceed the preset threshold.
Optionally, the text elements are extracted according to the attribute identifier of the webpage text node, using an HTML parser.
Optionally, the node value of each key element node in the list is compared in sequence with the node values in the text node library using the RKR-GST algorithm or the string edit distance algorithm.
During webpage text extraction, the invention takes into account the important role that the attribute nodes of HTML DOM tree elements play in marking the text node: the key attribute values (id and class) of the webpage nodes are compared with the characteristic attribute values of text nodes, so that the text node value is located accurately, and the text is then extracted accurately with an HTML parser.
Drawings
FIG. 1 is a flow chart of a method for extracting a webpage text based on text node features according to the present invention;
FIG. 2 is a diagram of a system for extracting a web page text based on text node features according to the present invention.
Detailed Description
The method combines the text node characteristics and the HTML parser technology to realize the efficient and accurate extraction of the webpage text.
The invention provides a method for extracting webpage text based on text node characteristics, as shown in figure 1, comprising the following steps:
acquiring an HTML source code of a webpage to be extracted;
filtering the HTML source code, extracting the key element nodes in the HTML DOM tree of the HTML source code, and constructing a list of the key element nodes;
comparing the node value of each key element node in the list with the node values in the text node library in sequence, obtaining the probability that the node value of each key element node is a text node attribute value, and sorting the key element nodes by that probability, where the comparison uses the RKR-GST algorithm or the string edit distance algorithm;
The text node feature comparison in the invention uses the RKR-GST algorithm. GST (Greedy String Tiling) is a token-based method for measuring similarity. Its basic idea is: if two strings are completely equal, the maximum value is returned, which is exactly the length of the string; if the two strings do not match at all, the minimum value 0 is returned. The longer string is generally called the text string and the shorter one the pattern string. KR (Karp-Rabin) is a randomized string matching algorithm that quickly finds the position of the first occurrence of the pattern string in the text string. Its basic idea is: first, a hash value is computed for the fixed-length pattern string with a hash function; second, a hash value is computed with the same hash function for each substring of the text string of that fixed length; third, if the hash value of the pattern string equals that of a text substring, the two can be matched, and otherwise they do not match.
The RKR-GST algorithm combines the advantages of the GST and KR algorithms. Its basic idea is: the elements of the pattern string need not be compared one by one against the text string; a comparison is made only when the hash value of a pattern substring equals the hash value of a text substring, which makes the matching efficient.
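The algorithm needs to compute hash values. Define a base $b$ and a prime number $q$. For a string $c_1 c_2 \ldots c_k$ of length $k$, the hash value is:

$$\mathrm{Hash}(c_1 c_2 \ldots c_k) = \bigl(\mathrm{ask}(c_1) \times b^{k-1} + \mathrm{ask}(c_2) \times b^{k-2} + \cdots + \mathrm{ask}(c_{k-1}) \times b + \mathrm{ask}(c_k)\bigr) \bmod q$$

The hash value of the string $c_2 c_3 \ldots c_{k+1}$ is:

$$\mathrm{Hash}(c_2 c_3 \ldots c_{k+1}) = \bigl((\mathrm{Hash}(c_1 c_2 \ldots c_k) - \mathrm{ask}(c_1) \times b^{k-1}) \times b + \mathrm{ask}(c_{k+1})\bigr) \bmod q$$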
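The two hash formulas can be checked with a short Python sketch. It assumes that ask(c) is the numeric character code of c (Python's `ord`), and the values chosen for the base b and the prime q are arbitrary examples, not values given in the source. The function `find_first` also re-compares the characters after a hash match, a common safeguard against hash collisions that the description above does not mention.

```python
BASE = 256          # assumed value for the base b
PRIME = 1_000_003   # assumed value for the prime q


def full_hash(s: str, b: int = BASE, q: int = PRIME) -> int:
    """Hash(c1...ck) evaluated with Horner's rule, reducing mod q at each step."""
    h = 0
    for ch in s:
        h = (h * b + ord(ch)) % q
    return h


def roll(prev_hash: int, outgoing: str, incoming: str, k: int,
         b: int = BASE, q: int = PRIME) -> int:
    """Derives Hash(c2...ck+1) from Hash(c1...ck) in constant time."""
    high = pow(b, k - 1, q)  # b^(k-1) mod q
    return ((prev_hash - ord(outgoing) * high) * b + ord(incoming)) % q


def find_first(pattern: str, text: str) -> int:
    """Karp-Rabin search: index of the first occurrence of pattern, or -1."""
    k = len(pattern)
    if k == 0 or k > len(text):
        return -1
    target = full_hash(pattern)
    window = full_hash(text[:k])
    for start in range(len(text) - k + 1):
        # Equal hashes only suggest a match; confirm by comparing characters.
        if window == target and text[start:start + k] == pattern:
            return start
        if start + k < len(text):
            window = roll(window, text[start], text[start + k], k)
    return -1
```

Sliding the window by one character therefore costs a constant number of arithmetic operations instead of rehashing all k characters.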
First, RKR-GST creates two token string objects, one for each of the strings being compared;
second, it creates a hash table of the substrings of a specified length of the corresponding token string;
third, it loops to find the substrings whose hash values are identical.
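The three steps can be illustrated with a simplified greedy string tiling sketch in Python. It is not the full RKR-GST algorithm: a Python `dict` keyed on substrings of length `min_match` stands in for the Karp-Rabin hash table, and the search length is not adapted between passes. The function names, the `min_match` parameter and its default value are assumptions made for this example.

```python
from collections import defaultdict


def greedy_string_tiling(pattern: str, text: str, min_match: int = 2):
    """Returns tiles (pattern_start, text_start, length), longest matches first."""
    if min_match < 1:
        raise ValueError("min_match must be at least 1")
    marked_p = [False] * len(pattern)
    marked_t = [False] * len(text)
    tiles = []
    # Hash table from each substring of the text of length min_match to its starts.
    table = defaultdict(list)
    for t in range(len(text) - min_match + 1):
        table[text[t:t + min_match]].append(t)
    while True:
        best_len, best = min_match, []
        # Loop over pattern substrings that share a hash table entry with the text.
        for p in range(len(pattern) - min_match + 1):
            for t in table.get(pattern[p:p + min_match], ()):
                length = 0
                while (p + length < len(pattern) and t + length < len(text)
                       and pattern[p + length] == text[t + length]
                       and not marked_p[p + length]
                       and not marked_t[t + length]):
                    length += 1
                if length > best_len:
                    best_len, best = length, [(p, t, length)]
                elif length == best_len:
                    best.append((p, t, length))
        if not best:
            break
        for p, t, length in best:
            # Skip a match that overlaps a tile placed earlier in this pass.
            if any(marked_p[p:p + length]) or any(marked_t[t:t + length]):
                continue
            for offset in range(length):
                marked_p[p + offset] = True
                marked_t[t + offset] = True
            tiles.append((p, t, length))
    return tiles


def coverage(pattern: str, text: str, min_match: int = 2) -> int:
    """Total number of characters covered by tiles."""
    return sum(length for _, _, length in greedy_string_tiling(pattern, text, min_match))
```

With this sketch, two identical strings yield a single tile whose length equals the string length, and two strings with no common substring of at least `min_match` characters yield a coverage of 0, which matches the maximum and minimum values described for GST.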
Analysis of the RKR-GST algorithm:
Because a hash table is introduced, fewer substrings need to be matched when judging the similarity of two strings, which improves efficiency. The average time complexity lies between $O(n)$ and $O(n^2)$, and the worst-case time complexity is $O(n^3)$.
The text elements of the key element nodes are then extracted in order of the probability that each node value is a text node attribute value. The extraction uses the attribute identifier of the webpage text node together with an HTML parser and yields a candidate webpage text. Whether the candidate webpage text exceeds a preset threshold is then judged, and the candidate is determined to be the webpage text when it does not exceed the preset threshold. Once the candidate webpage text has been determined to be the webpage text, no text elements are extracted from the remaining key element nodes.
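The ranking and extraction step can be sketched as follows, under several loudly stated assumptions. The contents of `TEXT_NODE_LIBRARY` are invented examples; `difflib.SequenceMatcher` is used as a stand-in for the RKR-GST or edit distance comparison; and because the source does not say which quantity is compared with the preset threshold, the check is left to a caller-supplied predicate `accept`. The extractor also assumes well-formed HTML in which every non-void element is explicitly closed.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Hypothetical library of attribute values that typically mark the text node.
TEXT_NODE_LIBRARY = ["content", "article", "main-text", "post-body"]
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "source", "wbr"}


class ElementTextExtractor(HTMLParser):
    """Collects the text inside the first element whose id or class is `value`."""

    def __init__(self, value: str):
        super().__init__()
        self.value = value
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done:
            for name, attr_value in attrs:
                if name in ("id", "class") and attr_value \
                        and self.value in attr_value.split():
                    self.depth = 1
                    break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def score(value: str, library=TEXT_NODE_LIBRARY) -> float:
    """Stand-in similarity: best match of the value against the library."""
    return max(SequenceMatcher(None, value.lower(), ref).ratio() for ref in library)


def extract_text(html: str, candidates, accept):
    """Tries candidates from most to least likely; stops at the first accepted text."""
    ranked = sorted(dict.fromkeys(candidates), key=score, reverse=True)
    for value in ranked:
        parser = ElementTextExtractor(value)
        parser.feed(html)
        parser.close()
        text = " ".join(parser.chunks)
        if text and accept(text):
            return value, text  # remaining candidates are not examined
    return None
```

A caller might pass the id and class values collected from the page as `candidates` and a predicate of its own choosing, for example `extract_text(html, values, accept=lambda text: len(text) >= 10)`, where the length check is purely illustrative.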
The invention also provides a system 200 for extracting webpage text based on text node characteristics, as shown in FIG. 2, which comprises the following modules:
the information acquisition module 201, configured to acquire the HTML source code of the webpage to be extracted;
the filtering module 202, configured to filter the HTML source code, extract the key element nodes in the HTML DOM tree of the HTML source code, and construct a list of the key element nodes;
the sorting module 203, configured to compare the node value of each key element node in the list with the node values in the text node library in sequence, obtain the probability that the node value of each key element node is a text node attribute value, and sort the key element nodes by that probability, where the comparison uses the RKR-GST algorithm or the string edit distance algorithm;
the extraction module 204, configured to extract the text elements of the key element nodes in order of that probability, using the attribute identifier of the webpage text node and an HTML parser, to obtain a candidate webpage text, judge whether the candidate webpage text exceeds a preset threshold, and determine the candidate webpage text to be the webpage text when it does not exceed the preset threshold.
During webpage text extraction, the invention takes into account the important role that the attribute nodes of HTML DOM tree elements play in marking the text node: the key attribute values (id and class) of the webpage nodes are compared with the characteristic attribute values of text nodes, so that the text node value is located accurately, and the text is then extracted accurately with an HTML parser.

Claims (7)

1. A method for extracting webpage text based on text node features, comprising the following steps:
acquiring the HTML source code of the webpage to be extracted;
filtering the HTML source code, extracting the key element nodes in the HTML DOM tree of the HTML source code, and constructing a list of the key element nodes;
comparing the node value of each key element node in the list with the node values in the text node library in sequence, obtaining the probability that the node value of each key element node is a text node attribute value, and sorting the key element nodes by that probability;
and extracting the text elements of the key element nodes in order of that probability to obtain a candidate webpage text, judging whether the candidate webpage text exceeds a preset threshold, and determining the candidate webpage text to be the webpage text when it does not exceed the preset threshold.
2. The method of claim 1, further comprising:
once the candidate webpage text has been determined to be the webpage text, extracting no text elements from the remaining key element nodes.
3. The method of claim 1, wherein the text elements are extracted according to the attribute identifier of the webpage text node, using an HTML parser.
4. The method of claim 1, wherein the node value of each key element node in the list is compared in sequence with the node values in the text node library using the RKR-GST algorithm or the string edit distance algorithm.
5. A system for extracting webpage text based on text node features, the system comprising:
an information acquisition module, configured to acquire the HTML source code of the webpage to be extracted;
a filtering module, configured to filter the HTML source code, extract the key element nodes in the HTML DOM tree of the HTML source code, and construct a list of the key element nodes;
a sorting module, configured to compare the node value of each key element node in the list with the node values in the text node library in sequence, obtain the probability that the node value of each key element node is a text node attribute value, and sort the key element nodes by that probability;
and an extraction module, configured to extract the text elements of the key element nodes in order of that probability to obtain a candidate webpage text, judge whether the candidate webpage text exceeds a preset threshold, and determine the candidate webpage text to be the webpage text when it does not exceed the preset threshold.
6. The system of claim 5, wherein the text elements are extracted according to the attribute identifier of the webpage text node, using an HTML parser.
7. The system of claim 5, wherein the node value of each key element node in the list is compared in sequence with the node values in the text node library using the RKR-GST algorithm or the string edit distance algorithm.
CN201910947241.XA 2019-09-30 2019-09-30 Method and system for extracting webpage text based on text node characteristics Pending CN110851679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910947241.XA CN110851679A (en) 2019-09-30 2019-09-30 Method and system for extracting webpage text based on text node characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910947241.XA CN110851679A (en) 2019-09-30 2019-09-30 Method and system for extracting webpage text based on text node characteristics

Publications (1)

Publication Number Publication Date
CN110851679A (en) 2020-02-28

Family

ID=69597455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910947241.XA Pending CN110851679A (en) 2019-09-30 2019-09-30 Method and system for extracting webpage text based on text node characteristics

Country Status (1)

Country Link
CN (1) CN110851679A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699642A (en) * 2020-12-31 2021-04-23 医渡云(北京)技术有限公司 Index extraction method and device for complex medical texts, medium and electronic equipment
CN112699642B (en) * 2020-12-31 2023-03-28 医渡云(北京)技术有限公司 Index extraction method and device for complex medical texts, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination