CN107391559B - General forum text extraction algorithm based on block, pattern recognition and line text - Google Patents

General forum text extraction algorithm based on block, pattern recognition and line text

Info

Publication number
CN107391559B
Authority
CN
China
Prior art keywords
text
forum
matching
extraction
html
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710427648.0A
Other languages
Chinese (zh)
Other versions
CN107391559A (en)
Inventor
龙鑫
武继刚
杨哲
左超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201710427648.0A
Publication of CN107391559A
Application granted
Publication of CN107391559B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Abstract

Forum text extraction obtains the core content of a forum by analyzing the forum's html files; the extracted text is of great value for business decisions, public opinion analysis and social surveys. The technique has two important steps: denoising the html text, and identifying and extracting the core content. Denoising removes useless segments from the html text, while the identification and extraction of core content varies greatly with the method each author designs. The invention provides a general forum extraction method based on blocks, pattern recognition and line text, aimed mainly at the identification and extraction of core content. Patterns update themselves through machine learning and the forum html file is partitioned into blocks, so that the core content of forum text is extracted more accurately. The method is general across forums implemented in different ways, which avoids the complexity of designing a separate extraction method for each forum.

Description

General forum text extraction algorithm based on block, pattern recognition and line text
Technical Field
The invention relates to forum text extraction, in particular to universal forum text extraction.
Background
Pattern recognition: pattern recognition refers to the process of processing and analyzing various forms of information (numerical, textual, and logical) that characterize a thing or phenomenon to describe, recognize, classify, and interpret the thing or phenomenon.
Forum text extraction: forum text extraction removes the redundant parts of a web page and keeps only the core content of the forum, namely the personal information of the posters and repliers, the text content, and the time the content was published. Existing forum text extraction techniques each work only for one specific web page.
Minimum edit distance: the number of characters that must be changed to make two strings identical. It is computed by dynamic programming: the problem has optimal substructure, in that a minimum edit distance contains the minimum edit distances of its subproblems, which gives the following formula:
[Formula image: Figure DEST_PATH_GDA0001434842940000011, not reproduced in this text]
d[i,j]: the distance between the character string X[0, 1, 2, 3, ..., i] and the character string X[0, 1, 2, 3, ..., j]
xi: the i-th element of the character string X[0, 1, 2, 3, ..., i]
xj: the j-th element of the character string X[0, 1, 2, 3, ..., j]
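The patent's own formula is not available as text, so the sketch below uses the standard unit-cost recurrence for edit distance, d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost), where cost is 0 if the two characters match and 1 otherwise. That cost model and the function name `min_edit_distance` are assumptions for illustration, not taken from the patent.

```python
def min_edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) by dynamic programming."""
    m, n = len(x), len(y)
    # d[i][j] holds the distance between the first i characters of x
    # and the first j characters of y.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[m][n]
```

Each cell depends only on three previously computed neighbours, which is the optimal substructure the text refers to.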
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a general forum extraction method based on blocks, pattern recognition and line text. A pattern library is generated by matching against the processed html text; the web page is then partitioned into blocks by time, and the text within each block is extracted using the patterns. Existing patterns can also be combined into new patterns to reach high accuracy, so the algorithm is both general and self-learning. In summary, the forum's html file is partitioned by time and line text density, the text content of each block is extracted using patterns generated by html text matching, and the patterns can update themselves during matching.
The method overcomes the lack of generality, or very low generality, of traditional forum text extraction methods: because it is general, no separate method has to be designed for each forum. It also addresses the low processing speed of traditional extraction based on the DOM tree. The generality of forum text extraction is improved by 10-30%, and performance is improved by about 20% compared with traditional text extraction.
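One way to picture pattern generation and pattern-based extraction is the minimal sketch below. It assumes a pattern is simply the few characters found before and after a known target value in the denoised text; the function names (`denoise`, `derive_pattern`, `extract`), the context width of 8 characters and the regular expressions are illustrative choices, not details given in the patent.

```python
import re

def denoise(html: str) -> str:
    """Remove script/style bodies, all remaining tags, and blank lines."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<[^>]+>", "", html)
    lines = [line.strip() for line in html.splitlines()]
    return "\n".join(line for line in lines if line)

def derive_pattern(text: str, target: str, width: int = 8):
    """Return (prefix, suffix): the text just before and after a known target value."""
    i = text.find(target)
    if i < 0:
        return None  # the target value does not occur in this page
    end = i + len(target)
    return text[max(0, i - width):i], text[end:end + width]

def extract(text: str, pattern) -> list:
    """Return every string that sits between the pattern's prefix and suffix."""
    prefix, suffix = pattern
    regex = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    return re.findall(regex, text, flags=re.S)

# Usage: learn a pattern from one page with a known author, reuse it on another.
page = "<div><script>var a=1;</script>\n<p>Author: alice</p>\n<p>Posted: 2017-06-08</p>\n</div>"
author_pattern = derive_pattern(denoise(page), "alice")
other_authors = extract("Author: bob\nPosted: 2017-06-09\nHi", author_pattern)
```

Because the pattern is plain surrounding text rather than a DOM path, it can be applied line by line without building a tree, which fits the speed argument made for the method.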
Drawings
FIG. 1 is a flow chart of the method.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in FIG. 1, the general forum extraction method based on blocks, pattern recognition and line text comprises the following steps:
The first step: remove all web page tags, script functions and blank lines using regular expressions.
The second step: repeatedly perform pattern matching against the provided target data to find the text immediately before and after each target field; the strings found before and after are the pattern of that target field.
The third step: save the pattern and the url of the target website to a file.
The fourth step: for the input url, find the url in the pattern library with the smallest minimum edit distance to it, and retrieve its pattern.
The fifth step: partition the page into blocks by time and line text density, extract the text of the forum web page according to the pattern, and find the dates.
The sixth step: check the text content and the post author in each block; if either of the two is empty, the block counts as an error. Compute the total error rate.
The seventh step: if the error rate is above the threshold entered by the user, return to the pattern library, find the url whose Hamming distance ranks next after that of the previous pattern, and return to the fourth step. If the number of re-matches equals four, execute the eighth step. If the number of re-matches is less than four, go to the ninth step.
The eighth step: rank the error rates of the sub-patterns for each target field across the previous four patterns, select for each field the sub-pattern with the lowest error rate, and combine them into a new pattern whose url is the input url. Return to the fourth step.
The ninth step: output the extracted text information to a file.
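The blocking and error-rate steps can be sketched as follows. The patent does not define line text density or the date format, so this sketch assumes that a line containing a date such as 2017-06-08 opens a new block and that density is the character count of a stripped line; `DATE_RE`, `split_blocks`, `error_rate` and the `min_density` threshold are hypothetical names and values.

```python
import re

# Assumed date format; real forums use many others.
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")

def split_blocks(lines, min_density=5):
    """Open a new block at each line containing a date; keep only dense lines as body text."""
    blocks = []
    current = None
    for line in lines:
        match = DATE_RE.search(line)
        if match:
            current = {"date": match.group(0), "lines": []}
            blocks.append(current)
        elif current is not None and len(line.strip()) >= min_density:
            current["lines"].append(line.strip())
        # Lines before the first date, and sparse lines, are dropped as noise.
    return blocks

def error_rate(posts):
    """Fraction of posts whose author or content is empty (sixth step)."""
    if not posts:
        return 1.0  # nothing extracted is treated as complete failure
    errors = sum(1 for p in posts if not p.get("author") or not p.get("content"))
    return errors / len(posts)
```

A caller would compare `error_rate(...)` with the user-supplied threshold to decide whether to retry with another pattern from the library.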
The generated pattern library can be stored in a file. The method can be implemented step by step, following the algorithm flow chart, in Python with the urllib, re, json, Levenshtein and http libraries.
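As a rough illustration of the pattern library, the sketch below stores patterns in a JSON file keyed by url, ranks stored urls by edit distance to an input url, and builds a new pattern from the best sub-pattern per field. The file layout (`{url: {field: [prefix, suffix]}}`) and all function names are assumptions; a small pure-Python edit distance stands in for the third-party Levenshtein package so that the block runs on the standard library alone.

```python
import json

def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def save_library(library: dict, path: str) -> None:
    """Write the pattern library to a JSON file (third step)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(library, f, ensure_ascii=False, indent=2)

def load_library(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def ranked_urls(library: dict, url: str) -> list:
    """Stored urls ordered from closest to farthest from the input url (fourth step)."""
    return sorted(library, key=lambda known: edit_distance(url, known))

def merge_patterns(candidates: list) -> dict:
    """Pick, per field, the sub-pattern with the lowest error rate (eighth step).

    candidates: list of (pattern, errors) pairs, where pattern maps
    field -> [prefix, suffix] and errors maps field -> error rate.
    """
    merged = {}
    for field in candidates[0][0]:
        best_pattern, _ = min(candidates, key=lambda c: c[1][field])
        merged[field] = best_pattern[field]
    return merged
```

On a retry, a caller would move to the next entry of `ranked_urls(...)`; after four attempts it would call `merge_patterns` on the four tried patterns and store the result under the input url.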
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (4)

1. A general forum extraction method based on blocks, pattern recognition and line text, characterized in that: a pattern library is generated by matching against the processed html text, the web page is then partitioned into blocks by time, and the text within the blocks is extracted using patterns;
the method comprises the following specific steps:
the first step: removing all web page tags, script functions and blank lines using regular expressions;
the second step: repeatedly performing pattern matching against the provided target data to find the text before and after each target field, the strings found before and after being the pattern of that target field;
the third step: saving the pattern and the url of the target website to a file;
the fourth step: for the input url, finding the url in the pattern library with the smallest minimum edit distance to it, and retrieving its pattern;
the fifth step: partitioning into blocks by time and line text density, extracting the text of the forum web page according to the pattern, and finding the dates;
the sixth step: checking the text content and the post author in each block, counting the block as an error if either of the two is empty, and computing the total error rate;
the seventh step: if the error rate is above the threshold entered by the user, returning to the pattern library, finding the url whose Hamming distance ranks next after that of the previous pattern, and returning to the fourth step; if the number of re-matches equals four, executing the eighth step, and if the number of re-matches is less than four, going to the ninth step;
the eighth step: ranking the error rates of the sub-patterns for each target field across the previous four patterns, selecting for each field the sub-pattern with the lowest error rate to generate a new pattern whose url is the input url, and returning to the fourth step;
the ninth step: outputting the extracted text information to a file.
2. The method of claim 1, wherein: the html file of the forum is partitioned into blocks by time and line text density.
3. The method of claim 1, wherein: the text content of the blocks is extracted using patterns generated by html text matching.
4. The method of claim 1, wherein: the patterns can update themselves during the matching process.
CN201710427648.0A 2017-06-08 2017-06-08 General forum text extraction algorithm based on block, pattern recognition and line text Active CN107391559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710427648.0A CN107391559B (en) 2017-06-08 2017-06-08 General forum text extraction algorithm based on block, pattern recognition and line text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710427648.0A CN107391559B (en) 2017-06-08 2017-06-08 General forum text extraction algorithm based on block, pattern recognition and line text

Publications (2)

Publication Number Publication Date
CN107391559A (en) 2017-11-24
CN107391559B (en) 2020-06-02

Family

ID=60333246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710427648.0A Active CN107391559B (en) 2017-06-08 2017-06-08 General forum text extraction algorithm based on block, pattern recognition and line text

Country Status (1)

Country Link
CN (1) CN107391559B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119484B (en) * 2019-03-27 2021-04-06 湖南星汉数智科技有限公司 Webpage release time extraction method and device, computer device and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567530A (en) * 2011-12-31 2012-07-11 凤凰在线(北京)信息技术有限公司 Intelligent extraction system and intelligent extraction method for article type web pages
CN102810097A (en) * 2011-06-02 2012-12-05 高德软件有限公司 Method and device for extracting webpage text content
CN103714176A (en) * 2014-01-08 2014-04-09 同济大学 Webpage text extraction method based on maximum text density
CN105808545A (en) * 2014-12-30 2016-07-27 Tcl集团股份有限公司 Forum data extraction method and forum data extraction apparatus
US9448711B2 (en) * 2005-05-23 2016-09-20 Nokia Technologies Oy Mobile communication terminal and associated methods

Also Published As

Publication number Publication date
CN107391559A (en) 2017-11-24

Similar Documents

Publication Publication Date Title
CN107229668B (en) Text extraction method based on keyword matching
CN105630941B (en) Web body matter abstracting methods based on statistics and structure of web page
Lin et al. Mathematical formula identification in PDF documents
CN102831184B (en) According to the method and system text description of social event being predicted to social affection
CN104598577B (en) A kind of extracting method of Web page text
CN102591612B (en) General webpage text extraction method based on punctuation continuity and system thereof
WO2011072434A1 (en) System and method for web content extraction
Singh et al. OCR++: a robust framework for information extraction from scholarly articles
US11031003B2 (en) Dynamic extraction of contextually-coherent text blocks
CN104517106A (en) List recognition method and system
CN104199845B (en) Line Evaluation based on agent model discusses sensibility classification method
CN106407195B (en) Method and system for web page duplication elimination
CN104850617A (en) Short text processing method and apparatus
US11630956B2 (en) Extracting data from documents using multiple deep learning models
Naoum et al. Article segmentation in digitised newspapers with a 2d markov model
Leonandya et al. A semi-supervised algorithm for Indonesian named entity recognition
CN107391559B (en) General forum text extraction algorithm based on block, pattern recognition and line text
CN109472020A (en) A kind of feature alignment Chinese word cutting method
CN111680146A (en) Method and device for determining new words, electronic equipment and readable storage medium
CN102622405B (en) Method for computing text distance between short texts based on language content unit number evaluation
Eldirdiery et al. Detecting and removing noisy data on web document using text density approach
CN113297844B (en) Method for detecting repeatability data based on doc2vec model and minimum editing distance
CN109344254B (en) Address information classification method and device
CN104685514A (en) Character recognition apparatus, method and program
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant