CN107391559B - General forum text extraction algorithm based on block, pattern recognition and line text - Google Patents
- Publication number
- CN107391559B
- Authority
- CN
- China
- Prior art keywords
- text
- forum
- matching
- extraction
- html
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Forum text extraction obtains the core content of a forum by analyzing the forum's html files. The extracted text is valuable for business decision-making, public opinion analysis and social research. The technique has two main steps: denoising the html text, and identifying and extracting the core content. Denoising removes useless segments from the html text, while the identification and extraction step varies widely with the method each author designs. The invention mainly addresses the identification and extraction of core content and provides a general forum extraction method based on blocks, pattern recognition and line text. Patterns update themselves through machine learning and the forum html file is divided into blocks, so that the core content of forum text is extracted more accurately. The method is general across forums built in different ways, which avoids the complexity of designing a separate extraction method for each forum.
Description
Technical Field
The invention relates to forum text extraction, in particular to universal forum text extraction.
Background
Pattern recognition: pattern recognition refers to the process of processing and analyzing various forms of information (numerical, textual, and logical) that characterize a thing or phenomenon to describe, recognize, classify, and interpret the thing or phenomenon.
Forum text extraction: forum text extraction removes the redundant parts of a web page and extracts only the core content of the forum, including the personal information of posters and repliers, the text content, and the time the content was published. Existing forum text extraction techniques can only extract from one specific kind of web page.
Minimum edit distance: the number of character changes needed to make two strings identical. It is computed by dynamic programming: the problem has optimal substructure, since the minimum edit distance of two strings contains the minimum edit distances of their prefixes. The recurrence uses the following notation:
d[i,j]: the distance between the string X[0, 1, 2, 3, ..., i] and the string Y[0, 1, 2, 3, ..., j]
x_i: the i-th element of the string X[0, 1, 2, 3, ..., i]
y_j: the j-th element of the string Y[0, 1, 2, 3, ..., j]
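The recurrence itself did not survive in this text. As a hedged illustration, the sketch below assumes the standard Levenshtein recurrence with unit costs for insertion, deletion and substitution: d[i,j] is the minimum of d[i-1,j] + 1, d[i,j-1] + 1, and d[i-1,j-1] plus 0 when the two current characters match or 1 when they differ. The function name `min_edit_distance` is hypothetical, and the patent may use different costs.

```python
def min_edit_distance(x, y):
    """Levenshtein distance between strings x and y (unit costs assumed)."""
    m, n = len(x), len(y)
    # d[i][j] = distance between the first i characters of x
    # and the first j characters of y.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[m][n]
```

For example, `min_edit_distance("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion).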
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a general forum extraction method based on blocks, pattern recognition and line text. A pattern library is generated by matching against the processed html text; the web page is then divided into blocks according to time, and the text is extracted from each block using the patterns. Existing patterns can also be combined into new patterns to reach high accuracy, which makes the algorithm both general and self-learning. In summary: the forum html file is divided into blocks using time and line text density; patterns generated by html text matching extract the text content of the blocks; and the patterns can update themselves during matching.
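The text does not spell out how time and line text density are combined. The sketch below is one possible reading, under loudly stated assumptions: a line containing a timestamp opens a new block, and "line text density" is approximated by the character count of a stripped line, with lines below a threshold dropped as noise. The names `TIME_RE`, `split_into_blocks` and `min_density` are hypothetical, and the timestamp shape is only one of many that real forums use.

```python
import re

# Assumed timestamp shape, e.g. "2017-06-08 12:30"; real forums vary.
TIME_RE = re.compile(
    r"\d{4}[-/]\d{1,2}[-/]\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)

def split_into_blocks(lines, min_density=1):
    """Group denoised lines into blocks, one block per timestamp.

    A line that contains a timestamp starts a new block. Lines whose
    stripped length is below `min_density` are dropped (an assumed
    stand-in for the line-text-density test). Lines that appear before
    the first timestamp are ignored.
    """
    blocks = []
    current = None
    for line in lines:
        text = line.strip()
        match = TIME_RE.search(text)
        if match:
            current = {"time": match.group(0), "lines": []}
            blocks.append(current)
        if current is not None and len(text) >= min_density:
            current["lines"].append(text)
    return blocks
```

Each returned block keeps its timestamp under `"time"`, so the date required later in the procedure is available without a second pass.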
The method overcomes the lack of generality, or very low generality, of traditional forum text extraction methods: because it is general, no separate method has to be designed for each forum. It also addresses the low processing speed of traditional text extraction based on the DOM tree. The generality of forum text extraction is improved by 10-30%, and performance is improved by about 20% compared with traditional text extraction.
Drawings
FIG. 1 is a flow chart of the method.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in fig. 1, the general forum extraction method based on block, pattern recognition and line text specifically includes the following steps:
Step 1: Remove all web page tags, script functions and blank lines using regular expressions.
Step 2: Repeatedly perform pattern matching against the provided target data to find the text immediately before and after each target field; the strings found before and after the field form the pattern of that target field.
Step 3: Save the pattern and the url of the target website to a file.
Step 4: For the input url, find the url in the pattern library with the smallest minimum edit distance to it, and retrieve the corresponding pattern.
Step 5: Divide the page into blocks according to time and line text density, extract the text of the forum web page according to the pattern, and find the dates.
Step 6: Check, in each block, whether the text content and the posting author are empty; if either of the two is empty, count the block as an error. Compute the overall error rate.
Step 7: If the error rate is above the user-supplied threshold, return to the pattern library, find the url whose Hamming distance is the next smallest after that of the previously used pattern, and return to Step 4. If the number of re-matches equals four, execute Step 8; if it is less than four, go to Step 9.
Step 8: Sort the error rates of the sub-patterns for the different target fields across the previous four patterns, and for each field select the sub-pattern with the lowest error rate to generate a new pattern, whose url is the input url. Return to Step 4.
Step 9: Output the extracted text information to a file.
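The first three steps might be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`denoise`, `derive_pattern`, `extract_field`, `save_library`), the fixed context `width`, and the JSON layout of the pattern library are all assumptions. A pattern here is simply the few characters before and after a known target value.

```python
import json
import re

def denoise(html):
    """Step 1: drop script/style bodies, then tags, then blank lines."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<[^>]+>", "\n", html)
    return [ln.strip() for ln in html.splitlines() if ln.strip()]

def derive_pattern(text, target, width=4):
    """Step 2: record the strings just before and after a known value.

    `width` (how many context characters to keep) is an assumption.
    Returns None when the target value does not occur in the text.
    """
    idx = text.find(target)
    if idx < 0:
        return None
    end = idx + len(target)
    return {"before": text[max(0, idx - width):idx],
            "after": text[end:end + width]}

def extract_field(text, pattern):
    """Apply a stored pattern: capture what lies between the two strings."""
    regex = re.escape(pattern["before"]) + r"(.*?)" + re.escape(pattern["after"])
    match = re.search(regex, text, re.S)
    return match.group(1) if match else ""

def save_library(library, path):
    """Step 3: persist {url: {field: pattern}} as JSON (assumed layout)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(library, fh, ensure_ascii=False, indent=2)
```

A pattern learned from one page can then be applied to another page with the same layout, for instance learning `{"before": " by ", "after": " on "}` from a post by one author and using it to pull the author name out of a different post.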
The generated pattern library can be stored in a file. The method can be implemented step by step, following the algorithm flow chart, in Python using the urllib, re, json, Levenshtein and http libraries.
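The pattern selection, error check and pattern merging (Steps 4, 6, 7 and 8) could be sketched as below. The text names the Levenshtein library; to keep the sketch self-contained, a small pure-Python edit distance is used instead. All function names, the `extract` callback, the block layout (`"text"` and `"author"` keys), the treatment of an empty block list as a full failure, and the data layout passed to `merge_patterns` are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def rank_patterns(library, url):
    """Steps 4 and 7: known urls ordered from closest to farthest."""
    return sorted(library, key=lambda known: edit_distance(known, url))

def error_rate(blocks):
    """Step 6: a block is an error if its text or its author is empty."""
    if not blocks:
        return 1.0  # assumption: nothing extracted counts as total failure
    bad = sum(1 for b in blocks if not b.get("text") or not b.get("author"))
    return bad / len(blocks)

def choose_pattern(library, url, extract, threshold, max_tries=4):
    """Try the closest patterns in turn and stop at the first acceptable one.

    Returns (known_url, blocks), or (None, None) once `max_tries`
    patterns have failed, which is where Step 8 would take over.
    """
    for known in rank_patterns(library, url)[:max_tries]:
        blocks = extract(library[known])
        if error_rate(blocks) <= threshold:
            return known, blocks
    return None, None

def merge_patterns(candidates):
    """Step 8: per field, keep the sub-pattern with the lowest error rate.

    `candidates` is a list of {field: (sub_pattern, field_error_rate)}.
    """
    best = {}
    for cand in candidates:
        for field, (sub, err) in cand.items():
            if field not in best or err < best[field][1]:
                best[field] = (sub, err)
    return {field: sub for field, (sub, _) in best.items()}
```

Capping the loop at four attempts mirrors the re-match limit in Step 7; the merged pattern from `merge_patterns` would then be stored in the library under the input url.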
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (4)
1. A general forum extraction method based on blocks, pattern recognition and line text, characterized in that: a pattern library is generated by matching against the processed html text, the web page is then divided into blocks according to time, and the text in the blocks is extracted using the patterns;
the method comprises the following specific steps:
Step 1: removing all web page tags, script functions and blank lines using regular expressions;
Step 2: repeatedly performing pattern matching against the provided target data to find the text immediately before and after each target field, wherein the strings found before and after the field form the pattern of that target field;
Step 3: saving the pattern and the url of the target website to a file;
Step 4: for the input url, finding the url in the pattern library with the smallest minimum edit distance to it, and retrieving the corresponding pattern;
Step 5: dividing the page into blocks according to time and line text density, extracting the text of the forum web page according to the pattern, and finding the dates;
Step 6: checking, in each block, whether the text content and the posting author are empty, counting the block as an error if either of the two is empty, and computing the overall error rate;
Step 7: if the error rate is above the user-supplied threshold, returning to the pattern library, finding the url whose Hamming distance is the next smallest after that of the previously used pattern, and returning to Step 4; if the number of re-matches equals four, executing Step 8, and if it is less than four, going to Step 9;
Step 8: sorting the error rates of the sub-patterns for the different target fields across the previous four patterns, and for each field selecting the sub-pattern with the lowest error rate to generate a new pattern, whose url is the input url, and returning to Step 4;
Step 9: outputting the extracted text information to a file.
2. The method of claim 1, wherein: the html file of the forum is divided into blocks using time and line text density.
3. The method of claim 1, wherein: the text content of the blocks is extracted using patterns generated by html text matching.
4. The method of claim 1, wherein: the patterns can update themselves during the matching process.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710427648.0A CN107391559B (en) | 2017-06-08 | 2017-06-08 | General forum text extraction algorithm based on block, pattern recognition and line text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710427648.0A CN107391559B (en) | 2017-06-08 | 2017-06-08 | General forum text extraction algorithm based on block, pattern recognition and line text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107391559A (en) | 2017-11-24 |
CN107391559B (en) | 2020-06-02 |
Family
ID=60333246
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710427648.0A Active CN107391559B (en) | 2017-06-08 | 2017-06-08 | General forum text extraction algorithm based on block, pattern recognition and line text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107391559B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110119484B (en) * | 2019-03-27 | 2021-04-06 | 湖南星汉数智科技有限公司 | Webpage release time extraction method and device, computer device and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567530A (en) * | 2011-12-31 | 2012-07-11 | 凤凰在线(北京)信息技术有限公司 | Intelligent extraction system and intelligent extraction method for article type web pages |
CN102810097A (en) * | 2011-06-02 | 2012-12-05 | 高德软件有限公司 | Method and device for extracting webpage text content |
CN103714176A (en) * | 2014-01-08 | 2014-04-09 | 同济大学 | Webpage text extraction method based on maximum text density |
CN105808545A (en) * | 2014-12-30 | 2016-07-27 | Tcl集团股份有限公司 | Forum data extraction method and forum data extraction apparatus |
US9448711B2 (en) * | 2005-05-23 | 2016-09-20 | Nokia Technologies Oy | Mobile communication terminal and associated methods |
Also Published As
Publication number | Publication date |
---|---|
CN107391559A (en) | 2017-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321432B (en) | Text event information extraction method, electronic device and nonvolatile storage medium | |
CN107229668B (en) | Text extraction method based on keyword matching | |
CN105630941B (en) | Web body matter abstracting methods based on statistics and structure of web page | |
WO2017167067A1 (en) | Method and device for webpage text classification, method and device for webpage text recognition | |
Lin et al. | Mathematical formula identification in PDF documents | |
CN104598577B (en) | A kind of extracting method of Web page text | |
CN102591612B (en) | General webpage text extraction method based on punctuation continuity and system thereof | |
WO2011072434A1 (en) | System and method for web content extraction | |
Singh et al. | OCR++: a robust framework for information extraction from scholarly articles | |
CN106407195B (en) | Method and system for web page duplication elimination | |
US11031003B2 (en) | Dynamic extraction of contextually-coherent text blocks | |
CN104517106A (en) | List recognition method and system | |
CN104199845B (en) | Line Evaluation based on agent model discusses sensibility classification method | |
CN104850617A (en) | Short text processing method and apparatus | |
Leonandya et al. | A semi-supervised algorithm for Indonesian named entity recognition | |
Naoum et al. | Article segmentation in digitised newspapers with a 2d markov model | |
CN111046649A (en) | Text segmentation method and device | |
CN107391559B (en) | General forum text extraction algorithm based on block, pattern recognition and line text | |
Baraka et al. | Arabic text author identification using support vector machines | |
CN102622405B (en) | Method for computing text distance between short texts based on language content unit number evaluation | |
CN113297844B (en) | Method for detecting repeatability data based on doc2vec model and minimum editing distance | |
CN112100368B (en) | Method and device for identifying dialogue interaction intention | |
CN109344254B (en) | Address information classification method and device | |
CN104685514A (en) | Character recognition apparatus, method and program | |
CN103942188A (en) | Method and device for identifying corpus languages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||