CN107391559B - General forum text extraction algorithm based on block, pattern recognition and line text

Info

Publication number: CN107391559B
Application number: CN201710427648.0A
Authority: CN
Inventors: 龙鑫; 武继刚; 杨哲; 左超
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2017-06-08
Filing date: 2017-06-08
Publication date: 2020-06-02
Anticipated expiration: 2037-06-08
Also published as: CN107391559A

Abstract

Forum text extraction obtains the core content of a forum by analyzing the forum's HTML files; the text extracted from that core content is of great value for business decisions, public opinion analysis, and social research. The technique has two important steps: denoising the HTML text, and identifying and extracting the core content. Denoising removes useless segments from the HTML text, whereas the identification and extraction of core content varies greatly with the method each author designs. The invention provides a general forum extraction method based on blocking, pattern recognition, and line text, aimed mainly at the identification and extraction of core content. Patterns update themselves through machine learning and the forum's HTML files are divided into blocks, so the core content of forum text is extracted more accurately. The method is general across forums implemented in different ways, which avoids the complexity of designing a separate extraction method for each forum.

Description

General forum text extraction algorithm based on block, pattern recognition and line text

Technical Field

The invention relates to forum text extraction, and in particular to general-purpose forum text extraction.

Background

Pattern recognition: pattern recognition refers to the process of processing and analyzing various forms of information (numerical, textual, and logical) that characterize a thing or phenomenon to describe, recognize, classify, and interpret the thing or phenomenon.

Forum text extraction: forum text extraction removes the redundant parts of a web page and extracts only the core content of the forum, including the personal information of posters and repliers, the text content, and the time the content was published. Existing forum text extraction techniques can only extract from one specific web page.

Minimum edit distance: the minimum number of character edits needed to make two strings identical. It is computed by dynamic programming, because the problem has optimal substructure: the minimum edit distance between two strings contains the minimum edit distances between their prefixes. The recurrence uses the following notation:

d[i, j]: the distance between the string X[0, 1, 2, 3, ..., i] and the string X[0, 1, 2, 3, ..., j]

x_i: the i-th element of the string X[0, 1, 2, 3, ..., i]

x_j: the j-th element of the string X[0, 1, 2, 3, ..., j]
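The recurrence itself does not survive in this text. The sketch below assumes it is the standard Levenshtein recurrence, which is consistent with the notation above; the function name `min_edit_distance` and the argument names `x` and `y` are hypothetical and chosen only to tell the two strings apart.

```python
def min_edit_distance(x, y):
    """Levenshtein distance between strings x and y (assumed recurrence).

    d[i][j] is the distance between the first i characters of x and the
    first j characters of y:
        d[i][0] = i,  d[0][j] = j
        d[i][j] = min(d[i-1][j] + 1,          # delete x_i
                      d[i][j-1] + 1,          # insert y_j
                      d[i-1][j-1] + cost)     # cost = 0 if x_i == y_j else 1
    """
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of x
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]
```

Each cell depends only on three previously computed cells, which is the optimal substructure mentioned above.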

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art by providing a general forum extraction method based on blocking, pattern recognition, and line text. A pattern library is generated by matching against the processed HTML text; the web page is then divided into blocks according to time, and the text within each block is extracted using the patterns. Existing patterns can also be combined to generate new patterns, which improves accuracy, so the algorithm is both general and self-learning. In summary: the forum's HTML files are divided into blocks using time and line text density; patterns are generated by HTML text matching and used to extract the text content of each block; and the patterns can update themselves during matching.
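Blocking by time can be illustrated with a minimal sketch. The text does not define line text density or fix a timestamp format, so both are assumptions here: density is taken to be the number of non-whitespace characters in a line, and timestamps are assumed to look like `2017-06-08 12:30`. The names `TIME_RE`, `split_into_blocks`, and `dense_lines` are hypothetical.

```python
import re

# Assumed timestamp shape, e.g. "2017-06-08" or "2017-06-08 12:30:00".
TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?")

def split_into_blocks(lines):
    """Start a new block at every line that contains a timestamp."""
    blocks, current = [], []
    for line in lines:
        if TIME_RE.search(line) and current:
            blocks.append(current)
            current = []
        current.append(line)
    if current:
        blocks.append(current)
    return blocks

def dense_lines(block, min_chars=10):
    """Keep lines whose density (assumed: non-whitespace characters) is high enough."""
    return [ln for ln in block if len(re.sub(r"\s", "", ln)) >= min_chars]
```

Under these assumptions, any lines that precede the first timestamp (navigation, page header) form a leading block of their own, and short lines such as button labels are filtered out inside each post block.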

The method overcomes the lack of generality, or the extremely low generality, of traditional forum text extraction methods: because it is general, different extraction methods need not be designed for different forums. It also addresses the slow processing speed of traditional DOM-tree-based text extraction. The generality of forum text extraction is improved by 10-30%, and performance is improved by about 20% compared with traditional text extraction.

Drawings

FIG. 1 is a flow chart of the method.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

As shown in FIG. 1, the general forum extraction method based on blocking, pattern recognition, and line text comprises the following steps:

The first step: remove all web page tags, script functions, and blank lines using regular expressions.

The second step: repeatedly perform pattern matching against the provided target data to find the text immediately before and after each target field; the character strings found before and after are the pattern of that target field.

The third step: save the pattern and the URL of the target website to a file.

The fourth step: given the input URL, find the URL in the pattern library with the smallest minimum edit distance to it, and retrieve the corresponding pattern.

The fifth step: divide the page into blocks according to time and line text density, extract the text of the forum web page according to the pattern, and find the dates.

The sixth step: check the text content and the publishing author in each block; if either of the two items is empty, the block counts as an error. Calculate the total error rate.

The seventh step: if the error rate is higher than the threshold entered by the user, return to the pattern library, find the URL whose Hamming distance is the next shortest after that of the last pattern, and return to the fourth step. If the number of re-matches equals four, execute the eighth step. If the number of re-matches is less than four, proceed to the ninth step.

The eighth step: rank the error rates of the sub-patterns for the different target fields across the first four patterns, and for each field select the sub-pattern with the lowest error rate to generate a new pattern, whose URL is the input URL. Return to the fourth step.

The ninth step: output the extracted text information to a file.
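Steps one to three can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: a pattern is modelled as the fixed-width strings before and after a known target value in the denoised text, and the library is a JSON object mapping each URL to its per-field patterns. The names `denoise`, `learn_pattern`, `apply_pattern`, `save_library`, and the `width` parameter are hypothetical.

```python
import json
import re

def denoise(html):
    """Step 1: drop script/style bodies, then tags, then blank lines."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    text = re.sub(r"(?s)<[^>]+>", "\n", html)
    return [line.strip() for line in text.splitlines() if line.strip()]

def learn_pattern(text, target, width=8):
    """Step 2: record the strings just before and after a known target value.

    `width` (context length in characters) is an assumption of this sketch.
    """
    pos = text.find(target)
    if pos < 0:
        return None
    end = pos + len(target)
    return {"before": text[max(0, pos - width):pos],
            "after": text[end:end + width]}

def apply_pattern(text, pattern):
    """Return every substring found between the learned before/after strings."""
    regex = re.escape(pattern["before"]) + r"(.*?)" + re.escape(pattern["after"])
    return re.findall(regex, text, flags=re.S)

def save_library(path, library):
    """Step 3: persist {url: {field: pattern}} as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(library, fh, ensure_ascii=False, indent=2)
```

A pattern learned from one known value (for example, one author name supplied as target data) can then be applied to the whole denoised page to pull out the same field from every post.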

The generated pattern library can be stored in a file. The method can be implemented step by step according to the algorithm flow chart in Python, using the urllib, re, json, Levenshtein, and http libraries.
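The selection, error-checking, and pattern-merging steps (four and six to eight) can be sketched as below. Several points are assumptions: a pure-Python edit distance stands in for the Levenshtein library so the example is self-contained, the same distance is used for re-matching even though the seventh step names the Hamming distance, and `extract` is a hypothetical callback that applies a pattern to the page and returns one dict per block. All function names are illustrative.

```python
def edit_distance(a, b):
    """Minimum edit distance between a and b, keeping two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rank_urls(library, url):
    """Step 4: library URLs ordered from closest to farthest from the input URL."""
    return sorted(library, key=lambda known: edit_distance(known, url))

def error_rate(blocks):
    """Step 6: a block is an error if its text or its author is empty."""
    if not blocks:
        return 1.0
    bad = sum(1 for b in blocks if not b.get("text") or not b.get("author"))
    return bad / len(blocks)

def extract_with_retries(library, url, extract, threshold, max_tries=4):
    """Steps 4-7: try the closest patterns in turn until the error rate is acceptable."""
    tried = []
    for known in rank_urls(library, url)[:max_tries]:
        blocks = extract(library[known])
        rate = error_rate(blocks)
        tried.append((rate, known))
        if rate <= threshold:
            return blocks, tried
    return None, tried  # caller may fall back to merge_best (step 8)

def merge_best(candidates):
    """Step 8: per target field, keep the sub-pattern with the lowest error rate.

    candidates: list of (pattern, field_errors), both dicts keyed by field name.
    """
    merged = {}
    for field in candidates[0][0]:
        best_pattern, _ = min(candidates, key=lambda c: c[1][field])
        merged[field] = best_pattern[field]
    return merged
```

The merged pattern would be stored in the library under the input URL, after which matching restarts from the fourth step.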

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A general forum extraction method based on blocking, pattern recognition, and line text, characterized in that: a pattern library is generated by matching against the processed HTML text, the web page is then divided into blocks according to time, and the text in the blocks is extracted using patterns;

the method comprises the following specific steps:

the first step: removing all web page tags, script functions, and blank lines using regular expressions;

the second step: repeatedly performing pattern matching against the provided target data to find the text immediately before and after each target field, wherein the character strings found before and after are the pattern of that target field;

the third step: saving the pattern and the URL of the target website to a file;

the fourth step: given the input URL, finding the URL in the pattern library with the smallest minimum edit distance to it, and retrieving the corresponding pattern;

the fifth step: dividing the page into blocks according to time and line text density, extracting the text of the forum web page according to the pattern, and finding the dates;

the sixth step: checking the text content and the publishing author in each block, wherein if either of the two items is empty the block counts as an error, and calculating the total error rate;

the seventh step: if the error rate is higher than the threshold entered by the user, returning to the pattern library, finding the URL whose Hamming distance is the next shortest after that of the last pattern, and returning to the fourth step; if the number of re-matches equals four, executing the eighth step, and if the number of re-matches is less than four, proceeding to the ninth step;

the eighth step: ranking the error rates of the sub-patterns for the different target fields across the first four patterns, selecting for each field the sub-pattern with the lowest error rate to generate a new pattern, whose URL is the input URL, and returning to the fourth step;

the ninth step: outputting the extracted text information to a file.

2. The method of claim 1, wherein the HTML file of the forum is divided into blocks using time and line text density.

3. The method of claim 1, wherein the text content of the blocks is extracted using patterns generated by HTML text matching.

4. The method of claim 1, wherein the patterns can update themselves during the matching process.