CN110020312B

CN110020312B - Method and device for extracting webpage text

Info

Publication number: CN110020312B
Application number: CN201711306108.3A
Authority: CN
Inventors: 贾宝玉; 李�杰; 周旭
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2017-12-11
Filing date: 2017-12-11
Publication date: 2022-09-06
Anticipated expiration: 2037-12-11
Also published as: CN110020312A

Abstract

The invention discloses a method and a device for extracting a webpage text, and relates to the technical field of computers. One embodiment of the method comprises: constructing an access model according to a webpage to be extracted; calculating similarity values of each unit region of the main body part and the characteristic part; screening a unit text area from the access model according to the similarity value and the first index value of each unit area; and determining the beginning and the end of the text of the webpage to be extracted according to the unit text area so as to obtain the complete text of the webpage to be extracted. The method and the device can accurately and completely extract the webpage text, reduce the labor cost and improve the efficiency of extracting the webpage text.

Description

Method and device for extracting webpage text

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for extracting webpage texts.

Background

With the rapid development of society, the internet gradually becomes a main platform for information publishing and acquisition, and data on the internet is increased in geometric progression all the time. Internet data has covered various fields of the real world such as economy, politics, culture and the like, and constitutes an important information source for many applications. However, the contents of the web page include contents irrelevant to the text, such as copyright information, advertisements, navigation bars, decoration information, etc., in addition to the text required by people, and are called noise information. How to shield noise information and extract texts from web pages has become a hot spot of current research.

The current methods for extracting the text of the webpage have the following three categories: firstly, a webpage text extracting method based on a template; secondly, extracting a text based on the density of the block text; and thirdly, segmenting and extracting the text based on the visual webpage. In the method for extracting the webpage text based on the template, the template information needs to be manually maintained, and then the text content is extracted according to the template information; in the method for extracting the text based on the block text density, a line block distribution function is obtained according to the inline text ratio of each line, and then line blocks with high text ratio exceeding a threshold value are calculated, so that the text content is determined; in the method for extracting the text based on the visual webpage segmentation, firstly, the webpage is segmented into a plurality of page blocks according to visual information, and then the page blocks are combined by using separation lines in HTML labels, so that the text of the webpage is obtained.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art: firstly, extracting a webpage text based on a template, needing manual participation, having large workload, and needing to reconfigure the template when the webpage structure changes; secondly, extracting the text based on the block text density, so that the beginning and the end of the text are difficult to determine, and the integrity rate is not high; thirdly, engines such as javascript are needed in the method for extracting the text based on the visual webpage segmentation, so that the complexity is high and time is consumed; fourth, none of the prior art methods is suitable for all types of web page text extraction.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for extracting a web page text, which can accurately and completely extract the web page text, reduce labor cost, and improve efficiency of extracting the web page text.

To achieve the above object, according to an aspect of the embodiments of the present invention, a method for extracting a text of a web page is provided.

The method for extracting the webpage text comprises the following steps: constructing an access model according to a webpage to be extracted, wherein the access model comprises a feature portion and a body portion; calculating a similarity value between each unit region of the body portion and the feature portion; screening a unit text area from the access model according to the similarity value and the first index value of each unit area; and determining the beginning and the end of the text of the webpage to be extracted according to the unit text area so as to obtain the complete text of the webpage to be extracted.

Optionally, before constructing the access model according to the web page to be extracted, the method further includes: and carrying out standardization processing on the source code of the webpage to be extracted.

Optionally, the calculating the similarity value between each unit region of the body portion and the feature portion includes: calculating a second index value of the feature portion and a second index value of each unit area of the body portion; and calculating a similarity value between the feature portion and each unit region by using a second index value of the feature portion and a second index value of each unit region.

Optionally, the screening of the unit text area from the access model according to the similarity value and the first index value of each unit area includes: according to the first index value, selecting a suspected text area from the access model; and screening the unit text area from the suspected text area by using the similarity value.

Optionally, the screening of the unit text area from the suspected text area by using the similarity value includes: comparing the similarity values of the unit areas in the suspected text area, and selecting the unit area with the maximum similarity value as the unit text area.

Optionally, determining the beginning and the end of the text of the webpage to be extracted according to the unit text area includes: performing iterative traversal of the unit areas upward and downward with the unit text area as the center, judging whether each unit area meets a preset text condition, and stopping the iteration when a unit area does not meet the preset text condition, so as to determine the beginning and the end of the text of the webpage to be extracted.

Optionally, the determining whether each unit area meets the preset text condition includes: judging whether the similarity value of each unit area is greater than a preset similarity threshold value or not, and if so, determining that the unit area meets a preset text condition; and/or judging whether the link ratio of each unit area is smaller than a preset link ratio threshold value, and if so, determining that the unit area meets a preset text condition; and/or judging whether the symbol ratio of each unit area is greater than a preset symbol ratio threshold value, and if so, determining that the unit area meets a preset text condition.

Optionally, after determining the beginning and the end of the text of the web page to be extracted according to the unit text area, the method further includes: acquiring text additional information of the webpage to be extracted, wherein the text additional information comprises at least one of the following information: text title, author, date and source.

Optionally, the access model is a text object model.

Optionally, the unit areas are in units of rows.

Optionally, the first index value is used to represent attribute information of each unit region, and includes: unit density of each unit area.

Optionally, the second index value is used to represent attribute information of an area in the web page, and includes: a characteristic vector value.

To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided an apparatus for extracting text of a web page.

The device for extracting the webpage text of the embodiment of the invention comprises the following components: the construction module is used for constructing an access model according to the webpage to be extracted, and the access model comprises: a feature portion and a body portion; a calculation module for calculating a similarity value between each unit region of the main body portion and the feature portion; the screening module is used for screening the unit text area from the access model according to the similarity value and the first index value of each unit area; and the determining module is used for determining the beginning and the end of the text of the webpage to be extracted according to the unit text area so as to obtain the complete text of the webpage to be extracted.

Optionally, the building module is further configured to: before an access model is built according to a webpage to be extracted, the source code of the webpage to be extracted is subjected to standardization processing.

Optionally, the computing module is further configured to: calculating a second index value of the feature portion and a second index value of each unit area of the body portion; and calculating a similarity value between the feature portion and each unit region by using a second index value of the feature portion and a second index value of each unit region.

Optionally, the screening module is further configured to: selecting a suspected text area from the access model according to the first index value; and screening the unit text area from the suspected text area by using the similarity value.

Optionally, the screening module is further configured to: and comparing the similarity value of each unit area in the suspected text area, and selecting the unit area with the maximum similarity value as a unit text area.

Optionally, the determining module is further configured to: and performing iterative traversal of the upward and downward unit areas by taking the unit text area as a center, judging whether each unit area meets a preset text condition, and stopping iteration if the unit area does not meet the preset text condition, so as to determine the beginning and the end of the text of the webpage to be extracted.

Optionally, the determining module is further configured to: judging whether the similarity value of each unit area is greater than a preset similarity threshold value or not, and if so, determining that the unit area meets a preset text condition; and/or judging whether the link ratio of each unit area is smaller than a preset link ratio threshold value, and if so, determining that the unit area meets a preset text condition; and/or judging whether the symbol ratio of each unit area is greater than a preset symbol ratio threshold value, and if so, determining that the unit area meets a preset text condition.

Optionally, the determining module is further configured to: acquiring text additional information of the webpage to be extracted, wherein the text additional information comprises at least one of the following information: text title, author, date and source.

Optionally, the access model is a text object model.

Optionally, the unit areas are in units of rows.

Optionally, the second index value is used to represent attribute information of a certain area in the web page, and includes: a characteristic vector value.

To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided an electronic device.

An electronic device according to an embodiment of the present invention includes: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors implement the method for extracting the body of the webpage in the embodiment of the invention.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided a computer-readable medium.

A computer readable medium of an embodiment of the present invention stores thereon a computer program, and when the computer program is executed by a processor, the computer program implements the method for extracting a text of a web page of an embodiment of the present invention.

One embodiment of the above invention has the following advantages or benefits: the beginning and the end of the webpage text can be determined, so that the complete text of the webpage can be intelligently extracted, the labor cost is reduced, and the efficiency of extracting the webpage text is improved; the source code of the webpage to be extracted is standardized, so that an access model can be constructed from the standardized source code, the time for extracting the text of the webpage is reduced, and the method provided by the embodiment of the invention is suitable for text extraction from various types of webpages; the second index value of the feature portion and the second index value of each unit area of the body portion are calculated, so that the similarity value between the feature portion and each unit area can be conveniently calculated from the second index values; the suspected text area is selected by the first index value of each unit area, so that the text selection range is narrowed and the extraction efficiency of the webpage text is improved; by comparing the similarity values of the unit areas in the suspected text area, the unit area with the maximum similarity value can be used as the unit text area, so that the accuracy of text extraction is improved; iterative traversal of the unit areas upward and downward is carried out with the unit text area as the center, so that the beginning and the end of the text can be determined and the complete text of the webpage can be extracted; whether each unit area meets the preset text condition is judged from several angles, such as the similarity value, the link ratio and/or the symbol ratio, so that the accuracy of text extraction can be further improved; the text additional information of the webpage to be extracted is acquired, so that the integrity of the text is improved; the first index value may include the unit density of each unit area, so that the suspected text area can be selected by using the unit density as attribute information; and the second index value may include a feature vector value, so that the similarity value can be calculated by means of the feature vector value.

Further effects of the above optional implementations are described below in connection with the specific embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a diagram illustrating the main steps of a method for extracting the text of a web page according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a main flow of a method for extracting a text of a web page according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a normalized source code and its corresponding dom tree;

FIG. 4 is a schematic diagram of the main steps of calculating the similarity value between each line of text and the feature information according to the method for extracting the body of the web page in the embodiment of the invention;

FIG. 5 is a diagram illustrating the main steps of filtering out text lines in the method for extracting the text of a web page according to the embodiment of the invention;

FIG. 6 is a diagram illustrating an obtained line density function of a method for extracting text of a web page according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating the main steps of determining the beginning and end of the body of a web page according to the method for extracting the body of a web page of the present invention;

FIG. 8 is a diagram illustrating the main modules of an apparatus for extracting text from a web page according to an embodiment of the present invention;

FIG. 9 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 10 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The method for intelligently extracting the webpage text is designed based on the characteristics of current webpage text and on the advantages and disadvantages of the prior art; it can extract the webpage text accurately and completely, reduce labor cost, and improve the efficiency of extracting the webpage text. The characteristics of webpage text may include: the sentences of the text are long; the text contains many sentences; the title and the text are correlated to some degree; the text is located in the middle of the webpage; the proportion of hyperlinks in the text is small; and the text contains more punctuation marks than other modules of the page.

Fig. 1 is a schematic diagram of main steps of a method for extracting a text of a web page according to an embodiment of the present invention, and as shown in fig. 1, the method for extracting a text of a web page according to an embodiment of the present invention mainly includes the following steps:

Step S101: constructing an access model according to the webpage to be extracted. In the present invention, the access model may include a feature portion and a body portion. The feature portion may store feature information of the web page, such as the title, keywords, and summary. The body portion may store the text information of the web page.

Step S102: the similarity value between each unit region of the main body portion and the feature portion is calculated. In the embodiment of the present invention, the main body portion of the access model may be divided into a plurality of unit regions, and then the similarity value between each unit region and the feature portion may be calculated.

Step S103: screening a unit text area from the access model according to the similarity value and the first index value of each unit area. In the embodiment of the present invention, the similarity value between each unit region and the feature portion is obtained in step S102, and this value is combined with the first index value of each unit region to determine whether the unit region is a unit text region.

Step S104: and determining the beginning and the end of the text of the webpage to be extracted according to the unit text area so as to obtain the complete text of the webpage to be extracted.

In the embodiment of the present invention, before constructing the access model according to the webpage to be extracted, the method for extracting the text of the webpage may further include: standardizing the source code of the webpage to be extracted. In the embodiment of the present invention, the standardization includes removing script languages and converting special characters. To improve the user's visual experience, a large amount of JS (JavaScript, a scripting language for the web that adds dynamic behavior to a webpage and gives the user a smoother, more attractive browsing experience) and CSS (Cascading Style Sheets, a language that describes the style of a document, used to style a webpage statically and, in cooperation with scripting languages, to format its elements dynamically) may be embedded in the webpage source code. These scripts and styles only decorate the webpage, are irrelevant to its content, and greatly interfere with text extraction, so they can be removed. In addition, to ease subsequent processing, special characters in the source code can be converted into their conventional forms, such as converting &lt; to <, &gt; to >, and the like.
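The following is a minimal, non-limiting sketch of the standardization described above. It assumes Python's standard library in place of the Jsoup package named later in the description, and the helper name `normalize_source` is hypothetical.

```python
import html
import re

# Matches a <script> or <style> element together with its content.
# The backreference \1 makes the closing tag agree with the opening one.
_SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def normalize_source(source: str) -> str:
    """Remove script/style blocks, then decode character entities.

    Hypothetical helper: the description only states that script languages
    are removed and special characters (e.g. &lt;, &gt;) are converted.
    """
    without_scripts = _SCRIPT_OR_STYLE.sub("", source)
    # html.unescape turns &lt; into <, &gt; into >, &amp; into &, and so on.
    return html.unescape(without_scripts)
```

Decoding entities before parsing can introduce stray `<` characters into the markup, so an implementation may prefer to decode only after the text has been separated from the tags.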

In an embodiment of the present invention, calculating the similarity value between each unit region of the body portion and the feature portion may include: calculating a second index value of the feature portion and a second index value of each unit area of the body portion; and calculating the similarity value between the feature portion and each unit area by using the second index value of the feature portion and the second index value of each unit area. In the embodiment of the present invention, the feature portion may store feature information of the web page, for example, a title, keywords, and an abstract; therefore, the second index value of the feature portion may be generated from the feature information and used as a second index model value. Then, the similarity value between the feature portion and each unit region is calculated using the second index model value and the second index value of each unit region. In the embodiment of the present invention, the cosine of the angle between the second index model value and the second index value of each unit region may be calculated by the cosine formula and used as the similarity value between the feature portion and that unit region; the closer the cosine is to 1, the higher the similarity. Of course, in the embodiment of the present invention, the similarity values between the feature portion and the unit regions may also be obtained through other algorithms, which is not limited herein.
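As a non-limiting illustration of the cosine calculation, the sketch below builds term-frequency vectors and compares them. The description does not specify how the feature vector is built, so the whitespace tokenization is an assumption (Chinese text would need a word segmenter), and the function names are hypothetical.

```python
import math
from collections import Counter


def feature_vector(text: str) -> Counter:
    # Assumption: term frequencies over lowercase, whitespace-separated tokens.
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Here the vector built from the title, keywords and abstract would play the role of the second index model value, and each line of the body portion would be compared against it.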

In an embodiment of the present invention, the screening a unit text region from the access model according to the similarity value and the first index value of each unit region may include: selecting a suspected text area from the access model according to the first index value; and screening the unit text area from the suspected text area by using the similarity value. The number of the suspected text areas can be one or more, and the unit text area can be one or more unit areas.

In the embodiment of the present invention, the screening of the unit text region from the suspected text region by using the similarity value may include: and comparing the similarity value of each unit area in the suspected text area, and selecting the unit area with the maximum similarity value as a unit text area. If there are multiple unit areas with the largest similarity value in the suspected text areas, the multiple unit areas with the largest similarity value may all be unit text areas, or any one of the multiple unit areas with the largest similarity value may be selected as a unit text area, or may be selected by other methods, which is not limited by the present invention.
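The two-stage screening can be sketched as follows. The rule used here to pick suspected lines (density above the mean density) is purely an assumption made for illustration, since the description leaves the selection rule open; ties on the maximum similarity are all kept, which is one of the options mentioned above. All names are hypothetical.

```python
from typing import List


def suspected_lines(densities: List[float]) -> List[int]:
    # Assumption: a line is "suspected text" when its density exceeds the mean.
    if not densities:
        return []
    mean = sum(densities) / len(densities)
    return [i for i, d in enumerate(densities) if d > mean]


def pick_text_lines(densities: List[float], similarities: List[float]) -> List[int]:
    """Return the indices of the suspected lines with the maximum similarity."""
    candidates = suspected_lines(densities)
    if not candidates:
        return []
    best = max(similarities[i] for i in candidates)
    # Every candidate that ties for the maximum is kept as a unit text area.
    return [i for i in candidates if similarities[i] == best]
```

Note that a line with a high similarity value but a low density (for example a repeated title in a navigation bar) is excluded in the first stage and never competes in the second.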

In the embodiment of the present invention, determining the beginning and the end of the text of the web page to be extracted according to the unit text area may include: performing iterative traversal of the unit areas upward and downward with the unit text area as the center, judging whether each unit area meets a preset text condition, and stopping the iteration when a unit area does not meet the preset text condition, so as to determine the beginning and the end of the text of the webpage to be extracted. After the unit text area is screened out in step S103, the unit areas above it are traversed iteratively, starting from the unit text area. First, it is judged whether the unit area immediately above meets the preset text condition. If it does, that unit area belongs to the webpage text and the upward iteration continues; if it does not, that unit area does not belong to the webpage text, and the beginning of the webpage text is thereby determined. Similarly, the same method can be used to traverse the unit areas downward from the unit text area and determine the end of the webpage text.
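The upward and downward traversal can be sketched as below. The function name and the `is_text` callback (standing in for the preset text condition) are hypothetical; the sketch assumes unit areas are lines indexed from 0.

```python
from typing import Callable, Tuple


def expand_text_range(
    center: int,
    line_count: int,
    is_text: Callable[[int], bool],
) -> Tuple[int, int]:
    """Grow a [start, end] range of line indices around the unit text line.

    Iteration in each direction stops at the first line that fails the
    preset text condition, which marks the beginning or the end of the text.
    """
    start = center
    while start - 1 >= 0 and is_text(start - 1):
        start -= 1  # the line above still belongs to the text
    end = center
    while end + 1 < line_count and is_text(end + 1):
        end += 1  # the line below still belongs to the text
    return start, end
```

For example, with per-line flags `[False, True, True, True, False, True]` and the unit text line at index 2, the range returned is `(1, 3)`: the isolated line at index 5 is not reached because the line at index 4 stops the downward iteration.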

In the embodiment of the present invention, determining whether each unit area meets the preset text condition may include: judging whether the similarity value of each unit area is greater than a preset similarity threshold value, and if so, determining that the unit area meets the preset text condition; and/or judging whether the link ratio of each unit area is smaller than a preset link ratio threshold value, and if so, determining that the unit area meets the preset text condition; and/or judging whether the symbol ratio of each unit area is greater than a preset symbol ratio threshold value, and if so, determining that the unit area meets the preset text condition. The preset similarity threshold may be obtained as the arithmetic mean of the similarity values of the unit regions, or may be calculated by other methods. The link ratio may be the ratio of the number of links to the number of characters, and the symbol ratio may be the ratio of the number of symbols to the number of characters.
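A non-limiting sketch of the preset text condition follows. The description allows the three checks to be combined with "and/or"; requiring all three at once is an assumption of this sketch, as are the function names and the treatment of empty lines.

```python
from typing import Sequence


def mean_similarity_threshold(similarities: Sequence[float]) -> float:
    # The description suggests the arithmetic mean as one possible threshold.
    return sum(similarities) / len(similarities)


def meets_text_condition(
    similarity: float,
    link_count: int,
    symbol_count: int,
    char_count: int,
    *,
    sim_threshold: float,
    link_ratio_threshold: float,
    symbol_ratio_threshold: float,
) -> bool:
    """Assumption: a line is text only if all three checks pass."""
    if char_count == 0:
        return False  # an empty line cannot be classified as text
    link_ratio = link_count / char_count      # links per character
    symbol_ratio = symbol_count / char_count  # symbols per character
    return (
        similarity > sim_threshold
        and link_ratio < link_ratio_threshold
        and symbol_ratio > symbol_ratio_threshold
    )
```

Replacing `and` with `or`, or dropping one of the checks, gives the other combinations permitted by the description.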

In the embodiment of the present invention, after determining the beginning and the end of the text of the webpage to be extracted according to the unit text area, the method for extracting the text of the webpage may further include: acquiring the text additional information of the webpage to be extracted. The text additional information may include at least one of the following: text title, author, date and source. In the embodiment of the invention, the text title can be located in the body portion by means of the feature information of the feature portion. After the text title and the text position are determined, information such as author, date and source can be extracted with regular expressions (a concept from computer science, typically used to search for and replace text that matches a given pattern). The date of the text is generally located between the title and the text content and is written in a regular form, so it can be extracted with a regular expression. Information such as the source and author of the text is generally located between the title and the text content, or below the text, and is also written in a regular form, so it can likewise be extracted with a regular expression.
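The regular-expression step can be sketched as follows. The description does not give the actual patterns, so the three patterns below (a numeric date and "Author:" / "Source:" labels) are hypothetical examples, and a real page would need patterns matched to its own language and layout.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns, for illustration only.
DATE_RE = re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}")
AUTHOR_RE = re.compile(r"Author\s*:\s*(\S+)")
SOURCE_RE = re.compile(r"Source\s*:\s*(\S+)")


def extract_additional_info(text: str) -> Dict[str, Optional[str]]:
    """Search the lines around the title and body for date, author and source."""
    date = DATE_RE.search(text)
    author = AUTHOR_RE.search(text)
    source = SOURCE_RE.search(text)
    return {
        "date": date.group(0) if date else None,
        "author": author.group(1) if author else None,
        "source": source.group(1) if source else None,
    }
```

In use, `text` would be the lines between the located title and the beginning of the text, plus the lines just below the end of the text, since those are the positions the description names.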

In this embodiment of the present invention, the access model may be a text object model, such as a DOM tree (Document Object Model, DOM for short, a standard programming interface recommended by the World Wide Web Consortium (W3C) for processing extensible markup language documents).

In the embodiment of the present invention, each unit area may be in a unit of a row. Of course, other units may be selected in the embodiment of the present invention.

In an embodiment of the present invention, the first index value is used to represent attribute information of each unit region, and may include the unit density of each unit area. For ease of understanding, the unit density is described below with the row as the unit area, i.e. the "unit density" is described as the "row density"; the "row" does not limit the protection scope of the technical solution of the present invention, and the "row density" may be adapted to a specific service scenario. In the embodiment of the present invention, the unit density may be obtained as follows. First, the line block of each line is obtained. Taking line 1 as an example, with k set according to the specific situation and here k = 3, the line block of line 1 is "the text of line 1 to line 4". Then, the length of each line block is calculated. Taking the line block of line 1 as an example, after the blank characters of the line block are removed, the total number of characters of the line block is counted, and (the number of punctuation marks of line 1 × k) is added to it. Considering that the text of a web page contains punctuation marks while other parts of the page largely do not, (the number of punctuation marks × k) acts as a weight. Finally, the row density of each row is obtained as: line block length / (k + 1). In the present invention, other methods may be selected to calculate the unit density, which is not limited herein.
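The row density calculation can be illustrated with the sketch below. It reads the weight term as the punctuation count of the line multiplied by k, which is an interpretation of the description; the punctuation set and the function name are hypothetical, and full-width punctuation would need to be added for Chinese pages.

```python
from typing import List

# Assumption: a small ASCII punctuation set, for illustration only.
PUNCTUATION = set(",.;:!?")


def line_density(lines: List[str], index: int, k: int = 3) -> float:
    """Row density of lines[index] using a line block of k + 1 lines."""
    block = lines[index:index + k + 1]  # line `index` plus the next k lines
    # Count characters after removing all whitespace from the block.
    chars = sum(len("".join(line.split())) for line in block)
    # Punctuation of the line itself, weighted by k.
    punct = sum(1 for ch in lines[index] if ch in PUNCTUATION)
    return (chars + punct * k) / (k + 1)
```

With `lines = ["Hello, world.", "Home", "About", "Contact", "x"]` and k = 3, the block of the first line holds 28 non-blank characters and the line has 2 punctuation marks, so its density is (28 + 2 × 3) / 4 = 8.5. Blocks near the end of the page contain fewer than k + 1 lines in this sketch but are still divided by k + 1.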

In this embodiment of the present invention, the second index value is used to represent attribute information of a certain area in a web page, and may include: a characteristic vector value. In the present invention, the similarity value between the feature portion and each unit region can be calculated by using the feature vector value of the feature portion and the feature vector value of each unit region.

For convenience of understanding, fig. 2 to 7 describe the embodiment of the present invention in units of rows, and the "access model" is taken as a "dom tree", the "first index value" is taken as a "row density", and the "second index value" is taken as a "feature vector value" for detailed description, but it is needless to say that the "units of rows" are not used to limit the protection range of the technical solution of the present invention, and the "dom tree", "row density", and "feature vector value" in the present invention may be adaptively adjusted according to a specific service scenario.

Fig. 2 is a schematic diagram of a main flow of a method for extracting a text of a web page according to an embodiment of the present invention. As shown in fig. 2, the method for extracting a text of a web page according to an embodiment of the present invention mainly includes the following flow: step S201, loading a source code of a webpage to be extracted, and carrying out standardization processing on the source code; step S202, constructing a text dom tree according to the standardized source code; step S203, extracting the characteristic information of the webpage according to the dom tree, and determining the title information of the webpage text; step S204, calculating the similarity value between the text of each row and the characteristic information, and the row density of the text of each row; step S205, selecting a suspected text block according to the similarity value and the row density, and then selecting a text line from the suspected text block; step S206, performing upward and downward iterative traversal from the text line, and determining the beginning and the end of the text; step S207, determining additional information of the text.

Step S201 loads the source code of the webpage to be extracted and standardizes it. The specific process may include: loading the source code of the webpage to be extracted by means of Jsoup (a Java library for parsing HTML); parsing the source code and converting the format of the loaded source code; removing script and style content such as JS and CSS; and processing special characters.
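The standardization of step S201 can be illustrated with a minimal sketch. The embodiment relies on Jsoup; the Python below is only a stand-in that uses regular expressions, and the function name `normalize_source` and the exact cleaning rules are assumptions made for illustration, not part of the described method.

```python
import re


def normalize_source(source: str) -> str:
    """Hypothetical normalization pass: strip scripts, styles and comments."""
    # Remove <script> and <style> elements together with their contents.
    source = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", source)
    # Remove HTML comments.
    source = re.sub(r"(?s)<!--.*?-->", "", source)
    # Treat non-breaking spaces as ordinary spaces (one example of a
    # "special character"; a real pipeline would handle more cases).
    source = source.replace("&nbsp;", " ").replace("\u00a0", " ")
    # Collapse runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", source).strip()
```

A regular-expression pass of this kind is fragile on malformed markup, which is one reason a real parser such as Jsoup is the more robust choice.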

Step S202 constructs a text dom tree from the standardized source code. Fig. 3 is a schematic diagram of a standardized source code and its corresponding dom tree. In the present invention, the dom tree can be constructed by means of Jsoup and then stored as a text list in which each piece of text information corresponds to a node tag group. Each line is processed as an object: one line is one text and corresponds to one tag, and the order of the line in the page, its number of links, its number of symbols (punctuation marks), and its number of characters are all stored in the text list. For the dom tree of fig. 3, the format in which text information corresponds to node tag groups may be as follows:

"HTML Tree": html → head → title → text;

"hello!": html → body → table → tr → td → text;

"this is an HTML tree.": html → body → table → tr → td → text.

The text information "hello!" and "this is an HTML tree." correspond to identical node tag groups.

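The text list described above can be sketched as follows. This is an illustrative Python stand-in for the Jsoup-based construction: the class name `TextLineCollector`, the function `build_text_list`, and the dictionary fields are hypothetical, and only the order, tag path, link flag and character count of each line are recorded.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so never enter the tag stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextLineCollector(HTMLParser):
    """Collects each non-empty text node together with its node tag group."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.lines = []   # the resulting text list

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append({
                "order": len(self.lines),               # position in the page
                "text": text,
                "path": "→".join(self.stack + ["text"]),
                "links": self.stack.count("a"),         # > 0 if inside a link
                "chars": len(text),
            })


def build_text_list(source):
    collector = TextLineCollector()
    collector.feed(source)
    collector.close()
    return collector.lines
```

Applied to a page shaped like the one in fig. 3, the two table cells yield the same path, which matches the observation that their node tag groups are identical.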
Step S203 extracts the feature information of the webpage from the dom tree and determines the title information of the webpage text. The dom tree exposes both the feature information and the main-body information of the web page: the head tag corresponds to the feature information, such as title content, keywords and abstract, and the body tag corresponds to the text information of the web page. Text information such as title content, keywords, and abstract is extracted from the dom tree using node tag groups such as html → head → title → text. The extracted title content can then be used to locate the title within the body tag.
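As a rough illustration of step S203, the sketch below reads the title, keywords and abstract from the head of a page. It assumes that keywords and abstract are carried in `meta` elements named `keywords` and `description`, which is a common convention but is not stated in the embodiment; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadFeatureParser(HTMLParser):
    """Collects title text and the keywords/description meta contents."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.features = {"title": "", "keywords": "", "description": ""}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.features[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; body text is ignored here.
        if self.in_title:
            self.features["title"] += data.strip()


def extract_features(source):
    parser = HeadFeatureParser()
    parser.feed(source)
    parser.close()
    return parser.features
```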

Step S204 calculates the similarity value between each line of text and the feature information, and the line density of each line of text. Step S203 yields the feature information of the web page, i.e., the title content, keywords, abstract and the like, and the body text of the web page is correlated with this information to some degree.

Fig. 4 is a schematic diagram of the main steps of calculating the similarity value between each line of text and the feature information according to the method for extracting the body of the web page in the embodiment of the present invention. As shown in fig. 4, the main steps may include: step S401, performing stop-word removal and word segmentation on the feature information to obtain n feature words, and counting the word frequency of each feature word, where stop words are characters or words that are automatically filtered out before or after natural language data is processed in information retrieval, in order to save storage space and improve search efficiency; step S402, calculating the TF-IDF value of each feature word according to the TF-IDF algorithm, where TF-IDF, i.e. term frequency-inverse document frequency, is a common weighting technique in information retrieval and data mining, and the more important a word is to an article, the larger its TF-IDF value; step S403, obtaining a set of feature vectors as the model feature vector value $D = D(W_1, W_2, \ldots, W_n)$ of the webpage to be extracted, where $W_1$ corresponds to the word frequency and the TF-IDF value of the 1st feature word; step S404, traversing each line of text, performing word segmentation, and calculating the feature vector value of each line; step S405, calculating the cosine of the angle between the feature vector value of each line and the model feature vector value as the similarity value between that line of text and the feature information, where the cosine formula is expressed as:

$$\mathrm{Sim}(D, D_i) = \frac{\sum_{k=1}^{n} W_k W_{ik}}{\sqrt{\sum_{k=1}^{n} W_k^{2}}\;\sqrt{\sum_{k=1}^{n} W_{ik}^{2}}}$$

where $\mathrm{Sim}(D, D_i)$ represents the similarity value between the i-th line of text and the model feature vector, and $D_i = D(W_{i1}, W_{i2}, \ldots, W_{in})$ represents the feature vector value of the i-th line.
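Steps S401 to S405 can be sketched in a few lines of Python. Several choices here are assumptions made only for illustration: the whitespace tokenizer and tiny stop list stand in for real word segmentation, the lines of the page are treated as the document collection for the inverse document frequency, and the smoothing term keeps every weight positive. The names `tokenize`, `cosine` and `line_similarities` are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}   # illustrative stop list


def tokenize(text):
    # Placeholder for real word segmentation: lowercase whitespace split.
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def line_similarities(feature_text, lines):
    feature_tokens = tokenize(feature_text)
    vocab = sorted(set(feature_tokens))          # the n feature words
    line_tokens = [tokenize(line) for line in lines]
    n_docs = len(lines) + 1                      # lines plus the feature text

    def idf(word):
        # Document frequency counts the feature text itself as one document.
        df = 1 + sum(word in tokens for tokens in line_tokens)
        return math.log(n_docs / df) + 1.0       # smoothed, always positive

    weights = {w: idf(w) for w in vocab}

    def vector(tokens):
        tf = Counter(tokens)
        return [tf[w] * weights[w] for w in vocab]

    model = vector(feature_tokens)               # D = (W_1, ..., W_n)
    return [cosine(model, vector(tokens)) for tokens in line_tokens]
```

A line that shares no feature word with the title and keywords gets a zero vector and therefore a similarity of 0, while a line using the same feature words in the same proportions scores close to 1.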

Step S205 selects suspected text blocks according to the similarity value and the line density, and then selects a text line from the suspected text blocks. Fig. 5 is a schematic diagram of the main steps of screening out the text line in the method for extracting the text of the webpage according to the embodiment of the invention. As shown in fig. 5, the main steps may include: step S501, obtaining a line density function from the line density of each line of text; step S502, obtaining the suspected text blocks from the sudden-rise and sudden-drop regions of the line density function; step S503, traversing the suspected text blocks to find the line with the largest similarity value as the text line.

Fig. 6 is a schematic diagram of a line density function obtained by the method for extracting the body of a web page according to an embodiment of the present invention. In fig. 6, the horizontal axis represents the line number of each line, and the vertical axis represents the line density of each line. The positions of the suspected text blocks are obtained from the sudden rises and drops of the line density function. For example, with the horizontal axis X1, …, Xn and the vertical axis Y(X1), …, Y(Xn), the start position Xstart and the end position Xend of the text need to be determined, and the algorithm for determining a suspected text block may specifically be as follows:

(1) determining a sudden-rise point Xstart such that Y(Xstart) − Y(X(start−1)) > Y(Xt) × 30%, where Y(Xt) is the maximum value of the line density;

(2) in order to avoid noise, requiring Y(X(start+1)) ≠ 0;

(3) Y(Xend) = 0, i.e., the sudden-drop point has a line density of 0, which indicates the end;

(4) ensuring that the maximum line density between Xstart and Xend reaches eighty percent of the maximum line density, i.e. Y(Xt) × 80%.

Through the above algorithm, lines 49 to 73 and lines 91 to 97 in fig. 6 can be obtained as the suspected text blocks. Of course, in the embodiment of the present invention, other methods may be selected to obtain the suspected text block, which is not limited in the present invention.
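The four conditions above can be sketched as a single pass over the line density values. This is one possible reading, not the definitive algorithm: the function name and parameters are hypothetical, condition (4) is interpreted as "the peak density inside the block must reach 80% of the global maximum", and the returned end index is the last non-zero line before the dip rather than the dip point itself.

```python
def find_suspected_blocks(density, rise=0.3, peak=0.8):
    """Return (start, end) index pairs of suspected text blocks."""
    if not density:
        return []
    y_max = max(density)
    n = len(density)
    blocks = []
    i = 1                      # a rise needs a previous line to compare with
    while i < n:
        # Condition (1): sudden rise relative to the previous line.
        is_rise = density[i] - density[i - 1] > rise * y_max
        # Condition (2): the next line must not be empty (noise guard).
        next_nonzero = i + 1 < n and density[i + 1] != 0
        if is_rise and next_nonzero:
            # Condition (3): extend until the density drops to zero.
            end = i + 1
            while end < n and density[end] != 0:
                end += 1
            # Condition (4), as interpreted here: the block's peak must
            # reach the given fraction of the global maximum.
            if max(density[i:end]) >= peak * y_max:
                blocks.append((i, end - 1))
            i = end
        else:
            i += 1
    return blocks
```

For the density sequence `[0, 0, 5, 0, 40, 50, 45, 0, 0, 20, 0]` this yields the single block `(4, 6)`: the small bump at index 2 is not a sudden rise, and the isolated spike at index 9 is rejected by the noise guard.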

Step S206 performs an upward and downward iterative traversal from the text line to determine the beginning and the end of the text. Fig. 7 is a schematic diagram of the main steps of determining the beginning and the end of the text in the method for extracting the text of the web page according to the embodiment of the invention. As shown in fig. 7, these steps may include: step S701, determining the node tag group corresponding to the text line; step S702, determining the position of the node tag group on the dom tree, and extracting the text from the node tag group; step S703, performing upward and downward iterative traversal with the node tag group as the center; step S704, judging whether the similarity value of each line is greater than a preset similarity threshold; step S705, if it is greater than the threshold, extracting the text from the node tag group corresponding to that line and continuing the iteration; and step S706, if it is not greater than the threshold, stopping the iteration, thereby determining the beginning and the end of the text of the webpage to be extracted.

In the embodiment of the present invention, whether a line meets the preset text condition is determined by comparing the similarity value of the line with the similarity threshold. Alternatively, the link ratio or the symbol ratio of each line may also be used to determine whether the line meets the preset text condition.
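The traversal of steps S703 to S706 can be sketched on a flat list of per-line similarity values. The function name `expand_body` and the flat-list representation are assumptions for illustration; the embodiment walks node tag groups on the dom tree rather than list indices.

```python
def expand_body(similarities, seed, threshold):
    """Grow a range around the seed line while neighbours pass the threshold."""
    start = seed
    # Walk upward until a line fails the similarity condition.
    while start - 1 >= 0 and similarities[start - 1] > threshold:
        start -= 1
    end = seed
    # Walk downward until a line fails the similarity condition.
    while end + 1 < len(similarities) and similarities[end + 1] > threshold:
        end += 1
    return start, end
```

With similarities `[0.0, 0.1, 0.4, 0.9, 0.5, 0.05, 0.6]`, seed index 3 and threshold 0.2, the range grows to `(2, 4)`. The line at index 6 is not reached even though it passes the threshold, because iteration stops at the first failing line in each direction.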

Step S207 determines the additional information of the text. The additional information may include: author, date and source. Because the preceding steps have already located the title content and the text in the dom tree, information such as author, date and source can be extracted with regular expressions.
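A minimal sketch of the regular-expression extraction in step S207 follows. The embodiment does not give the expressions themselves, so the patterns, the labels they look for ("Author", "Source" and their Chinese equivalents), and the function name are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; real pages would need a broader set.
PATTERNS = {
    "author": re.compile(r"(?:Author|作者)\s*[:：]\s*([^|\s]+)"),
    "source": re.compile(r"(?:Source|来源)\s*[:：]\s*([^|\s]+)"),
    "date": re.compile(r"(\d{4}[-/.]\d{1,2}[-/.]\d{1,2})"),
}


def extract_additional_info(lines):
    """Scan candidate lines and keep the first match for each field."""
    info = {"author": None, "date": None, "source": None}
    for line in lines:
        for key, pattern in PATTERNS.items():
            if info[key] is None:
                match = pattern.search(line)
                if match:
                    info[key] = match.group(1)
    return info
```

In practice the candidate lines would be those between the title and the beginning of the text, since that is where such metadata usually sits.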

According to the technical scheme for extracting the webpage text, the beginning and the end of the webpage text can be determined, so that the complete text of the webpage can be intelligently extracted, the labor cost is reduced, and the efficiency of extracting the webpage text is improved. The source code of the webpage to be extracted is standardized, so that the access model can be constructed from the standardized source code, the time for extracting the webpage text is reduced, and the method is applicable to text extraction from various types of webpages. The second index value of the feature portion and the second index value of each unit area of the main body portion are calculated, so that the similarity value between the feature portion and each unit area can be conveniently calculated from the second index values. The suspected text area is selected by the first index value of each unit area, which narrows the text selection range and improves the extraction efficiency. By comparing the similarity values of the unit areas in the suspected text area, the unit area with the maximum similarity value is taken as the unit text area, which improves the accuracy of text extraction. Iterative traversal of the unit areas upward and downward, centered on the unit text area, determines the beginning and the end of the text, so that the complete text of the webpage can be extracted. Whether each unit area meets the preset text condition is judged from several angles, such as the similarity value, the link ratio and/or the symbol ratio, which further improves the accuracy of text extraction. The text additional information of the webpage to be extracted is acquired, which improves the completeness of the text. The first index value may include the unit density of each unit area, so that the suspected text area can be selected by means of this attribute information; the second index value may include a feature vector value, so that the similarity value can be calculated by means of the feature vector value.

Fig. 8 is a schematic diagram of the main modules of an apparatus for extracting the text of a web page according to an embodiment of the present invention. As shown in fig. 8, the apparatus 800 for extracting the text of a web page mainly includes the following modules: a construction module 801, a calculation module 802, a screening module 803, and a determination module 804.

The construction module 801 may be configured to construct an access model according to the webpage to be extracted. The access model may include a feature portion and a main body portion. The calculation module 802 may be configured to calculate the similarity value between each unit area of the main body portion and the feature portion. The screening module 803 may be configured to screen the unit text area from the access model according to the similarity value and the first index value of each unit area. The determination module 804 may be configured to determine the beginning and the end of the text of the webpage to be extracted according to the unit text area, so as to obtain the complete text of the webpage to be extracted.

In this embodiment of the present invention, the construction module 801 may further be configured to: standardize the source code of the webpage to be extracted before constructing the access model according to the webpage to be extracted.

In this embodiment of the present invention, the calculation module 802 may further be configured to: calculate a second index value of the feature portion and a second index value of each unit area of the main body portion; and calculate the similarity value between the feature portion and each unit area by using the second index value of the feature portion and the second index value of each unit area.

In this embodiment of the present invention, the screening module 803 may further be configured to: selecting a suspected text area from the access model according to the first index value; and screening the unit text area from the suspected text area by using the similarity value.

In this embodiment of the present invention, the screening module 803 may further be configured to: and comparing the similarity value of each unit area in the suspected text area, and selecting the unit area with the maximum similarity value as a unit text area.

In this embodiment of the present invention, the determination module 804 may further be configured to: perform iterative traversal of the unit areas upward and downward with the unit text area as the center, judge whether each unit area meets a preset text condition, and stop the iteration if a unit area does not meet the preset text condition, so as to determine the beginning and the end of the text of the webpage to be extracted.

In this embodiment of the present invention, the determination module 804 may further be configured to: judge whether the similarity value of each unit area is greater than a preset similarity threshold, and if so, determine that the unit area meets the preset text condition; and/or judge whether the link ratio of each unit area is smaller than a preset link ratio threshold, and if so, determine that the unit area meets the preset text condition; and/or judge whether the symbol ratio of each unit area is greater than a preset symbol ratio threshold, and if so, determine that the unit area meets the preset text condition.
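The three judgments can be combined in many ways, since the conditions are joined by "and/or". The sketch below shows one hypothetical configuration that requires all three at once. The definitions used here are assumptions: the link ratio is taken as linked characters divided by total characters, the symbol ratio as punctuation marks divided by total characters, and the threshold values are arbitrary placeholders.

```python
import string

# ASCII punctuation plus a few common full-width marks.
PUNCTUATION = set(string.punctuation) | set("，。！？；：、")


def meets_text_condition(text, link_chars, similarity,
                         sim_threshold=0.2, link_threshold=0.5,
                         symbol_threshold=0.01):
    """Return True if a unit area passes all three assumed checks."""
    chars = len(text)
    if chars == 0:
        return False
    link_ratio = link_chars / chars
    symbol_ratio = sum(c in PUNCTUATION for c in text) / chars
    return (similarity > sim_threshold
            and link_ratio < link_threshold
            and symbol_ratio > symbol_threshold)
```

The intuition behind the two extra checks is that navigation bars tend to consist almost entirely of links and contain little punctuation, whereas running text is the opposite.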

In this embodiment of the present invention, the determination module 804 may further be configured to: acquire the text additional information of the webpage to be extracted. The text additional information may include at least one of the following: text title, author, date and source.

In the embodiment of the invention, the access model can be a text object model.

In the embodiment of the present invention, each unit area may be in a unit of a row.

In this embodiment of the present invention, the first index value may be used to represent attribute information of each unit area, and includes: unit density of each unit area.

In this embodiment of the present invention, the second index value may be used to represent attribute information of a certain area in a web page, and includes: a feature vector value.


Fig. 9 shows an exemplary system architecture 900 of a method for extracting a body of a web page or an apparatus for extracting a body of a web page to which an embodiment of the present invention may be applied.

As shown in fig. 9, the system architecture 900 may include terminal devices 901, 902, 903, a network 904, and a server 905. The network 904 is the medium used to provide communication links between the terminal devices 901, 902, 903 and the server 905. The network 904 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 901, 902, 903 to interact with the server 905 over the network 904 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 901, 902, 903, such as (by way of example only) shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software.

The terminal devices 901, 902, 903 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, and desktop computers.

The server 905 may be a server providing various services, for example (by way of example only) a background management server that supports shopping websites browsed by users on the terminal devices 901, 902, 903. The background management server may analyze and otherwise process received data, such as a product information query request, and feed back a processing result (for example, target push information or product information) to the terminal device.

It should be noted that the method for extracting the text of the web page provided by the embodiment of the present invention is generally executed by the server 905, and accordingly, the apparatus for extracting the text of the web page is generally disposed in the server 905.

It should be understood that the number of terminal devices, networks, and servers in fig. 9 are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.

Referring now to FIG. 10, shown is a block diagram of a computer system 1000 suitable for use with a terminal device implementing embodiments of the present invention. The terminal device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The RAM 1003 also stores various programs and data necessary for the operation of the system 1000. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.

The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as necessary, so that a computer program read out from it can be installed into the storage section 1008 as necessary.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1009 and/or installed from the removable medium 1011. When executed by the Central Processing Unit (CPU) 1001, the computer program performs the above-described functions defined in the system of the present invention.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor including a construction module, a calculation module, a screening module, and a determination module. The names of these modules do not in some cases constitute a limitation on the modules themselves; for example, the construction module may also be described as a "module for constructing an access model according to a webpage to be extracted".

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: constructing an access model according to a webpage to be extracted; calculating the similarity value between each unit area of the main body portion and the feature portion; screening a unit text area from the access model according to the similarity value and the first index value of each unit area; and determining the beginning and the end of the text of the webpage to be extracted according to the unit text area, so as to obtain the complete text of the webpage to be extracted.


The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for extracting a webpage text is characterized by comprising the following steps:

constructing an access model according to a webpage to be extracted, wherein the access model comprises: a feature portion and a body portion;

calculating a similarity value between each unit region of the main body portion and the feature portion; wherein a second index value of the feature portion and a second index value of each unit area of the body portion are calculated; calculating similarity values of the feature portion and the unit regions by using the second index value of the feature portion and the second index values of the unit regions; the characteristic part comprises characteristic information of a webpage to be extracted, and the second index value is generated according to the characteristic information;

screening unit text areas from the access model according to the similarity values and the first index values of the unit areas; wherein, according to the first index value of each unit area, a suspected text area is selected from the access model, and the unit text area is screened from the suspected text area by using the similarity value;

determining the beginning and the end of the text of the webpage to be extracted according to the unit text area to obtain the complete text of the webpage to be extracted; wherein, the determining the beginning and the end of the text of the webpage to be extracted according to the unit text area comprises: performing iterative traversal of the unit areas upward and downward by taking the unit text area as a center, judging whether each unit area meets a preset text condition, and stopping the iteration if a unit area does not meet the preset text condition, so as to determine the beginning and the end of the text of the webpage to be extracted;

the judging whether each unit area meets the preset text condition comprises the following steps: judging whether the similarity value of each unit area is greater than a preset similarity threshold value or not, and if so, determining that the unit area meets a preset text condition; and/or judging whether the link ratio of each unit area is smaller than a preset link ratio threshold value, and if so, determining that the unit area meets a preset text condition; and/or judging whether the symbol ratio of each unit area is greater than a preset symbol ratio threshold value, and if so, determining that the unit area meets a preset text condition.
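
As a non-limiting illustration, the upward and downward traversal recited above might be sketched as follows. The `UnitArea` structure, the three threshold constants, and the particular way the "and/or" conditions are combined (similarity above its threshold, or a low link ratio together with a high symbol ratio) are assumptions made for this sketch only; the claim leaves the threshold values and the combination open.

```python
from dataclasses import dataclass


@dataclass
class UnitArea:
    """One unit area (e.g. one line) of the body portion. Hypothetical structure."""
    text: str
    similarity: float    # similarity value against the feature portion
    link_ratio: float    # share of the area's characters that sit inside links
    symbol_ratio: float  # share of the area's characters that are punctuation


# Hypothetical preset thresholds; the claim does not fix their values.
SIMILARITY_THRESHOLD = 0.2
LINK_RATIO_THRESHOLD = 0.5
SYMBOL_RATIO_THRESHOLD = 0.02


def meets_text_condition(area: UnitArea) -> bool:
    """One possible reading of the 'and/or' conditions of the claim."""
    if area.similarity > SIMILARITY_THRESHOLD:
        return True
    return (area.link_ratio < LINK_RATIO_THRESHOLD
            and area.symbol_ratio > SYMBOL_RATIO_THRESHOLD)


def find_text_bounds(areas: list[UnitArea], center: int) -> tuple[int, int]:
    """Expand from the unit text area at index `center` in both directions.

    Each direction stops at the first unit area that fails the preset text
    condition. Returns the inclusive (start, end) indices of the text.
    """
    start = center
    while start > 0 and meets_text_condition(areas[start - 1]):
        start -= 1
    end = center
    while end < len(areas) - 1 and meets_text_condition(areas[end + 1]):
        end += 1
    return start, end
```

Under these assumptions, the complete text would be the concatenation of `areas[start:end + 1]`.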

2. The method of claim 1, wherein prior to building an access model from the web page to be extracted, the method further comprises: and carrying out standardization processing on the source code of the webpage to be extracted.
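
The claim does not state what the standardization processing consists of. Purely as an assumed example, the sketch below unifies line endings, removes comments and `script`/`style` blocks, and drops blank lines so that later line-based processing sees only markup that can carry text. The function name `normalize_source` and the choice of steps are hypothetical, and a regular-expression approach is only adequate for simple pages.

```python
import re

# Matches <script>...</script> and <style>...</style> blocks, across lines.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Matches HTML comments, across lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def normalize_source(html: str) -> str:
    """Return a cleaned copy of the page source (assumed normalization steps)."""
    # Unify Windows and old Mac line endings to "\n".
    html = html.replace("\r\n", "\n").replace("\r", "\n")
    # Remove comments first, then script and style blocks.
    html = COMMENT_RE.sub("", html)
    html = SCRIPT_STYLE_RE.sub("", html)
    # Trim each line and drop the lines that are now empty.
    lines = (line.strip() for line in html.split("\n"))
    return "\n".join(line for line in lines if line)
```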

3. The method of claim 1, wherein using the similarity value to screen the unit text area from the suspected text area comprises:

and comparing the similarity value of each unit area in the suspected text area, and selecting the unit area with the maximum similarity value as a unit text area.
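
By way of example only, the two-stage screening (first index value, then maximum similarity value) might look like the sketch below. It assumes that the first index value is a per-area density and that a suspected text area is any unit area whose density reaches a hypothetical `density_threshold`; neither assumption is fixed by the claims.

```python
from typing import Optional, Sequence


def select_unit_text_area(
    densities: Sequence[float],
    similarities: Sequence[float],
    density_threshold: float,
) -> Optional[int]:
    """Return the index of the unit text area, or None if nothing qualifies.

    Stage 1: keep the unit areas whose first index value (here: density)
             reaches the threshold; these form the suspected text area.
    Stage 2: among them, pick the unit area with the largest similarity value.
    """
    suspected = [i for i, d in enumerate(densities) if d >= density_threshold]
    if not suspected:
        return None
    return max(suspected, key=lambda i: similarities[i])
```

Note that a unit area with a high similarity value but a low density, such as a short title line, is excluded in the first stage and therefore cannot be chosen in the second.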

4. The method according to claim 1, wherein after determining the beginning and the end of the text of the webpage to be extracted according to the unit text area, the method further comprises: acquiring text additional information of the webpage to be extracted, wherein the text additional information comprises at least one of the following: text title, author, date and source.

5. The method of claim 1, wherein the access model is a text object model.

6. The method of claim 1, wherein each unit area is in units of rows.

7. The method according to claim 1, wherein the first index value is used to represent attribute information of each unit area, and includes: unit density of each unit area.
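
The claims do not define how the unit density is computed. One simple possibility, assumed here for illustration only, is to treat each line of the source as a unit area and to count its visible, non-whitespace characters after the tags are removed. The names `TAG_RE` and `line_densities` are hypothetical.

```python
import re

# Matches a single HTML tag such as <p>, </a> or <br/>.
TAG_RE = re.compile(r"<[^>]+>")


def line_densities(html: str) -> list[int]:
    """Return, for each line, the count of non-whitespace characters
    that remain once the tags on that line are stripped."""
    densities = []
    for line in html.splitlines():
        visible = TAG_RE.sub("", line)
        densities.append(len("".join(visible.split())))
    return densities
```

Lines dominated by markup (navigation bars, empty containers) then receive low values, while lines carrying running text receive high values.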

8. The method of claim 1, wherein the second index value is used to represent attribute information of an area in the web page, and comprises: a characteristic vector value.
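
As one assumed realization of a characteristic vector value, the feature portion and each unit area can be mapped to term-frequency vectors and compared by cosine similarity. The claims name neither the vector construction nor the similarity measure, so the tokenization by `\w+`, the use of `Counter`, and the cosine measure below are all choices made for this sketch.

```python
import math
import re
from collections import Counter


def feature_vector(text: str) -> Counter:
    """Term-frequency vector of lower-cased word tokens (assumed tokenization)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors.

    Returns 0.0 when either vector is empty, so that empty unit areas
    never appear similar to the feature portion.
    """
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For text without spaces between words, such as Chinese, a word segmenter or character n-grams would be needed in place of the `\w+` tokenization.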

9. An apparatus for extracting text from a web page, comprising:

the construction module is used for constructing an access model according to the webpage to be extracted, and the access model comprises: a feature portion and a body portion;

a calculation module for calculating a similarity value between each unit area of the body portion and the feature portion; the calculation module is further used for calculating a second index value of the feature portion and a second index value of each unit area of the body portion, and for calculating the similarity value between the feature portion and each unit area by using the second index value of the feature portion and the second index value of each unit area; the feature portion comprises feature information of the webpage to be extracted, and the second index value is generated according to the feature information;

the screening module is used for screening the unit text area from the access model according to the similarity value and the first index value of each unit area;

the determining module is used for determining the beginning and the end of the text of the webpage to be extracted according to the unit text area so as to obtain the complete text of the webpage to be extracted; wherein,

the determining the beginning and the end of the text of the webpage to be extracted according to the unit text area comprises: performing iterative traversal of the unit areas upward and downward by taking the unit text area as a center, judging whether each unit area meets a preset text condition, and stopping the iteration if a unit area does not meet the preset text condition, so as to determine the beginning and the end of the text of the webpage to be extracted;

10. The apparatus of claim 9, wherein the build module is further configured to: before an access model is built according to a webpage to be extracted, the source code of the webpage to be extracted is subjected to standardization processing.

11. The apparatus of claim 9, wherein the screening module is further configured to:

12. The apparatus of claim 9, wherein the determining module is further configured to: acquiring text additional information of the webpage to be extracted, wherein the text additional information comprises at least one of the following information: text title, author, date and source.

13. The apparatus of claim 9, wherein the access model is a text object model.

14. The apparatus of claim 9, wherein each unit area is in units of rows.

15. The apparatus according to claim 9, wherein the first index value is used to represent attribute information of each unit area, and includes: unit density of each unit area.

16. The apparatus of claim 9, wherein the second index value is used to represent attribute information of an area in a web page, and comprises: a characteristic vector value.

17. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.

18. A computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.