CN110795933B - Webpage text recognition processing method and device

Info

Publication number: CN110795933B
Application number: CN201910945459.1A
Authority: CN
Inventors: 禹庆华; 叶盛; 李凯; 沈鹏; 李国辉
Original assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Current assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2023-10-31
Anticipated expiration: 2039-09-30
Also published as: CN110795933A

Abstract

Embodiments of the invention disclose a method and a device for identifying and processing the body text of a web page. The method comprises: acquiring the source code of a web page to be identified, and clearing all web page tags in the source code to obtain web page text that includes blank lines; partitioning the web page text by blank lines to obtain a plurality of text blocks, with blank lines between the text blocks; and counting the number of characters in each text block, determining the boundaries of the web page body text according to the number of characters in each text block, and identifying the body text of the web page to be identified according to those boundaries. Because the boundaries are determined from the per-block character counts and the body text is identified from the boundaries, the method is suitable for extracting the body text of all types of web pages, the extraction process is simple, and the accuracy and generalization of body-text extraction are greatly improved.

Description

Webpage text recognition processing method and device

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for identifying and processing a webpage text.

Background

Current web page body-text extraction mainly uses the DOM (Document Object Model) to parse the HTML (HyperText Markup Language) source code of a web page into a tree structure, and then extracts the body text from that tree according to preset rules. However, the HTML structure of web pages is diverse and each web page is designed differently; for example, the page structures of e-commerce sites and news sites differ greatly.

As a result, existing body-text extraction methods have low stability: they extract accurately on some types of web pages but inaccurately on others, may pick up peripheral information from the edges of the page, and generalize poorly.

Disclosure of Invention

In view of the above problems with existing methods, embodiments of the invention provide a method and a device for identifying and processing a web page text.

In a first aspect, an embodiment of the present invention provides a method for identifying and processing a web page text, including:

acquiring a webpage source code of a webpage to be identified, and clearing all webpage labels in the webpage source code to obtain a webpage text comprising blank lines;

partitioning the webpage text according to blank lines to obtain a plurality of text blocks, wherein blank lines are arranged between the text blocks;

counting the number of characters of each character block, determining the boundary of the webpage text according to the number of characters of each character block, and identifying the webpage text in the webpage to be identified according to the boundary of the webpage text.

Optionally, determining the boundary of the web page text according to the number of characters of each text block specifically includes:

calculating a character-count difference from the number of characters of the current text block and the number of characters of the previous text block;

if the character-count difference is greater than a threshold, determining that the starting position of the current text block or the ending position of the previous text block is a boundary of the web page body text.

Optionally, if the character-count difference is greater than the threshold, determining that the starting position of the current text block or the ending position of the previous text block is a boundary of the web page body text specifically includes:

if the character-count difference is positive, determining that the starting position of the current text block is a starting boundary of the web page body text;

and if the character-count difference is negative, determining that the ending position of the previous text block is an ending boundary of the web page body text.

Optionally, identifying the web page body text in the web page to be identified according to the boundaries of the web page body text specifically includes:

recognizing the characters between each starting boundary and the next ending boundary as a body-text part;

and combining all the body-text parts to obtain the web page body text of the web page to be identified.

Optionally, the method for identifying and processing the webpage text further comprises the following steps:

if the numbers of starting boundaries and ending boundaries of the web page body text are equal and the starting boundaries and ending boundaries appear alternately, determining that the starting boundaries and ending boundaries of the web page body text are correctly identified.

Optionally, the threshold is determined according to the average number of blank blocks or the average number of text blocks;

wherein the average number of the blank blocks is the average value of the number of characters of each blank block;

the average number of the character blocks is the average value of the character number of each character block.

Optionally, acquiring the web page source code of the web page to be identified and clearing all web page tags in the web page source code to obtain a web page text including blank lines specifically includes:

crawling the web page source code of the web page to be identified with a web crawler, identifying all web page tags in the web page source code according to preset tags, and clearing the web page tags to obtain the web page text including blank lines.

In a second aspect, an embodiment of the present invention further provides a device for identifying and processing a web page text, including:

the tag clearing module is used for acquiring the webpage source codes of the webpages to be identified, clearing all webpage tags in the webpage source codes and obtaining webpage texts comprising blank rows;

the text block module is used for dividing the webpage text into blocks according to blank lines to obtain a plurality of text blocks, and blank lines are arranged among the text blocks;

the boundary determining module is used for counting the number of characters of each character block, determining the boundary of the webpage text according to the number of characters of each character block, and identifying the webpage text in the webpage to be identified according to the boundary of the webpage text.

Optionally, the boundary determining module is specifically configured to:

Optionally, the device for identifying and processing the web page text further comprises:

a boundary judging module, used for determining that the starting boundaries and ending boundaries of the web page body text are correctly identified if their numbers are equal and the starting boundaries and ending boundaries appear alternately.

Optionally, the tag clearing module is specifically configured to:

In a third aspect, an embodiment of the present invention further provides an electronic device, including:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, which are called by the processor to perform the method described above.

In a fourth aspect, embodiments of the present invention also propose a non-transitory computer-readable storage medium storing a computer program, which causes the computer to carry out the above-mentioned method.

According to the technical scheme, the boundary of the webpage text is determined by counting the number of the characters of each character block, the webpage text in the webpage to be identified is identified according to the boundary, the method is suitable for extracting all types of webpage text, the extraction process is simple, and the accuracy and generalization of webpage text extraction are greatly improved.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other drawings can be obtained from these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flow chart of a method for identifying and processing a web page text according to an embodiment of the present invention;

Fig. 2 is a schematic diagram comparing the web page source code before and after the web page tags are removed, according to an embodiment of the present invention;

Fig. 3 is a diagram comparing the character-count difference and the number of blank lines for each text block, according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a device for identifying and processing a web page text according to an embodiment of the present invention;

Fig. 5 is a logic block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following describes the embodiments of the present invention further with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.

Fig. 1 shows a flow chart of a method for identifying and processing a web page text, which includes:

S101, acquiring a web page source code of a web page to be identified, and clearing all web page tags in the web page source code to obtain a web page text comprising blank lines.

The web page to be identified is the web page from which the body text is to be identified.

The web page source code is the source code in which the web page is written; the source code displayed in the upper interface of Fig. 2 is web page source code.

A web page tag is a tag in the web page source code that marks different kinds of source code, such as <em>.

The web page text is what remains of the web page source code after the web page tags are cleared, that is, only the text and the blank lines; the content displayed in the lower interface of Fig. 2 is web page text.
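A minimal Python sketch of the tag-clearing part of S101 is given here for illustration. The regular expression and the name `strip_tags` are assumptions and do not come from the source, which only states that tags are identified according to preset tags. Tags are removed line by line, so a line that contained only tags becomes a blank line.

```python
import re

# Assumption: anything shaped like <...> is treated as a tag. The source only
# says tags are identified "according to preset tags".
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(source: str) -> str:
    """Remove tags from each line; lines left with no text become blank lines."""
    cleaned = []
    for line in source.splitlines():
        cleaned.append(TAG_PATTERN.sub("", line).strip())
    return "\n".join(cleaned)


html = "<div>\n<h1>Title</h1>\n</div>\n<p>Body text.</p>"
print(strip_tags(html))
```

This simple pattern does not handle tags that span several lines or the contents of script and style elements; a practical implementation would need to treat those cases separately.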

S102, partitioning the webpage text according to blank lines to obtain a plurality of text blocks, wherein blank lines are arranged between the text blocks.

A text block is a run of consecutive lines of text that contains no blank line; for example, there are 4 text blocks in the lower interface of Fig. 2.
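The partitioning in S102 might be sketched in Python as follows. The function name `split_blocks` and the sample text are hypothetical; the sketch groups consecutive lines into alternating runs of blank and non-blank lines, so that both the text blocks and the blank-line blocks between them are available.

```python
from itertools import groupby


def split_blocks(text: str):
    """Return (is_blank, lines) runs in top-down order."""
    runs = []
    lines = text.splitlines()
    for is_blank, group in groupby(lines, key=lambda ln: ln.strip() == ""):
        runs.append((is_blank, list(group)))
    return runs


text = "Home\n\n\nFirst line of body.\nSecond line of body.\n\nFooter"
runs = split_blocks(text)

# Text blocks are the non-blank runs; blank-line blocks are the blank runs.
text_blocks = [lines for is_blank, lines in runs if not is_blank]
blank_block_sizes = [len(lines) for is_blank, lines in runs if is_blank]
print(len(text_blocks), blank_block_sizes)
```

Keeping the blank runs alongside the text blocks makes the number of blank lines next to each text block easy to look up later.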

S103, counting the number of characters of each character block, determining the boundary of the webpage text according to the number of characters of each character block, and identifying the webpage text in the webpage to be identified according to the boundary of the webpage text.

The number of characters is the total number of characters in one text block.

The boundaries of the web page body text are the starting positions and ending positions of the body-text parts in the web page.

The web page body text is the body-text portion of the web page to be identified, and does not include the titles at any level in the web page to be identified.

In this embodiment, the boundaries of the web page body text are determined by counting the number of characters in each text block, and the body text of the web page to be identified is identified according to those boundaries. The approach is suitable for extracting the body text of all types of web pages, the extraction process is simple, and the accuracy and generalization of body-text extraction are greatly improved.

Further, on the basis of the above method embodiment, determining the boundary of the web page text according to the number of words in each word block in S103 specifically includes:

calculating the character-count difference between the current text block and the previous text block from the number of characters of the current text block and the number of characters of the previous text block.

The character-count difference is the number of characters of the current text block minus the number of characters of the previous text block.
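As an illustration, the character counts and the differences between adjacent text blocks could be computed as in the Python sketch below. The helper names are hypothetical, and not counting whitespace is an assumption; the source does not say whether whitespace counts as characters.

```python
def char_counts(text_blocks):
    """Count characters per text block (assumption: whitespace is not counted)."""
    counts = []
    for block in text_blocks:
        counts.append(sum(len("".join(line.split())) for line in block))
    return counts


def count_diffs(counts):
    """Difference for each block after the first: current count minus previous count."""
    return [cur - prev for prev, cur in zip(counts, counts[1:])]


blocks = [["Home"], ["abc def", "gh"], ["x"]]
counts = char_counts(blocks)
diffs = count_diffs(counts)
print(counts, diffs)
```

The first text block has no previous block, so the list of differences is one element shorter than the list of counts.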

The threshold is determined based on the average number of blank blocks or the average number of text blocks.

The average number of the blank blocks is the average value of the number of characters of each blank block.

For example, if the average number of blank blocks is n, the threshold may be 2n; or, if the average number of text blocks is m, the threshold may be 2m.
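A small Python sketch of deriving the threshold from an average is shown here. The function name and the sample counts are hypothetical; the factor of 2 follows the example in the text, where an average of m gives a threshold of 2m.

```python
from statistics import mean


def average_based_threshold(values, factor=2):
    """Return factor times the average of the per-block values."""
    return factor * mean(values)


# Hypothetical character counts of four text blocks; their average m is 250.
text_block_counts = [10, 500, 20, 470]
threshold = average_based_threshold(text_block_counts)
print(threshold)
```

The same helper could be given the per-blank-block values instead, to obtain the 2n variant described in the text.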

Calculating the character-count difference gives the difference in character count between adjacent text blocks. When the difference is large, that is, when the current text block has far more characters than the previous text block, the previous block contains only a few characters and is likely something short such as a title, while the current text block is body text of the web page.

Further, if the character-count difference is greater than the threshold, determining that the starting position of the current text block or the ending position of the previous text block is a boundary of the web page body text specifically includes:

if the character-count difference is positive, determining the starting position of the current text block as a starting boundary of the web page body text.

Specifically, a positive difference means that the current text block has more characters than the previous text block, that is, the current block is long and the previous block is short, which indicates that the starting position of the current text block is a starting boundary of the web page body text.

Similarly, a negative difference means that the current text block has far fewer characters than the previous text block, that is, the current block is short and the previous block is long, which indicates that the ending position of the previous text block is an ending boundary of the web page body text.

The starting and ending boundaries of the web page body text can thus be determined quickly from the sign of the character-count difference.

As shown in Fig. 2, the upper part is the HTML source code fetched by the crawler, and the lower part is the result of removing the HTML tags. For the stripped result, the number of blank lines in each blank-line block and the character-count difference between the adjacent text blocks are counted. For example, suppose a blank block contains 5 blank lines, the text block above it has 10 characters, and the text block below it has 500 characters; the character-count difference between the two adjacent text blocks is then 490 (500-10=490). If the threshold is 300, the starting position of the lower text block is a starting boundary of the web page body text. The starting and ending boundaries of each body-text part are found in this way, and the web page body text is finally obtained.
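The threshold test and the sign rule can be sketched in Python as follows. The function name and the tuple representation of a boundary are hypothetical. Comparing the magnitude of the difference with the threshold is an interpretation: the text says only that the difference is "greater than a threshold" and then distinguishes positive from negative differences.

```python
def find_boundaries(counts, threshold):
    """Return (kind, block_index) pairs, where kind is 'start' or 'end'."""
    boundaries = []
    for i in range(1, len(counts)):
        diff = counts[i] - counts[i - 1]
        # Assumption: the magnitude of the difference is compared with the threshold.
        if abs(diff) <= threshold:
            continue
        if diff > 0:
            boundaries.append(("start", i))    # current block starts a body part
        else:
            boundaries.append(("end", i - 1))  # previous block ends a body part
    return boundaries


# The first two counts mirror the example in the text: 10 characters above the
# blank block, 500 below it, threshold 300. The last two counts are made up.
counts = [10, 500, 480, 15]
print(find_boundaries(counts, 300))
```

With these counts the difference of 490 marks block 1 as a starting boundary, the small drop from 500 to 480 is ignored, and the drop to 15 marks block 2 as an ending boundary.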

Further, on the basis of the above method embodiment, identifying the web page text in the web page to be identified according to the boundary of the web page text in S103 specifically includes:

Specifically, the character-count differences of the text blocks are collected, as shown in Fig. 3. The abscissa is the serial number of each text block, and each text block has two statistics: the first is the character-count difference, and the second is the number of adjacent blank lines. It can be seen that at positions 3 and 11 the character-count difference becomes large and the number of blank lines is large, so each of these positions is the start of a body-text part. Positions 7 and 14 are similar, except that the character-count difference is negative, so each of these positions is the end of a body-text part. Fig. 3 thus reveals two body-text parts, 3-7 being the first part and 11-14 the second, and together they make up the body text of the whole web page.

Each body-text part can be determined quickly from its starting boundary and ending boundary, and the body text of the web page to be identified can be obtained quickly by combining the body-text parts.
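A Python sketch of pairing each starting boundary with the next ending boundary and combining the resulting parts is shown here. The function name, the boundary representation as (kind, block index) pairs, and the sample blocks are all hypothetical.

```python
def extract_body(text_blocks, boundaries):
    """Join the blocks lying between each start boundary and the next end boundary."""
    parts = []
    start = None
    for kind, index in boundaries:
        if kind == "start" and start is None:
            start = index
        elif kind == "end" and start is not None:
            # Both boundary blocks are part of the body-text part.
            parts.extend(text_blocks[start:index + 1])
            start = None
    # Parts are kept in top-down order and separated by one blank line.
    return "\n\n".join("\n".join(block) for block in parts)


blocks = [["nav"], ["body A1", "body A2"], ["body B"], ["ad"], ["body C"], ["footer"]]
boundaries = [("start", 1), ("end", 2), ("start", 4), ("end", 4)]
print(extract_body(blocks, boundaries))
```

Here blocks 1-2 form the first part and block 4 the second, in the same way that Fig. 3 yields the two parts 3-7 and 11-14.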

Further, on the basis of the above method embodiment, the method for identifying and processing the web page text further includes:

S104, if the numbers of starting boundaries and ending boundaries of the web page body text are equal and the starting boundaries and ending boundaries appear alternately, determining that the starting boundaries and ending boundaries of the web page body text are correctly identified.

Specifically, when the numbers of identified starting boundaries and ending boundaries are not equal, or at least 2 starting boundaries appear consecutively, or at least 2 ending boundaries appear consecutively, this indicates that the identified starting and ending boundaries of the web page body text are wrong. The identified starting and ending boundaries must therefore be verified as correct in order to obtain the correct web page body text.
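The check in S104 might look like the following Python sketch. The function name and the list-of-strings representation are hypothetical. The sketch tests exactly the two stated conditions: equal numbers of starting and ending boundaries, and no two boundaries of the same kind in a row.

```python
def boundaries_valid(kinds):
    """kinds is the top-down sequence of boundary kinds, 'start' or 'end'."""
    if kinds.count("start") != kinds.count("end"):
        return False
    # Alternation: no two consecutive boundaries may be of the same kind.
    return all(a != b for a, b in zip(kinds, kinds[1:]))


print(boundaries_valid(["start", "end", "start", "end"]))   # alternating, equal counts
print(boundaries_valid(["start", "start", "end", "end"]))   # two starts in a row
```

The source does not state that the first boundary must be a starting boundary, so the sketch does not enforce that either.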

Further, on the basis of the above method embodiment, S101 specifically includes:

Specifically, a web crawler first fetches the HTML source code of the web page. The tags are then removed from the HTML, leaving the text of the page together with all the blank lines that remain after tag removal. Next, the distribution of blank lines, that is, the blank-line blocks, is counted; for example, 5 consecutive blank lines form one blank-line block. At the same time, the number of characters between adjacent blank-line blocks, that is, in each text block, is counted, and the character-count difference between each two adjacent text blocks is calculated. Finally, the start and end positions of the web page body text are located: a threshold is set, and positions with many blank lines and a large character-count difference are selected as the starting or ending boundary of a body-text part. If the difference is positive, the boundary is a starting boundary; otherwise, it is an ending boundary. All the body-text parts are combined in top-down order to form the body text of the web page.
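The whole procedure can be condensed into one short Python sketch, given here under several assumptions. The crawling step is omitted and the HTML is passed in as a string; the tag pattern, the function name, and the fixed threshold argument are illustrative; and the blank-line count, which the text also uses when selecting boundaries, is left out for brevity.

```python
import re
from itertools import groupby

TAG = re.compile(r"<[^>]*>")  # assumption: a simple pattern for tags


def extract_main_text(html: str, threshold: int) -> str:
    # 1. Remove tags line by line; tag-only lines become blank lines.
    lines = [TAG.sub("", ln).strip() for ln in html.splitlines()]
    # 2. Group consecutive non-blank lines into text blocks.
    blocks = [list(g) for blank, g in groupby(lines, key=lambda ln: ln == "") if not blank]
    # 3. Count the characters in each text block.
    counts = [sum(len(ln) for ln in block) for block in blocks]
    # 4. Walk the differences: a large rise opens a part, a large drop closes it.
    parts, start = [], None
    for i in range(1, len(counts)):
        diff = counts[i] - counts[i - 1]
        if diff > threshold and start is None:
            start = i
        elif diff < -threshold and start is not None:
            parts.extend(blocks[start:i])  # the part ends with the previous block
            start = None
    # 5. Combine the parts in top-down order.
    return "\n\n".join("\n".join(block) for block in parts)


page = ("<nav>Home</nav>\n<br>\n<p>" + "a" * 50 + "</p>\n<p>" + "b" * 40
        + "</p>\n<br>\n<span>Footer</span>")
print(extract_main_text(page, threshold=30))
```

Because boundaries are found only from differences between adjacent blocks, a body-text part that begins at the first block or runs to the last block is not detected by this sketch.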

The method provided by the embodiment does not depend on DOM analysis, searches the boundary of the text by counting the blank area and the text density, greatly improves the generalization of text extraction, and has higher accuracy.

Fig. 4 shows a schematic structural diagram of a device for identifying and processing a web page text according to the present embodiment, where the device includes: a tag clearing module 401, a text blocking module 402, and a boundary determining module 403, wherein:

the tag clearing module 401 is configured to obtain a web page source code of a web page to be identified, clear all web page tags in the web page source code, and obtain a web page text including blank lines;

the text blocking module 402 is configured to block the web page text according to blank lines to obtain a plurality of text blocks, where blank lines are between the text blocks;

the boundary determining module 403 is configured to count the number of words in each word block, determine the boundary of the web page text according to the number of words in each word block, and identify the web page text in the web page to be identified according to the boundary of the web page text.

Specifically, the tag clearing module 401 obtains the web page source code of the web page to be identified, and clears all web page tags in the web page source code to obtain a web page text including blank lines; the text blocking module 402 blocks the web page text according to blank lines to obtain a plurality of text blocks, wherein blank lines are arranged between the text blocks; the boundary determining module 403 counts the number of words in each word block, determines the boundary of the web page text according to the number of words in each word block, and identifies the web page text in the web page to be identified according to the boundary of the web page text.

Further, on the basis of the above apparatus embodiment, the boundary determining module 403 is specifically configured to:

Further, on the basis of the above device embodiment, the device for identifying and processing the web page text further includes:

Further, on the basis of the above device embodiment, the threshold is determined according to an average number of blank blocks or an average number of text blocks;

Further, on the basis of the above device embodiment, the tag removing module 401 is specifically configured to:

The device for identifying and processing the text of the web page in this embodiment may be used to execute the method embodiment, and its principle and technical effects are similar, and are not described herein again.

Referring to fig. 5, the electronic device includes: a processor (processor) 501, a memory (memory) 502, and a bus 503;

wherein:

the processor 501 and the memory 502 complete communication with each other via the bus 503;

the processor 501 is configured to invoke the program instructions in the memory 502 to perform the methods provided by the method embodiments described above.

The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the method embodiments described above.

The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

It should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The method for identifying and processing the webpage text is characterized by comprising the following steps of:

counting the number of characters of each character block, determining the boundary of the webpage text according to the number of characters of each character block, and identifying the webpage text in the webpage to be identified according to the boundary of the webpage text;

the determining the boundary of the webpage text according to the text number of each text block specifically comprises the following steps:

if the difference value of the number of the characters is larger than a threshold value, determining that the starting position of the current character block or the ending position of the last character block is the boundary of the text of the webpage; and determining the starting boundary and the ending boundary of the text of the webpage by judging the positive and negative of the difference value of the number of characters.

2. The method for recognizing and processing the text of the web page according to claim 1, wherein if the difference between the number of words is greater than a threshold, determining that the start position of the current word block or the end position of the last word block is the boundary of the text of the web page comprises:

3. The method for identifying and processing the web page text according to claim 2, wherein the identifying the web page text in the web page to be identified according to the boundary of the web page text specifically comprises:

4. The web page text recognition processing method according to claim 2, wherein the web page text recognition processing method further comprises:

5. The method for recognizing and processing the text of the web page according to claim 1, wherein the threshold is determined according to an average number of blank blocks or an average number of text blocks;

6. The method for recognizing and processing the text of a web page according to claim 1, wherein the steps of obtaining the web page source code of the web page to be recognized, and removing all the web page tags in the web page source code to obtain the web page text including blank lines comprise:

7. An identification processing device for web page text, characterized by comprising:

the boundary determining module is used for counting the number of characters of each character block, determining the boundary of the webpage text according to the number of characters of each character block, and identifying the webpage text in the webpage to be identified according to the boundary of the webpage text;

the boundary determining module is specifically configured to:

8. The apparatus for recognizing and processing a web page text according to claim 7, wherein the boundary determining module is specifically configured to:

9. The apparatus for recognizing and processing a web page text according to claim 8, wherein the boundary determining module is specifically configured to:

10. The apparatus according to claim 8, wherein the apparatus further comprises:

11. The apparatus according to claim 7, wherein the threshold is determined according to an average number of blank blocks or an average number of text blocks;

12. The apparatus for recognizing and processing a web page text according to claim 7, wherein the tag removing module is specifically configured to:

13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of identifying the text of a web page as claimed in any one of claims 1 to 6 when the program is executed by the processor.

14. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of identifying a web page body as claimed in any one of claims 1 to 6.