CN110795933B - Webpage text recognition processing method and device - Google Patents

Webpage text recognition processing method and device

Info

Publication number
CN110795933B
Authority
CN
China
Prior art keywords
text
webpage
boundary
characters
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910945459.1A
Other languages
Chinese (zh)
Other versions
CN110795933A (en)
Inventor
禹庆华
叶盛
李凯
沈鹏
李国辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Original Assignee
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Application filed by Qianxin Technology Group Co Ltd and Secworld Information Technology Beijing Co Ltd
Priority to CN201910945459.1A
Publication of CN110795933A
Application granted
Publication of CN110795933B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Transfer Between Computers (AREA)
  • Document Processing Apparatus (AREA)

Abstract

An embodiment of the invention discloses a method and a device for identifying and processing web page body text. The method comprises: acquiring the web page source code of a web page to be identified, and clearing all web page tags in the source code to obtain a web page text that includes blank lines; partitioning the web page text at the blank lines to obtain a plurality of text blocks separated by blank lines; and counting the number of characters in each text block, determining the boundaries of the web page body text according to the character count of each text block, and identifying the body text of the web page to be identified according to those boundaries. Because the boundaries are determined from per-block character counts, the method is suitable for extracting body text from all types of web pages, the extraction process is simple, and the accuracy and generalization of web page body text extraction are greatly improved.

Description

Webpage text recognition processing method and device
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for identifying and processing a webpage text.
Background
Current web page body text extraction mainly uses the DOM (Document Object Model) to parse the HTML (HyperText Markup Language) source code of a web page into a tree structure, and then extracts the body text by analyzing that tree according to preset rules. However, the HTML structure of web pages is highly varied and every page is designed differently; for example, the page structures of e-commerce sites and news sites differ considerably.
As a result, existing extraction methods are unstable: they extract accurately from some types of web pages but inaccurately from others, may also pick up information from the margins of the page, and generalize poorly.
Disclosure of Invention
In view of the above problems with existing methods, embodiments of the invention provide a method and a device for identifying and processing web page body text.
In a first aspect, an embodiment of the present invention provides a method for identifying and processing a web page text, including:
acquiring the web page source code of a web page to be identified, and clearing all web page tags in the source code to obtain a web page text that includes blank lines;
partitioning the web page text at the blank lines to obtain a plurality of text blocks, the text blocks being separated by blank lines;
counting the number of characters in each text block, determining the boundaries of the web page body text according to the character count of each text block, and identifying the body text of the web page to be identified according to those boundaries.
Optionally, determining the boundaries of the web page body text according to the character count of each text block specifically includes:
calculating the character-count difference between the current text block and the previous text block from their respective character counts;
if the character-count difference is greater than a threshold, determining that the starting position of the current text block or the ending position of the previous text block is a boundary of the web page body text.
Optionally, determining that the starting position of the current text block or the ending position of the previous text block is a boundary of the web page body text if the character-count difference is greater than the threshold specifically includes:
if the character-count difference is positive, determining that the starting position of the current text block is a starting boundary of the web page body text;
if the character-count difference is negative, determining that the ending position of the previous text block is an ending boundary of the web page body text.
Optionally, identifying the body text of the web page to be identified according to the boundaries specifically includes:
identifying the text between each starting boundary and the next ending boundary as a body text part;
combining all the body text parts to obtain the body text of the web page to be identified.
Optionally, the method for identifying and processing web page body text further includes:
if the numbers of starting boundaries and ending boundaries are equal and the starting boundaries and ending boundaries alternate, determining that the starting boundaries and ending boundaries of the web page body text are correctly identified.
Optionally, the threshold is determined according to the blank-block average or the text-block average;
wherein the blank-block average is the average of the number of characters of each blank block;
the text-block average is the average of the character counts of the text blocks.
Optionally, acquiring the web page source code of the web page to be identified and clearing all web page tags in the source code to obtain a web page text that includes blank lines specifically includes:
crawling the web page source code of the web page to be identified with a web crawler, identifying all web page tags in the source code according to preset tags, and clearing those tags to obtain the web page text that includes blank lines.
In a second aspect, an embodiment of the present invention further provides a device for identifying and processing a web page text, including:
a tag clearing module, configured to acquire the web page source code of a web page to be identified and clear all web page tags in the source code to obtain a web page text that includes blank lines;
a text blocking module, configured to partition the web page text at the blank lines to obtain a plurality of text blocks, the text blocks being separated by blank lines;
a boundary determining module, configured to count the number of characters in each text block, determine the boundaries of the web page body text according to the character count of each text block, and identify the body text of the web page to be identified according to those boundaries.
Optionally, the boundary determining module is specifically configured to:
calculate the character-count difference between the current text block and the previous text block from their respective character counts;
if the character-count difference is greater than a threshold, determine that the starting position of the current text block or the ending position of the previous text block is a boundary of the web page body text.
Optionally, the boundary determining module is specifically configured to:
if the character-count difference is positive, determine that the starting position of the current text block is a starting boundary of the web page body text;
if the character-count difference is negative, determine that the ending position of the previous text block is an ending boundary of the web page body text.
Optionally, the boundary determining module is specifically configured to:
identify the text between each starting boundary and the next ending boundary as a body text part;
combine all the body text parts to obtain the body text of the web page to be identified.
Optionally, the device for identifying and processing web page body text further comprises:
a boundary judging module, configured to determine that the starting boundaries and ending boundaries of the web page body text are correctly identified if the numbers of starting boundaries and ending boundaries are equal and the starting boundaries and ending boundaries alternate.
Optionally, the threshold is determined according to the blank-block average or the text-block average;
wherein the blank-block average is the average of the number of characters of each blank block;
the text-block average is the average of the character counts of the text blocks.
Optionally, the tag clearing module is specifically configured to:
crawl the web page source code of the web page to be identified with a web crawler, identify all web page tags in the source code according to preset tags, and clear those tags to obtain the web page text that includes blank lines.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, which are called by the processor to perform the method described above.
In a fourth aspect, embodiments of the present invention also propose a non-transitory computer-readable storage medium storing a computer program, which causes the computer to carry out the above-mentioned method.
According to the above technical solution, the boundaries of the web page body text are determined by counting the number of characters in each text block, and the body text of the web page to be identified is identified according to those boundaries. The method is suitable for extracting body text from all types of web pages, the extraction process is simple, and the accuracy and generalization of web page body text extraction are greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other drawings can be obtained from these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a method for identifying and processing web page body text according to an embodiment of the present invention;
Fig. 2 is a schematic diagram comparing web page source code before and after the web page tags are removed, according to an embodiment of the present invention;
Fig. 3 is a diagram of the character-count difference and the number of blank lines for each text block, according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for identifying and processing web page body text according to an embodiment of the present invention;
Fig. 5 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Fig. 1 shows a flow chart of a method for identifying and processing web page body text according to this embodiment, which includes:
S101, acquiring the web page source code of a web page to be identified, and clearing all web page tags in the source code to obtain a web page text that includes blank lines.
The web page to be identified is a web page whose body text needs to be identified.
The web page source code is the source code in which the web page is written; the code displayed in the upper interface of Fig. 2 is web page source code.
A web page tag is a tag in the web page source code that marks up different parts of the source, such as <em>.
The web page text is what remains of the web page source code after the tags are cleared, namely the text together with the blank lines left behind; the content displayed in the lower interface of Fig. 2 is the web page text.
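As a minimal sketch of the tag-clearing step S101, the following Python function removes tags with a regular expression while keeping every line break, so that lines which held only tags turn into blank lines. The name `strip_tags` and the pattern `TAG_RE` are hypothetical and not taken from the patent; the patent speaks of identifying tags "according to preset tags", and a regular expression is only one possible stand-in for that.

```python
import re

# Hypothetical pattern: HTML comments, or anything that looks like a tag.
TAG_RE = re.compile(r"<!--.*?-->|<[^>]+>", re.DOTALL)


def strip_tags(source: str) -> str:
    """Remove tags but keep line breaks, so tag-only lines become blank lines."""

    def keep_newlines(match: re.Match) -> str:
        # A tag may span several lines; keep its newlines so the line
        # structure of the page is not disturbed.
        return "\n" * match.group(0).count("\n")

    text = TAG_RE.sub(keep_newlines, source)
    # Turn whitespace-only lines into truly empty lines.
    return "\n".join(line.strip() for line in text.split("\n"))


page = "<div>\n<h1>Title</h1>\n</div>\n<p>Body text here.</p>"
print(strip_tags(page).split("\n"))  # ['', 'Title', '', 'Body text here.']
```

This sketch does not drop the contents of script or style elements; a real implementation would probably need to handle those separately.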
S102, partitioning the web page text at the blank lines to obtain a plurality of text blocks, the text blocks being separated by blank lines.
A text block is a run of consecutive lines of text containing no blank line; for example, the lower interface of Fig. 2 shows 4 text blocks.
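The partitioning in S102 can be sketched as grouping consecutive lines of the same kind. The function name `split_blocks` and the `("text" | "blank", lines)` representation are assumptions made for illustration; the patent does not prescribe a data structure.

```python
from __future__ import annotations


def split_blocks(text: str) -> list[tuple[str, list[str]]]:
    """Group consecutive lines into alternating 'text' and 'blank' blocks."""
    blocks: list[tuple[str, list[str]]] = []
    for line in text.split("\n"):
        kind = "text" if line.strip() else "blank"
        if blocks and blocks[-1][0] == kind:
            blocks[-1][1].append(line)   # extend the current run
        else:
            blocks.append((kind, [line]))  # start a new run
    return blocks


sample = "Title\n\n\nline one\nline two\n\nfooter"
for kind, lines in split_blocks(sample):
    print(kind, len(lines))
```

Keeping the blank runs as blocks of their own preserves the number of blank lines next to each text block, which the later steps also refer to.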
S103, counting the number of characters in each text block, determining the boundaries of the web page body text according to the character count of each text block, and identifying the body text of the web page to be identified according to those boundaries.
The character count is the total number of characters in one text block.
The boundaries of the web page body text are the starting position and the ending position of each body text part in the web page.
The web page body text is the body text part of the web page to be identified and does not include headings of any level in that page.
In this embodiment, the boundaries of the web page body text are determined by counting the number of characters in each text block, and the body text of the web page to be identified is identified according to those boundaries. The method is suitable for extracting body text from all types of web pages, the extraction process is simple, and the accuracy and generalization of web page body text extraction are greatly improved.
Further, on the basis of the above method embodiment, determining the boundaries of the web page body text according to the character count of each text block in S103 specifically includes:
calculating the character-count difference between the current text block and the previous text block from their respective character counts.
If the character-count difference is greater than a threshold, determining that the starting position of the current text block or the ending position of the previous text block is a boundary of the web page body text.
The character-count difference is the character count of the current text block minus the character count of the previous text block.
The threshold is determined according to the blank-block average or the text-block average.
The blank-block average is the average of the number of characters of each blank block.
The text-block average is the average of the character counts of the text blocks.
For example, if the blank-block average is n, the threshold may be 2n; or, if the text-block average is m, the threshold may be 2m.
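The 2n / 2m rule above can be sketched as follows. The function name `threshold_from_average` and the `factor` parameter are hypothetical; the source gives 2 only as an example multiplier, and it does not spell out exactly what is counted for a blank block, so the sketch simply takes a list of per-block counts.

```python
from statistics import mean


def threshold_from_average(counts: list[int], factor: float = 2.0) -> float:
    """Return factor times the average of the per-block counts."""
    return factor * mean(counts)


# Character counts of four text blocks: the average m is 85, so 2m is 170.
print(threshold_from_average([10, 20, 300, 10]))  # 170.0
```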
Calculating the character-count difference gives the change in character count from each text block to the next. A large difference, that is, a current text block with far more characters than the previous one, indicates that the previous text block is short text such as a title and that the current text block is body text.
Further, determining that the starting position of the current text block or the ending position of the previous text block is a boundary of the web page body text if the character-count difference is greater than the threshold specifically includes:
if the character-count difference is positive, determining that the starting position of the current text block is a starting boundary of the web page body text.
If the character-count difference is negative, determining that the ending position of the previous text block is an ending boundary of the web page body text.
Specifically, a positive character-count difference means that the current text block has far more characters than the previous text block, which indicates that the starting position of the current text block is a starting boundary of the web page body text.
Similarly, a negative character-count difference means that the current text block has far fewer characters than the previous text block, which indicates that the ending position of the previous text block is an ending boundary of the web page body text.
The starting and ending boundaries of the web page body text can thus be determined quickly from the sign of the character-count difference.
As shown in Fig. 2, the upper part is the HTML source code crawled by the crawler, and the lower part is the result after the HTML tags are removed. On this result, the blank-line blocks and the character-count differences between adjacent text blocks are counted. For example, suppose a blank block contains 5 blank lines, the text block above it has 10 characters, and the text block below it has 500 characters; the character-count difference between the two adjacent text blocks is then 490 (500 - 10 = 490). If the threshold is 300, the starting position of the lower text block is a starting boundary of the web page body text. The remaining starting and ending boundaries of the body text parts are found in the same way, finally yielding the web page body text.
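The sign rule and the worked example (10 characters, then 500 characters, threshold 300) can be sketched as below. The function name `find_boundaries` is hypothetical, and comparing the magnitude of a negative difference against the threshold is an assumption; the source only says the difference must be "greater than a threshold".

```python
from __future__ import annotations


def find_boundaries(char_counts: list[int], threshold: float) -> tuple[list[int], list[int]]:
    """Return (start_indices, end_indices) of candidate body text boundaries."""
    starts: list[int] = []
    ends: list[int] = []
    for i in range(1, len(char_counts)):
        diff = char_counts[i] - char_counts[i - 1]  # current minus previous
        if abs(diff) <= threshold:
            continue  # change too small to be a boundary
        if diff > 0:
            starts.append(i)      # current block starts a body text part
        else:
            ends.append(i - 1)    # previous block ends a body text part
    return starts, ends


# 500 - 10 = 490 > 300, so block 1 is a starting boundary.
print(find_boundaries([10, 500], 300))             # ([1], [])
print(find_boundaries([10, 500, 480, 12, 8], 300))  # ([1], [2])
```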
Further, on the basis of the above method embodiment, identifying the body text of the web page to be identified according to the boundaries in S103 specifically includes:
identifying the text between each starting boundary and the next ending boundary as a body text part;
combining all the body text parts to obtain the body text of the web page to be identified.
Specifically, the character-count differences of the text blocks are counted as shown in Fig. 3. The abscissa is the serial number of each text block, and each text block has two statistics: the first is the character-count difference, and the second is the number of adjacent blank lines. At 3 and 11 the character-count difference becomes large and the number of blank lines is large, so these positions are starts of body text. 7 and 14 are similar except that the character-count difference is negative, so these positions are ends of body text. Two body text parts can thus be found from Fig. 3: 3-7 is the first part and 11-14 is the second, and together they make up the body text of the whole web page.
The body text parts can be determined quickly from the starting and ending boundaries, and combining the parts quickly yields the body text of the web page to be identified.
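Pairing each starting boundary with the next ending boundary and merging the parts can be sketched as follows, using the block numbers read from Fig. 3 (starts at 3 and 11, ends at 7 and 14). The function name `extract_body` is hypothetical, and treating the Fig. 3 serial numbers as list indices is an assumption made for illustration.

```python
from __future__ import annotations


def extract_body(blocks: list[str], starts: list[int], ends: list[int]) -> str:
    """Join the blocks lying between each start boundary and the next end boundary."""
    parts: list[str] = []
    sorted_ends = sorted(ends)
    for start in sorted(starts):
        # The next ending boundary at or after this start.
        end = next((e for e in sorted_ends if e >= start), None)
        if end is None:
            continue  # unmatched start boundary; this sketch skips it
        parts.append("\n".join(blocks[start:end + 1]))
    # Parts are combined in top-down order.
    return "\n".join(parts)


blocks = [f"block {i}" for i in range(16)]
body = extract_body(blocks, starts=[3, 11], ends=[7, 14])
print(body.split("\n")[0], "...", body.split("\n")[-1])  # block 3 ... block 14
```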
Further, on the basis of the above method embodiment, the method for identifying and processing web page body text further includes:
S104, if the numbers of starting boundaries and ending boundaries of the web page body text are equal and the starting boundaries and ending boundaries alternate, determining that the starting boundaries and ending boundaries of the web page body text are correctly identified.
Specifically, if the numbers of identified starting boundaries and ending boundaries are not equal, or two or more starting boundaries appear consecutively, or two or more ending boundaries appear consecutively, the identified starting and ending boundaries of the web page body text are wrong. The identified boundaries must therefore be confirmed correct in order to obtain the correct web page body text.
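The consistency check of S104 can be sketched as below. The function name `boundaries_valid` is hypothetical, and requiring the sequence to begin with a starting boundary is an assumption; the source only states that the counts must match and that the two kinds of boundary must alternate.

```python
from __future__ import annotations


def boundaries_valid(starts: list[int], ends: list[int]) -> bool:
    """True if starts and ends are equal in number and strictly alternate."""
    if len(starts) != len(ends):
        return False
    # Tag each boundary (0 = start, 1 = end) and order by position; a start
    # sorts before an end at the same block index.
    events = sorted([(p, 0) for p in starts] + [(p, 1) for p in ends])
    expected = 0
    for _, kind in events:
        if kind != expected:
            return False  # two starts or two ends in a row
        expected = 1 - expected
    return True


print(boundaries_valid([3, 11], [7, 14]))  # True
print(boundaries_valid([3, 5], [7, 14]))   # False: two starts in a row
```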
Further, on the basis of the above method embodiment, S101 specifically includes:
crawling the web page source code of the web page to be identified with a web crawler, identifying all web page tags in the source code according to preset tags, and clearing those tags to obtain the web page text that includes blank lines.
Specifically, a web crawler first crawls the HTML source code of the web page. The tags are then removed from the HTML, leaving the text of the page together with all the blank lines that remain after tag removal. Next, the distribution of blank lines, that is, the blank-line blocks, is counted; for example, 5 consecutive blank lines form one blank-line block. At the same time, the number of characters between adjacent blank-line blocks, that is, in each text block, is counted, and the character-count difference between each two adjacent text blocks is calculated. Finally, the starting and ending positions of the web page body text are searched for: a threshold is set, and positions with many blank lines and a large character-count difference are selected as the starting or ending boundaries of body text parts. If the difference is positive the boundary is a starting boundary; otherwise it is an ending boundary. All body text parts are combined in top-down order as the body text of the web page.
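The whole procedure just described can be sketched end to end as follows, starting from HTML that has already been fetched. All names (`extract_main_text`, `TAG_RE`) are hypothetical, the threshold is passed in by the caller, and two simplifications are assumed: tags are stripped line by line with a regular expression, and only the character-count difference is used, not the number of adjacent blank lines.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # hypothetical single-line tag pattern


def extract_main_text(html: str, threshold: int) -> str:
    # 1. Remove tags line by line; tag-only lines become blank lines.
    lines = [TAG_RE.sub("", line).strip() for line in html.split("\n")]

    # 2. Group consecutive non-blank lines into text blocks.
    blocks, current = [], []
    for line in lines:
        if line:
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    # 3. Count the characters in each text block.
    counts = [sum(len(line) for line in block) for block in blocks]

    # 4. A large positive difference opens a body text part; a large
    #    negative difference closes it at the previous block.
    parts, start = [], None
    for i in range(1, len(counts)):
        diff = counts[i] - counts[i - 1]
        if diff > threshold and start is None:
            start = i
        elif diff < -threshold and start is not None:
            parts.append((start, i - 1))
            start = None

    # 5. Combine the parts in top-down order.
    return "\n".join(
        "\n".join(line for block in blocks[s:e + 1] for line in block)
        for s, e in parts
    )


html = (
    "<html>\n<h1>Title</h1>\n\n"
    "<p>" + "a" * 400 + "</p>\n<p>" + "b" * 380 + "</p>\n\n"
    "<div>footer</div>\n</html>"
)
print(len(extract_main_text(html, threshold=300)))  # 781
```

A body text part that runs to the very end of the page never sees a closing negative difference and is dropped by this sketch; handling that case is left open here.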
The method provided by this embodiment does not depend on DOM parsing; it finds the boundaries of the body text from statistics on blank areas and text density, which greatly improves the generalization of body text extraction and gives higher accuracy.
Fig. 4 shows a schematic structural diagram of a device for identifying and processing web page body text according to this embodiment. The device includes a tag clearing module 401, a text blocking module 402, and a boundary determining module 403, wherein:
the tag clearing module 401 is configured to acquire the web page source code of a web page to be identified and clear all web page tags in the source code to obtain a web page text that includes blank lines;
the text blocking module 402 is configured to partition the web page text at the blank lines to obtain a plurality of text blocks, the text blocks being separated by blank lines;
the boundary determining module 403 is configured to count the number of characters in each text block, determine the boundaries of the web page body text according to the character count of each text block, and identify the body text of the web page to be identified according to those boundaries.
Specifically, the tag clearing module 401 acquires the web page source code of the web page to be identified and clears all web page tags in the source code to obtain a web page text that includes blank lines; the text blocking module 402 partitions the web page text at the blank lines to obtain a plurality of text blocks separated by blank lines; the boundary determining module 403 counts the number of characters in each text block, determines the boundaries of the web page body text according to the character count of each text block, and identifies the body text of the web page to be identified according to those boundaries.
In this embodiment, the boundaries of the web page body text are determined by counting the number of characters in each text block, and the body text of the web page to be identified is identified according to those boundaries. The device is suitable for extracting body text from all types of web pages, the extraction process is simple, and the accuracy and generalization of web page body text extraction are greatly improved.
Further, on the basis of the above device embodiment, the boundary determining module 403 is specifically configured to:
calculate the character-count difference between the current text block and the previous text block from their respective character counts;
if the character-count difference is greater than a threshold, determine that the starting position of the current text block or the ending position of the previous text block is a boundary of the web page body text.
Further, on the basis of the above device embodiment, the boundary determining module 403 is specifically configured to:
if the character-count difference is positive, determine that the starting position of the current text block is a starting boundary of the web page body text;
if the character-count difference is negative, determine that the ending position of the previous text block is an ending boundary of the web page body text.
Further, on the basis of the above device embodiment, the boundary determining module 403 is specifically configured to:
identify the text between each starting boundary and the next ending boundary as a body text part;
combine all the body text parts to obtain the body text of the web page to be identified.
Further, on the basis of the above device embodiment, the device for identifying and processing web page body text further includes:
a boundary judging module, configured to determine that the starting boundaries and ending boundaries of the web page body text are correctly identified if the numbers of starting boundaries and ending boundaries are equal and the starting boundaries and ending boundaries alternate.
Further, on the basis of the above device embodiment, the threshold is determined according to the blank-block average or the text-block average;
wherein the blank-block average is the average of the number of characters of each blank block;
the text-block average is the average of the character counts of the text blocks.
Further, on the basis of the above device embodiment, the tag clearing module 401 is specifically configured to:
crawl the web page source code of the web page to be identified with a web crawler, identify all web page tags in the source code according to preset tags, and clear those tags to obtain the web page text that includes blank lines.
The device for identifying and processing the text of the web page in this embodiment may be used to execute the method embodiment, and its principle and technical effects are similar, and are not described herein again.
Referring to Fig. 5, the electronic device includes: a processor 501, a memory 502, and a bus 503;
wherein:
the processor 501 and the memory 502 complete communication with each other via the bus 503;
the processor 501 is configured to invoke the program instructions in the memory 502 to perform the methods provided by the method embodiments described above.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the method embodiments described above.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments.
The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (14)

1. A method for identifying and processing webpage body text, characterized by comprising the following steps:
acquiring the webpage source code of a webpage to be identified, and removing all webpage tags from the webpage source code to obtain webpage text that includes blank lines;
dividing the webpage text into blocks at the blank lines to obtain a plurality of text blocks, the text blocks being separated from one another by blank lines;
counting the number of characters in each text block, determining the boundaries of the webpage body text according to the number of characters in each text block, and identifying the body text of the webpage to be identified according to the boundaries of the webpage body text;
wherein determining the boundaries of the webpage body text according to the number of characters in each text block specifically comprises:
calculating, from the number of characters in the current text block and the number of characters in the previous text block, the difference between the two character counts;
if the difference in character counts is greater than a threshold, determining that the start position of the current text block or the end position of the previous text block is a boundary of the webpage body text; and determining the start boundaries and end boundaries of the webpage body text according to whether the difference in character counts is positive or negative.
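The boundary test recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `split_blocks`, `char_count`, and `find_boundaries` are hypothetical, counting only non-whitespace characters is an assumption, and comparing the magnitude of the difference with the threshold (so that a large negative difference can mark an end boundary) is one possible reading of the claim.

```python
import re


def split_blocks(page_text):
    """Split tag-free webpage text into text blocks at blank lines."""
    # One or more whitespace-only lines separate two blocks.
    blocks = re.split(r"\n\s*\n", page_text.strip())
    return [b for b in blocks if b.strip()]


def char_count(block):
    """Count the non-whitespace characters of a block (an assumption)."""
    return sum(1 for ch in block if not ch.isspace())


def find_boundaries(blocks, threshold):
    """Return (kind, block_index) pairs for each detected boundary.

    ("start", i): body text begins at the start of block i.
    ("end", i):   body text ends at the end of block i.
    """
    counts = [char_count(b) for b in blocks]
    boundaries = []
    for i in range(1, len(counts)):
        diff = counts[i] - counts[i - 1]  # current block minus previous block
        if abs(diff) <= threshold:
            continue  # no sharp change in block size, so no boundary here
        if diff > 0:
            boundaries.append(("start", i))    # start of the current block
        else:
            boundaries.append(("end", i - 1))  # end of the previous block
    return boundaries


# Two short navigation blocks, two long body blocks, one short footer block.
sample = "Home\n\nNews\n\n" + "a" * 200 + "\n\n" + "b" * 180 + "\n\nContact"
blocks = split_blocks(sample)
print(find_boundaries(blocks, threshold=50))
```

In this sample the jump from a 4-character block to a 200-character block marks a start boundary, and the drop from 180 characters to 7 marks an end boundary; the small change between the two long blocks stays below the threshold and is ignored.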
2. The method for identifying and processing webpage body text according to claim 1, wherein, if the difference in character counts is greater than a threshold, determining that the start position of the current text block or the end position of the previous text block is a boundary of the webpage body text comprises:
if the difference in character counts is positive, determining that the start position of the current text block is a start boundary of the webpage body text;
and if the difference in character counts is negative, determining that the end position of the previous text block is an end boundary of the webpage body text.
3. The method for identifying and processing webpage body text according to claim 2, wherein identifying the body text of the webpage to be identified according to the boundaries of the webpage body text specifically comprises:
recognizing the characters between each start boundary of the webpage body text and the next end boundary as a body-text part;
and combining all the body-text parts to obtain the body text of the webpage to be identified.
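The pairing and merging step of claim 3 might look like the sketch below. It assumes boundaries are represented as hypothetical `(kind, block_index)` pairs, where `kind` is `"start"` or `"end"`; that representation, the function name `extract_body`, and the choice to ignore a second start boundary that arrives before an end boundary are illustrative assumptions only.

```python
def extract_body(blocks, boundaries):
    """Join the blocks lying between each start boundary and the next end boundary."""
    parts = []
    start = None  # index of the block where the currently open body part begins
    for kind, index in boundaries:
        if kind == "start" and start is None:
            start = index
        elif kind == "end" and start is not None:
            # The body part covers blocks start..index inclusive.
            parts.append("\n\n".join(blocks[start:index + 1]))
            start = None
    # Combine all body parts into the final body text.
    return "\n\n".join(parts)


blocks = ["Home", "Long paragraph one.", "Long paragraph two.", "Footer"]
boundaries = [("start", 1), ("end", 2)]
print(extract_body(blocks, boundaries))
```

A start boundary that is never followed by an end boundary contributes nothing in this sketch; a real implementation would need to decide how to treat such unmatched boundaries.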
4. The method for identifying and processing webpage body text according to claim 2, wherein the method further comprises:
if the start boundaries and the end boundaries of the webpage body text are equal in number and appear alternately, determining that the start boundaries and the end boundaries of the webpage body text have been correctly identified.
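The consistency check of claim 4 can be illustrated with a short sketch. It assumes the boundaries are held in document order as hypothetical `(kind, block_index)` pairs with `kind` equal to `"start"` or `"end"`; the function name `boundaries_look_valid` is likewise an assumption.

```python
def boundaries_look_valid(boundaries):
    """Check that start and end boundaries are equal in number and alternate."""
    kinds = [kind for kind, _ in boundaries]
    same_count = kinds.count("start") == kinds.count("end")
    # Alternation: no two neighbouring boundaries are of the same kind.
    alternating = all(a != b for a, b in zip(kinds, kinds[1:]))
    return same_count and alternating


print(boundaries_look_valid([("start", 2), ("end", 3)]))
print(boundaries_look_valid([("start", 2), ("start", 4), ("end", 5), ("end", 6)]))
```

The sketch tests only the two conditions named in the claim. It does not require the sequence to begin with a start boundary, and an empty list passes trivially; whether those cases should count as valid is left open here.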
5. The method for identifying and processing webpage body text according to claim 1, wherein the threshold is determined according to a blank-block average or a text-block average;
wherein the blank-block average is the mean of the character counts of the individual blank blocks;
and the text-block average is the mean of the character counts of the individual text blocks.
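One way to derive a threshold from the text-block average is sketched below. The claim states only that the threshold is determined according to the average; using the mean directly, the hypothetical `scale` factor, the function name `threshold_from_blocks`, and counting only non-whitespace characters are all assumptions. The blank-block variant is not covered by this sketch.

```python
from statistics import mean


def threshold_from_blocks(blocks, scale=1.0):
    """Derive a threshold from the mean character count of the text blocks."""
    # Count non-whitespace characters per block (an assumption).
    counts = [sum(1 for ch in b if not ch.isspace()) for b in blocks]
    if not counts:
        return 0.0  # no blocks, so no meaningful average
    return scale * mean(counts)


blocks = ["abcd", "ab cd ef", "x" * 20]  # 4, 6 and 20 characters
print(threshold_from_blocks(blocks))
```

Tying the threshold to the page's own block sizes lets it adapt to pages with long or short paragraphs, rather than relying on a single fixed constant.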
6. The method for identifying and processing webpage body text according to claim 1, wherein acquiring the webpage source code of the webpage to be identified and removing all webpage tags from the webpage source code to obtain webpage text that includes blank lines comprises:
crawling the webpage source code of the webpage to be identified with a web crawler, identifying all webpage tags in the webpage source code according to preset tags, and removing the webpage tags to obtain the webpage text that includes blank lines.
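The tag-removal step of claim 6 can be approximated as shown below. This is a rough sketch under several assumptions: the crawling step is omitted so that the example runs offline, the preset `PRESET_DROP_WITH_CONTENT` and the function name `strip_tags` are hypothetical, and regular expressions are used only for brevity (they mishandle edge cases such as `>` inside attribute values, where a real HTML parser would be more robust).

```python
import re

# Hypothetical preset: elements whose content is never visible body text.
PRESET_DROP_WITH_CONTENT = ("script", "style")


def strip_tags(html_source):
    """Remove tags from webpage source while keeping the original line breaks."""
    text = html_source
    for name in PRESET_DROP_WITH_CONTENT:
        # Drop the whole element, including its content.
        text = re.sub(rf"<{name}\b.*?</{name}\s*>", "", text,
                      flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)  # HTML comments
    text = re.sub(r"<[^>]+>", "", text)                      # any remaining tag
    # Lines that held only tags are now whitespace-only; make them truly blank.
    return "\n".join(line.strip() for line in text.splitlines())


html = (
    "<html>\n<body>\n<div>Home</div>\n<p>\nBody text here.\n</p>\n"
    "<script>var a = 1;</script>\n</body>\n</html>"
)
page_text = strip_tags(html)
print(page_text)
```

Because line breaks are preserved, lines that contained only markup turn into the blank lines that the later blocking step relies on.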
7. A device for identifying and processing webpage body text, characterized by comprising:
a tag clearing module, configured to acquire the webpage source code of a webpage to be identified and remove all webpage tags from the webpage source code to obtain webpage text that includes blank lines;
a text block module, configured to divide the webpage text into blocks at the blank lines to obtain a plurality of text blocks, the text blocks being separated from one another by blank lines;
a boundary determining module, configured to count the number of characters in each text block, determine the boundaries of the webpage body text according to the number of characters in each text block, and identify the body text of the webpage to be identified according to the boundaries of the webpage body text;
wherein the boundary determining module is specifically configured to:
calculate, from the number of characters in the current text block and the number of characters in the previous text block, the difference between the two character counts;
if the difference in character counts is greater than a threshold, determine that the start position of the current text block or the end position of the previous text block is a boundary of the webpage body text; and determine the start boundaries and end boundaries of the webpage body text according to whether the difference in character counts is positive or negative.
8. The device for identifying and processing webpage body text according to claim 7, wherein the boundary determining module is specifically configured to:
if the difference in character counts is positive, determine that the start position of the current text block is a start boundary of the webpage body text;
and if the difference in character counts is negative, determine that the end position of the previous text block is an end boundary of the webpage body text.
9. The device for identifying and processing webpage body text according to claim 8, wherein the boundary determining module is specifically configured to:
recognize the characters between each start boundary of the webpage body text and the next end boundary as a body-text part;
and combine all the body-text parts to obtain the body text of the webpage to be identified.
10. The device for identifying and processing webpage body text according to claim 8, wherein the device further comprises:
a boundary judging module, configured to determine that the start boundaries and the end boundaries of the webpage body text have been correctly identified if the start boundaries and the end boundaries are equal in number and appear alternately.
11. The device for identifying and processing webpage body text according to claim 7, wherein the threshold is determined according to a blank-block average or a text-block average;
wherein the blank-block average is the mean of the character counts of the individual blank blocks;
and the text-block average is the mean of the character counts of the individual text blocks.
12. The device for identifying and processing webpage body text according to claim 7, wherein the tag clearing module is specifically configured to:
crawl the webpage source code of the webpage to be identified with a web crawler, identify all webpage tags in the webpage source code according to preset tags, and remove the webpage tags to obtain the webpage text that includes blank lines.
13. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for identifying and processing webpage body text as claimed in any one of claims 1 to 6.
14. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for identifying and processing webpage body text as claimed in any one of claims 1 to 6.
CN201910945459.1A 2019-09-30 2019-09-30 Webpage text recognition processing method and device Active CN110795933B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910945459.1A CN110795933B (en) 2019-09-30 2019-09-30 Webpage text recognition processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910945459.1A CN110795933B (en) 2019-09-30 2019-09-30 Webpage text recognition processing method and device

Publications (2)

Publication Number Publication Date
CN110795933A CN110795933A (en) 2020-02-14
CN110795933B true CN110795933B (en) 2023-10-31

Family

ID=69438918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910945459.1A Active CN110795933B (en) 2019-09-30 2019-09-30 Webpage text recognition processing method and device

Country Status (1)

Country Link
CN (1) CN110795933B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112230989B (en) * 2020-12-14 2021-03-12 北京智慧星光信息技术有限公司 Webpage channel navigation bar extraction method, system, electronic equipment and storage medium
CN113537091B (en) * 2021-07-20 2024-05-03 东莞盟大集团有限公司 Webpage text recognition method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598577A (en) * 2015-01-14 2015-05-06 晶赞广告(上海)有限公司 Extraction method for webpage text
CN105022803A (en) * 2015-07-01 2015-11-04 广州市万隆证券咨询顾问有限公司 Method and system for extracting text content of webpage
CN105183801A (en) * 2015-08-25 2015-12-23 北京信息科技大学 Web page body text extraction method and apparatus
CN105512225A (en) * 2015-11-30 2016-04-20 北大方正集团有限公司 Method and device extracting main content from webpage
WO2017080090A1 (en) * 2015-11-14 2017-05-18 孙燕群 Extraction and comparison method for text of webpage
CN108763591A (en) * 2018-06-21 2018-11-06 湖南星汉数智科技有限公司 A kind of webpage context extraction method, device, computer installation and computer readable storage medium
CN109086361A (en) * 2018-07-20 2018-12-25 北京开普云信息科技有限公司 A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint
CN109543126A (en) * 2018-11-19 2019-03-29 四川长虹电器股份有限公司 Web page text information extracting method based on block text accounting

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598577A (en) * 2015-01-14 2015-05-06 晶赞广告(上海)有限公司 Extraction method for webpage text
CN105022803A (en) * 2015-07-01 2015-11-04 广州市万隆证券咨询顾问有限公司 Method and system for extracting text content of webpage
CN105183801A (en) * 2015-08-25 2015-12-23 北京信息科技大学 Web page body text extraction method and apparatus
WO2017080090A1 (en) * 2015-11-14 2017-05-18 孙燕群 Extraction and comparison method for text of webpage
CN105512225A (en) * 2015-11-30 2016-04-20 北大方正集团有限公司 Method and device extracting main content from webpage
CN108763591A (en) * 2018-06-21 2018-11-06 湖南星汉数智科技有限公司 A kind of webpage context extraction method, device, computer installation and computer readable storage medium
CN109086361A (en) * 2018-07-20 2018-12-25 北京开普云信息科技有限公司 A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint
CN109543126A (en) * 2018-11-19 2019-03-29 四川长虹电器股份有限公司 Web page text information extracting method based on block text accounting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
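廖建军. Automatic extraction of webpage body text based on tag style and density model. 情报科学 (Information Science). 2018, (07), full text. *
沈劲枝; 寇文波; 田晨耕. Web archive body-text collection based on feature positioning and boundary prediction. 现代图书情报技术 (New Technology of Library and Information Service). 2009, (12), full text. *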

Also Published As

Publication number Publication date
CN110795933A (en) 2020-02-14

Similar Documents

Publication Publication Date Title
US8819028B2 (en) System and method for web content extraction
CN102541874B (en) Webpage text content extracting method and device
CN104598577B (en) A kind of extracting method of Web page text
CN108920434B (en) Universal webpage theme content extraction method and system
CN110390038B (en) Page blocking method, device and equipment based on DOM tree and storage medium
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
CN105320734B (en) A kind of web page core content extracting method
CN105630941A (en) Statistics and webpage structure based Wen body text content extraction method
EP3349124A1 (en) Method and system for generating parsed document from digital document
CN110795933B (en) Webpage text recognition processing method and device
CN103605691A (en) Device and method used for processing issued contents in social network
CN111797356B (en) Webpage form information extraction method and device
CN109144513B (en) Method for automatically extracting list page
CN112650910A (en) Method, device, equipment and storage medium for determining website update information
CN106227770A (en) A kind of intelligentized news web page information extraction method
CN104615728B (en) A kind of webpage context extraction method and device
CN108694192B (en) Webpage type judging method and device
US20130167018A1 (en) Methods and Devices for Extracting Document Structure
CN112232075A (en) Article release time identification method based on time format and webpage element characteristics
CN109472020A (en) A kind of feature alignment Chinese word cutting method
CN103455572A (en) Method and device for acquiring movie and television subjects from web pages
CN106897287B (en) Webpage release time extraction method and device for webpage release time extraction
CN115391711B (en) Webpage text information extraction method, device, equipment and medium
CN115238078A (en) Webpage information extraction method, device, equipment and storage medium
CN112559929B (en) Method, electronic device and medium for extracting webpage target information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant after: Qianxin Technology Group Co.,Ltd.

Applicant after: Qianxin Wangshen information technology (Beijing) Co.,Ltd.

Address before: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant before: Qianxin Technology Group Co.,Ltd.

Applicant before: LEGENDSEC INFORMATION TECHNOLOGY (BEIJING) Inc.

GR01 Patent grant