CN110795933B - Webpage text recognition processing method and device - Google Patents
Webpage text recognition processing method and device
- Publication number
- CN110795933B (CN201910945459.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- webpage
- boundary
- characters
- web page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Transfer Between Computers (AREA)
- Document Processing Apparatus (AREA)
Abstract
The embodiments of the invention disclose a method and a device for identifying and processing web page body text. The method comprises: acquiring the source code of a web page to be identified, and removing all web page tags from the source code to obtain web page text that includes blank lines; partitioning the web page text at the blank lines to obtain a plurality of text blocks, the text blocks being separated by blank lines; counting the characters in each text block, determining the boundaries of the body text according to the character count of each text block, and identifying the body text of the web page to be identified according to those boundaries. By determining the boundaries of the body text from the character count of each text block and identifying the body text according to those boundaries, the embodiments are intended to apply to web pages of all types, keep the extraction process simple, and improve the accuracy and generalization of body text extraction.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for identifying and processing a webpage text.
Background
Current web page body text extraction mainly uses the DOM (Document Object Model) to parse the HTML (HyperText Markup Language) source code of a web page into a tree structure, and then extracts the body text by analyzing that tree according to preset rules. However, the HTML structure of web pages is highly varied and every page is designed differently; for example, the page structures of e-commerce sites and news sites differ considerably.
As a result, existing body text extraction methods are unstable: they extract some types of web pages accurately and others inaccurately, may pick up peripheral content from the edges of the page, and generalize poorly.
Disclosure of Invention
In view of the above problems of existing methods, embodiments of the present invention provide a method and a device for identifying and processing web page body text.
In a first aspect, an embodiment of the present invention provides a method for identifying and processing web page body text, including:
acquiring the source code of a web page to be identified, and removing all web page tags from the source code to obtain web page text that includes blank lines;
partitioning the web page text at the blank lines to obtain a plurality of text blocks, the text blocks being separated by blank lines;
counting the characters in each text block, determining the boundaries of the body text according to the character count of each text block, and identifying the body text of the web page to be identified according to those boundaries.
Optionally, determining the boundaries of the body text according to the character count of each text block specifically includes:
calculating a character-count difference between the current text block and the previous text block from their respective character counts;
if the character-count difference is greater than a threshold, determining that the start position of the current text block or the end position of the previous text block is a boundary of the body text.
Optionally, if the character-count difference is greater than a threshold, determining that the start position of the current text block or the end position of the previous text block is a boundary of the body text specifically includes:
if the character-count difference is positive, determining that the start position of the current text block is a start boundary of the body text;
and if the character-count difference is negative, determining that the end position of the previous text block is an end boundary of the body text.
Optionally, identifying the body text of the web page to be identified according to the boundaries of the body text specifically includes:
identifying the text between each start boundary and the next end boundary as a body text part;
and merging all body text parts to obtain the body text of the web page to be identified.
Optionally, the method for identifying and processing web page body text further includes:
if the numbers of start boundaries and end boundaries of the body text are equal and the start boundaries and end boundaries alternate, determining that the start and end boundaries of the body text are correctly identified.
Optionally, the threshold is determined from the blank-block average or the text-block average;
wherein the blank-block average is the mean of the character counts of the blank blocks;
and the text-block average is the mean of the character counts of the text blocks.
Optionally, acquiring the source code of the web page to be identified and removing all web page tags from the source code to obtain web page text that includes blank lines specifically includes:
crawling the source code of the web page to be identified with a web crawler, identifying all web page tags in the source code according to preset tags, and removing the tags to obtain web page text that includes blank lines.
In a second aspect, an embodiment of the present invention further provides a device for identifying and processing web page body text, including:
a tag clearing module, configured to acquire the source code of a web page to be identified and remove all web page tags from the source code to obtain web page text that includes blank lines;
a text blocking module, configured to partition the web page text at the blank lines to obtain a plurality of text blocks, the text blocks being separated by blank lines;
a boundary determining module, configured to count the characters in each text block, determine the boundaries of the body text according to the character count of each text block, and identify the body text of the web page to be identified according to those boundaries.
Optionally, the boundary determining module is specifically configured to:
calculate a character-count difference between the current text block and the previous text block from their respective character counts;
if the character-count difference is greater than a threshold, determine that the start position of the current text block or the end position of the previous text block is a boundary of the body text.
Optionally, the boundary determining module is specifically configured to:
if the character-count difference is positive, determine that the start position of the current text block is a start boundary of the body text;
and if the character-count difference is negative, determine that the end position of the previous text block is an end boundary of the body text.
Optionally, the boundary determining module is specifically configured to:
identify the text between each start boundary and the next end boundary as a body text part;
and merge all body text parts to obtain the body text of the web page to be identified.
Optionally, the device for identifying and processing web page body text further includes:
a boundary judging module, configured to determine that the start and end boundaries of the body text are correctly identified if the numbers of start boundaries and end boundaries are equal and the start boundaries and end boundaries alternate.
Optionally, the threshold is determined from the blank-block average or the text-block average;
wherein the blank-block average is the mean of the character counts of the blank blocks;
and the text-block average is the mean of the character counts of the text blocks.
Optionally, the tag clearing module is specifically configured to:
crawl the source code of the web page to be identified with a web crawler, identify all web page tags in the source code according to preset tags, and remove the tags to obtain web page text that includes blank lines.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, which are called by the processor to perform the method described above.
In a fourth aspect, embodiments of the present invention also propose a non-transitory computer-readable storage medium storing a computer program, which causes the computer to carry out the above-mentioned method.
As the above technical solution shows, the boundaries of the body text are determined by counting the characters in each text block, and the body text of the web page to be identified is identified according to those boundaries. The method is intended to apply to web pages of all types, keeps the extraction process simple, and improves the accuracy and generalization of body text extraction.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flow chart of a method for identifying and processing web page body text according to an embodiment of the present invention;
Fig. 2 is a schematic diagram comparing web page source code before and after the web page tags are removed, according to an embodiment of the present invention;
Fig. 3 is a diagram comparing the character-count difference and the number of blank lines for each text block, according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for identifying and processing web page body text according to an embodiment of the present invention;
Fig. 5 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Fig. 1 shows a flow chart of a method for identifying and processing a web page text, which includes:
S101: acquiring the source code of a web page to be identified, and removing all web page tags from the source code to obtain web page text that includes blank lines.
The web page to be identified is a web page whose body text needs to be identified.
The web page source code is the source code in which the web page is written; the upper panel of Fig. 2 shows such source code.
A web page tag is a tag in the web page source code that marks up different parts of the source, such as <em>.
The web page text is what remains of the web page source code once the tags have been removed, i.e. only the text and the resulting blank lines; the lower panel of Fig. 2 shows the web page text.
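The tag-removal part of S101 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the name strip_tags and the regular expression are hypothetical, and the sketch treats every `<...>` run within a single line as a tag instead of matching against a list of preset tags. Lines that contained only tags become empty, which is what produces the blank lines used in the later steps.

```python
import re

# Assumption: a tag is any "<...>" run that opens and closes on one line.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html: str) -> str:
    """Remove tags line by line, keeping lines that end up empty."""
    cleaned = []
    for line in html.splitlines():
        # Lines made up only of tags collapse to "" and act as blank lines.
        cleaned.append(TAG_PATTERN.sub("", line).strip())
    return "\n".join(cleaned)
```

Tags that span several lines, and the contents of script or style elements, are not handled by this sketch.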
S102: partitioning the web page text at the blank lines to obtain a plurality of text blocks, the text blocks being separated by blank lines.
A text block is a run of consecutive lines of text containing no blank line; the lower panel of Fig. 2, for example, contains 4 text blocks.
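One way to sketch the partitioning of S102 is to group consecutive lines by whether they are blank. The function name split_runs and the ("text" / "blank", lines) return shape are assumptions made for illustration; keeping the blank runs as well as the text runs makes the size of each blank-line block available to later steps.

```python
from itertools import groupby


def split_runs(text: str):
    """Split text into alternating runs of blank and non-blank lines.

    Returns a list of (kind, lines) tuples, where kind is "text" or "blank".
    """
    runs = []
    # groupby collects consecutive lines that share the same "is blank" flag.
    for is_blank, group in groupby(text.splitlines(), key=lambda line: not line.strip()):
        runs.append(("blank" if is_blank else "text", list(group)))
    return runs
```

Each "text" run corresponds to one text block, and the length of each "blank" run is the number of blank lines separating two text blocks.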
S103: counting the characters in each text block, determining the boundaries of the body text according to the character count of each text block, and identifying the body text of the web page to be identified according to those boundaries.
The character count is the total number of characters in one text block.
The boundaries of the body text are the start and end positions of the body text parts of the web page.
The body text is the main text of the web page to be identified and does not include headings of any level in the web page.
This embodiment determines the boundaries of the body text by counting the characters in each text block and identifies the body text of the web page to be identified according to those boundaries. It is intended to apply to web pages of all types, keeps the extraction process simple, and improves the accuracy and generalization of body text extraction.
Further, on the basis of the above method embodiment, determining the boundaries of the body text according to the character count of each text block in S103 specifically includes:
calculating a character-count difference between the current text block and the previous text block from their respective character counts.
If the character-count difference is greater than a threshold, determining that the start position of the current text block or the end position of the previous text block is a boundary of the body text.
The character-count difference is the character count of the current text block minus the character count of the previous text block.
The threshold is determined from the blank-block average or the text-block average.
The blank-block average is the mean of the character counts of the blank blocks.
The text-block average is the mean of the character counts of the text blocks.
For example, if the blank-block average is n, the threshold may be 2n; or, if the text-block average is m, the threshold may be 2m.
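The threshold rule can be sketched as a small helper. The source only gives 2n and 2m as examples, so the multiplier is exposed as a parameter; the function name and the plain-list input are assumptions. The sketch averages whatever per-block values it is given; for blank blocks one possible reading is the number of blank lines per block, which is an interpretation rather than something the source states.

```python
def threshold_from_average(values, factor=2.0):
    """Return factor times the mean of the per-block values.

    `values` holds one number per block (for example, character counts of
    the text blocks). The default factor of 2 mirrors the 2n / 2m example.
    """
    if not values:
        raise ValueError("at least one block is required")
    return factor * (sum(values) / len(values))
```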
Calculating the character-count difference gives the change in character count between adjacent text blocks. A large positive difference means that the current text block has far more characters than the previous one; this suggests that the previous block is a short element such as a title and that the current block belongs to the body text.
Further, if the character-count difference is greater than a threshold, determining that the start position of the current text block or the end position of the previous text block is a boundary of the body text specifically includes:
if the character-count difference is positive, determining that the start position of the current text block is a start boundary of the body text.
If the character-count difference is negative, determining that the end position of the previous text block is an end boundary of the body text.
Specifically, a positive character-count difference means that the current text block has far more characters than the previous text block, which indicates that the start position of the current text block is a start boundary of the body text.
Similarly, a negative character-count difference means that the current text block has far fewer characters than the previous text block, which indicates that the end position of the previous text block is an end boundary of the body text.
The start and end boundaries of the body text can thus be determined quickly from the sign of the character-count difference.
As shown in Fig. 2, the upper part is the HTML source code fetched by the crawler, and the lower part is the result after the HTML tags are removed. On this result, the blank-line blocks and the character counts of the adjacent text blocks are tallied. For example, suppose a blank block has 5 blank lines, the text block above it has 10 characters, and the text block below it has 500 characters; the character-count difference between the two adjacent text blocks is then 490 (500 - 10 = 490). If the threshold is 300, the start position of the lower text block is a start boundary of the body text. The remaining start and end boundaries of the body text parts are found in the same way, which finally yields the body text of the web page.
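The boundary rule and the worked numbers above can be sketched as follows. The function name find_boundaries and the (kind, block index) return shape are assumptions. The sketch also assumes that the magnitude of the difference is what is compared with the threshold, since the source applies the same threshold test to negative differences.

```python
def find_boundaries(char_counts, threshold):
    """Locate start and end boundaries from per-block character counts.

    Returns a list of (kind, block_index) tuples: ("start", i) marks the
    start of block i, and ("end", i) marks the end of block i.
    """
    boundaries = []
    for i in range(1, len(char_counts)):
        diff = char_counts[i] - char_counts[i - 1]
        if abs(diff) <= threshold:
            continue  # change too small to count as a boundary
        if diff > 0:
            boundaries.append(("start", i))      # current block begins a body part
        else:
            boundaries.append(("end", i - 1))    # previous block ends a body part
    return boundaries
```

With the figures from the example, find_boundaries([10, 500], 300) reports a start boundary at block 1, because 500 - 10 = 490 exceeds the threshold of 300.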
Further, on the basis of the above method embodiment, identifying the body text of the web page to be identified according to the boundaries of the body text in S103 specifically includes:
identifying the text between each start boundary and the next end boundary as a body text part;
and merging all body text parts to obtain the body text of the web page to be identified.
Specifically, the character-count differences between text blocks are tallied as shown in Fig. 3. The abscissa gives the sequence number of each text block, and each text block has two statistics: the first is the character-count difference and the second is the number of adjacent blank lines. At blocks 3 and 11 the character-count difference becomes large and the number of blank lines is also large, which marks the beginning of a body text part. Blocks 7 and 14 are similar except that the character-count difference is negative, which marks the end of a body text part. Fig. 3 therefore reveals two body text parts, blocks 3-7 and blocks 11-14, which together make up the body text of the whole web page.
The body text parts can be determined quickly from the start and end boundaries, and merging them quickly yields the body text of the web page to be identified.
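Pairing each start boundary with the next end boundary and merging the parts might look like the sketch below. The name merge_body_parts is hypothetical, blocks are assumed to be plain strings, and the block numbers from Fig. 3 are treated as list indices purely for illustration. An unmatched start or a stray end is ignored here.

```python
def merge_body_parts(blocks, boundaries):
    """Join the blocks lying between each start boundary and the next end.

    `blocks` is a list of text-block strings; `boundaries` is a list of
    (kind, block_index) tuples in top-to-bottom order.
    """
    parts = []
    start = None
    for kind, index in boundaries:
        if kind == "start" and start is None:
            start = index
        elif kind == "end" and start is not None:
            # Include the block that carries the end boundary itself.
            parts.extend(blocks[start:index + 1])
            start = None
    return "\n".join(parts)
```

For the boundaries read off Fig. 3 (starts at 3 and 11, ends at 7 and 14), the result contains blocks 3 to 7 followed by blocks 11 to 14.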
Further, on the basis of the above method embodiment, the method for identifying and processing web page body text further includes:
S104: if the numbers of start boundaries and end boundaries of the body text are equal and the start boundaries and end boundaries alternate, determining that the start and end boundaries of the body text are correctly identified.
Specifically, if the numbers of identified start boundaries and end boundaries are unequal, or at least 2 start boundaries appear consecutively, or at least 2 end boundaries appear consecutively, the identified start and end boundaries of the body text are wrong. The identified start and end boundaries must therefore be verified as correct in order to obtain the correct body text.
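The consistency check of S104 can be sketched with the two conditions stated above: equal counts and strict alternation. The function name and the (kind, block index) input shape are assumptions. The sketch does not additionally require the sequence to begin with a start boundary, since the source does not state that condition.

```python
def boundaries_are_consistent(boundaries):
    """Check that start and end boundaries are equal in number and alternate."""
    kinds = [kind for kind, _ in boundaries]
    if kinds.count("start") != kinds.count("end"):
        return False
    # No two neighbouring boundaries may be of the same kind.
    return all(a != b for a, b in zip(kinds, kinds[1:]))
```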
Further, on the basis of the above method embodiment, S101 specifically includes:
crawling the source code of the web page to be identified with a web crawler, identifying all web page tags in the source code according to preset tags, and removing the tags to obtain web page text that includes blank lines.
Specifically, the HTML source code of the web page is first crawled with a web crawler. The tags are then removed from the HTML, leaving the text on the page together with all blank lines that result from removing the tags. Next, the distribution of blank lines, i.e. the blank-line blocks, is tallied; for example, 5 consecutive blank lines form one blank-line block. At the same time, the characters between blank-line blocks, i.e. in each text block, are counted, and the character-count difference between each pair of adjacent text blocks is calculated. Finally, the start and end positions of the body text are located: a threshold is set, and positions with many blank lines and a large character-count difference are selected as the start or end boundary of a body text part. If the difference is positive, the boundary is a start boundary; otherwise, it is an end boundary. All body text parts are merged in top-to-bottom order to form the body text of the web page.
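The whole walkthrough can be condensed into one end-to-end sketch. Everything named here (extract_body, the min_blank_lines parameter, the regular expression) is hypothetical, crawling is left out so the function takes the HTML as a string, and the threshold is passed in directly. The "many blank lines" criterion is modelled as a minimum size for the blank-line block preceding a text block, which is one possible reading of the description.

```python
import re
from itertools import groupby

# Assumption: a tag is any "<...>" run that opens and closes on one line.
TAG = re.compile(r"<[^>]+>")


def extract_body(html, threshold, min_blank_lines=1):
    """Strip tags, split into blocks, find boundaries, merge body parts."""
    # 1. Remove tags line by line; tag-only lines become blank lines.
    lines = [TAG.sub("", line).strip() for line in html.splitlines()]

    # 2. Build text blocks as (text, char_count, blank_lines_before).
    blocks, gap = [], 0
    for is_blank, group in groupby(lines, key=lambda line: line == ""):
        group = list(group)
        if is_blank:
            gap = len(group)
        else:
            blocks.append(("\n".join(group), sum(len(g) for g in group), gap))
            gap = 0

    # 3. Walk adjacent blocks; the sign of the difference picks start or end.
    parts, start = [], None
    for i in range(1, len(blocks)):
        diff = blocks[i][1] - blocks[i - 1][1]
        if abs(diff) <= threshold or blocks[i][2] < min_blank_lines:
            continue
        if diff > 0 and start is None:
            start = i
        elif diff < 0 and start is not None:
            # The end boundary is the end of block i - 1.
            parts.extend(block[0] for block in blocks[start:i])
            start = None

    # 4. Merge the body parts in top-to-bottom order.
    return "\n".join(parts)
```

A start boundary that is never followed by an end boundary is dropped by this sketch, so a page whose body text runs to the very last block would need extra handling.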
The method provided by this embodiment does not depend on DOM parsing. It locates the boundaries of the body text by counting blank regions and text density, which improves the generalization of body text extraction while maintaining high accuracy.
Fig. 4 shows a schematic structural diagram of a device for identifying and processing web page body text according to this embodiment. The device includes a tag clearing module 401, a text blocking module 402, and a boundary determining module 403, wherein:
the tag clearing module 401 is configured to acquire the source code of a web page to be identified and remove all web page tags from the source code to obtain web page text that includes blank lines;
the text blocking module 402 is configured to partition the web page text at the blank lines to obtain a plurality of text blocks, the text blocks being separated by blank lines;
the boundary determining module 403 is configured to count the characters in each text block, determine the boundaries of the body text according to the character count of each text block, and identify the body text of the web page to be identified according to those boundaries.
Specifically, the tag clearing module 401 acquires the source code of the web page to be identified and removes all web page tags from the source code to obtain web page text that includes blank lines; the text blocking module 402 partitions the web page text at the blank lines to obtain a plurality of text blocks separated by blank lines; the boundary determining module 403 counts the characters in each text block, determines the boundaries of the body text according to the character count of each text block, and identifies the body text of the web page to be identified according to those boundaries.
This embodiment determines the boundaries of the body text by counting the characters in each text block and identifies the body text of the web page to be identified according to those boundaries. It is intended to apply to web pages of all types, keeps the extraction process simple, and improves the accuracy and generalization of body text extraction.
Further, on the basis of the above device embodiment, the boundary determining module 403 is specifically configured to:
calculate a character-count difference between the current text block and the previous text block from their respective character counts;
if the character-count difference is greater than a threshold, determine that the start position of the current text block or the end position of the previous text block is a boundary of the body text.
Further, on the basis of the above device embodiment, the boundary determining module 403 is specifically configured to:
if the character-count difference is positive, determine that the start position of the current text block is a start boundary of the body text;
and if the character-count difference is negative, determine that the end position of the previous text block is an end boundary of the body text.
Further, on the basis of the above device embodiment, the boundary determining module 403 is specifically configured to:
identify the text between each start boundary and the next end boundary as a body text part;
and merge all body text parts to obtain the body text of the web page to be identified.
Further, on the basis of the above device embodiment, the device for identifying and processing web page body text further includes:
a boundary judging module, configured to determine that the start and end boundaries of the body text are correctly identified if the numbers of start boundaries and end boundaries are equal and the start boundaries and end boundaries alternate.
Further, on the basis of the above device embodiment, the threshold is determined from the blank-block average or the text-block average;
wherein the blank-block average is the mean of the character counts of the blank blocks;
and the text-block average is the mean of the character counts of the text blocks.
Further, on the basis of the above device embodiment, the tag clearing module 401 is specifically configured to:
crawl the source code of the web page to be identified with a web crawler, identify all web page tags in the source code according to preset tags, and remove the tags to obtain web page text that includes blank lines.
The device for identifying and processing the text of the web page in this embodiment may be used to execute the method embodiment, and its principle and technical effects are similar, and are not described herein again.
Referring to Fig. 5, the electronic device includes a processor 501, a memory 502, and a bus 503;
wherein,
the processor 501 and the memory 502 communicate with each other via the bus 503;
the processor 501 is configured to invoke the program instructions in the memory 502 to perform the methods provided by the method embodiments described above.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the method embodiments described above.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments.
The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (14)
1. A method for recognizing and processing webpage body text, characterized by comprising the following steps:
acquiring the webpage source code of a webpage to be recognized, and clearing all webpage tags in the webpage source code to obtain webpage text comprising blank lines;
partitioning the webpage text into blocks according to the blank lines to obtain a plurality of text blocks, wherein the text blocks are separated by blank lines;
counting the number of characters in each text block, determining the boundaries of the webpage body text according to the number of characters in each text block, and recognizing the webpage body text in the webpage to be recognized according to the boundaries of the webpage body text;
wherein determining the boundaries of the webpage body text according to the number of characters in each text block specifically comprises:
calculating, from the number of characters in the current text block and the number of characters in the previous text block, the difference between the two numbers of characters;
if the difference in the number of characters is greater than a threshold, determining that the starting position of the current text block or the ending position of the previous text block is a boundary of the webpage body text; and determining the starting boundary and the ending boundary of the webpage body text according to whether the difference in the number of characters is positive or negative.
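The blocking, counting, and differencing steps of claim 1 can be illustrated with a minimal Python sketch. It assumes the tags have already been cleared, counts only non-whitespace characters, and compares the magnitude of the difference with the threshold (an assumption, made because the claim also relies on the sign of the difference). The function names and the sample threshold are hypothetical and do not come from the patent.

```python
import re


def split_into_blocks(page_text: str) -> list[str]:
    """Split tag-free text into blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", page_text.strip())
    return [b for b in blocks if b.strip()]


def char_count(block: str) -> int:
    """Count the non-whitespace characters in a block (an assumed definition)."""
    return sum(1 for ch in block if not ch.isspace())


def find_boundaries(blocks: list[str], threshold: int) -> list[tuple[int, int]]:
    """Return (block_index, difference) for every jump larger than the threshold.

    The difference is count(current block) - count(previous block).
    Assumption: its magnitude is what gets compared with the threshold.
    """
    counts = [char_count(b) for b in blocks]
    boundaries = []
    for i in range(1, len(counts)):
        diff = counts[i] - counts[i - 1]
        if abs(diff) > threshold:
            boundaries.append((i, diff))
    return boundaries


if __name__ == "__main__":
    text = "Home\n\nAbout\n\n" + "x" * 200 + "\n\n" + "y" * 180 + "\n\nCopyright"
    print(find_boundaries(split_into_blocks(text), threshold=50))
```

In the sample, the short navigation blocks and the footer differ from the long blocks by far more than the threshold, so a boundary is reported on each side of the long run.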
2. The method for recognizing and processing webpage body text according to claim 1, wherein, if the difference in the number of characters is greater than the threshold, determining that the starting position of the current text block or the ending position of the previous text block is a boundary of the webpage body text comprises:
if the difference in the number of characters is positive, determining that the starting position of the current text block is a starting boundary of the webpage body text;
and if the difference in the number of characters is negative, determining that the ending position of the previous text block is an ending boundary of the webpage body text.
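The sign rule of claim 2 could be sketched as follows. The sketch takes a list of per-block character counts, and again assumes that the magnitude of the difference is compared with the threshold; the function name and the label strings are hypothetical.

```python
def classify_boundaries(counts: list[int], threshold: int) -> list[tuple[str, int]]:
    """Label each large jump in character count as a start or an end boundary.

    Returns ("start", i) when block i opens a body-text run (positive jump) and
    ("end", i - 1) when block i - 1 closes one (negative jump).
    """
    labels = []
    for i in range(1, len(counts)):
        diff = counts[i] - counts[i - 1]
        if abs(diff) <= threshold:
            continue  # jump too small to be a boundary
        if diff > 0:
            labels.append(("start", i))
        else:
            labels.append(("end", i - 1))
    return labels


if __name__ == "__main__":
    print(classify_boundaries([4, 5, 200, 180, 9], threshold=50))
```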
3. The method for recognizing and processing webpage body text according to claim 2, wherein recognizing the webpage body text in the webpage to be recognized according to the boundaries of the webpage body text specifically comprises:
recognizing the characters between each starting boundary of the webpage body text and the next ending boundary as a body text part;
and combining all the body text parts to obtain the webpage body text in the webpage to be recognized.
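One way to picture the combination step of claim 3 is the sketch below, which pairs each start boundary with the next end boundary and joins the enclosed blocks. The label format (`("start", i)` / `("end", i)`), the blank-line joiner, and the policy of ignoring a second start before an end are assumptions made for illustration.

```python
def extract_body(blocks: list[str], labels: list[tuple[str, int]]) -> str:
    """Join the blocks lying between each start boundary and the next end boundary."""
    parts = []
    start = None
    for kind, idx in labels:
        if kind == "start" and start is None:
            start = idx  # open a body-text part at this block
        elif kind == "end" and start is not None:
            # Close the part; the end index is inclusive.
            parts.append("\n\n".join(blocks[start: idx + 1]))
            start = None
    return "\n\n".join(parts)


if __name__ == "__main__":
    blocks = ["nav", "A1", "A2", "ad", "B1", "foot"]
    labels = [("start", 1), ("end", 2), ("start", 4), ("end", 4)]
    print(extract_body(blocks, labels))
```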
4. The method for recognizing and processing webpage body text according to claim 2, wherein the method further comprises:
if the starting boundaries and the ending boundaries of the webpage body text are equal in number and appear alternately, determining that the starting boundaries and the ending boundaries of the webpage body text have been correctly recognized.
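The consistency check of claim 4 can be sketched as a small predicate over the sequence of boundary kinds. It assumes that "alternately" means start, end, start, end, and so on, beginning with a start; the function name and the string labels are hypothetical.

```python
def boundaries_look_valid(kinds: list[str]) -> bool:
    """Check that start and end boundaries are equal in number and alternate."""
    if kinds.count("start") != kinds.count("end"):
        return False
    expected = "start"  # assumption: a valid sequence opens with a start boundary
    for kind in kinds:
        if kind != expected:
            return False
        expected = "end" if expected == "start" else "start"
    return True


if __name__ == "__main__":
    print(boundaries_look_valid(["start", "end", "start", "end"]))
    print(boundaries_look_valid(["start", "start", "end", "end"]))
```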
5. The method for recognizing and processing webpage body text according to claim 1, wherein the threshold is determined according to the average character count of the blank blocks or the average character count of the text blocks;
wherein the average character count of the blank blocks is the average of the number of characters in each blank block;
and the average character count of the text blocks is the average of the number of characters in each text block.
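Claim 5 says only that the threshold is determined according to an average; it does not give a formula. The sketch below shows one possible reading, in which the threshold is the mean character count of the text blocks scaled by a factor. The `factor` parameter and the function name are hypothetical, and the same function could be applied to blank-block counts.

```python
from statistics import mean


def threshold_from_average(counts: list[int], factor: float = 1.0) -> float:
    """Derive a threshold from the mean character count of the blocks.

    `factor` is an illustrative tuning knob, not something the claim specifies.
    """
    if not counts:
        raise ValueError("at least one block is needed to compute an average")
    return factor * mean(counts)


if __name__ == "__main__":
    print(threshold_from_average([4, 5, 200, 180, 9]))
```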
6. The method for recognizing and processing webpage body text according to claim 1, wherein acquiring the webpage source code of the webpage to be recognized and clearing all webpage tags in the webpage source code to obtain the webpage text comprising blank lines comprises:
crawling the webpage source code of the webpage to be recognized by means of a web crawler, identifying all webpage tags in the webpage source code according to preset tags, and clearing the webpage tags to obtain the webpage text comprising blank lines.
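A minimal sketch of the acquisition and tag-clearing step of claim 6, using only the Python standard library. The preset tag lists, the choice to turn block-level tags into blank lines, and the function names are assumptions for illustration; regular expressions are a simplification and are not a robust way to parse arbitrary HTML.

```python
import re
from html import unescape
from urllib.request import urlopen

# Hypothetical presets: tags dropped together with their content, and
# block-level tags that are turned into blank lines.
DROP_WITH_CONTENT = ("script", "style")
BLOCK_TAGS = ("p", "div", "br", "li", "ul", "ol", "tr", "table",
              "h1", "h2", "h3", "h4", "h5", "h6")


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Download the page source (a stand-in for the web crawler)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def clear_tags(source: str) -> str:
    """Remove tags and return text whose blocks are separated by blank lines."""
    for tag in DROP_WITH_CONTENT:
        source = re.sub(rf"<{tag}\b.*?</{tag}\s*>", "", source,
                        flags=re.DOTALL | re.IGNORECASE)
    source = re.sub(r"<!--.*?-->", "", source, flags=re.DOTALL)
    block_pattern = rf"</?(?:{'|'.join(BLOCK_TAGS)})\b[^>]*>"
    text = re.sub(block_pattern, "\n\n", source, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)  # drop every remaining tag
    text = unescape(text)                # decode entities such as &amp;
    text = "\n".join(line.strip() for line in text.splitlines())
    return re.sub(r"\n{3,}", "\n\n", text).strip()


if __name__ == "__main__":
    html = ("<html><head><style>p{color:red}</style></head><body>"
            "<div>Home</div><p>Hello <b>world</b> &amp; all</p></body></html>")
    print(clear_tags(html))
```

Inline tags such as `<b>` are removed without inserting a break, so a paragraph stays in one block, while each block-level element ends up separated from its neighbours by a blank line.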
7. A device for recognizing and processing webpage body text, characterized by comprising:
a tag clearing module, configured to acquire the webpage source code of a webpage to be recognized and clear all webpage tags in the webpage source code to obtain webpage text comprising blank lines;
a text blocking module, configured to partition the webpage text into blocks according to the blank lines to obtain a plurality of text blocks, wherein the text blocks are separated by blank lines;
a boundary determining module, configured to count the number of characters in each text block, determine the boundaries of the webpage body text according to the number of characters in each text block, and recognize the webpage body text in the webpage to be recognized according to the boundaries of the webpage body text;
wherein the boundary determining module is specifically configured to:
calculate, from the number of characters in the current text block and the number of characters in the previous text block, the difference between the two numbers of characters;
if the difference in the number of characters is greater than a threshold, determine that the starting position of the current text block or the ending position of the previous text block is a boundary of the webpage body text; and determine the starting boundary and the ending boundary of the webpage body text according to whether the difference in the number of characters is positive or negative.
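The three modules of claim 7 might be composed as in the following structural sketch. Each module is deliberately simplified (every tag becomes a blank line, characters are counted with `len`, and an unclosed body-text run is discarded); the class names mirror the claim, but the method names and the wrapper class `BodyTextRecognizer` are hypothetical.

```python
import re


class TagClearingModule:
    def run(self, source: str) -> str:
        # Simplification: every tag becomes a blank line.
        text = re.sub(r"<[^>]+>", "\n\n", source)
        return re.sub(r"\n\s*\n+", "\n\n", text).strip()


class TextBlockModule:
    def run(self, text: str) -> list[str]:
        return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


class BoundaryDeterminingModule:
    def __init__(self, threshold: int):
        self.threshold = threshold

    def run(self, blocks: list[str]) -> str:
        counts = [len(b) for b in blocks]
        body, start = [], None
        for i in range(1, len(counts)):
            diff = counts[i] - counts[i - 1]
            if diff > self.threshold and start is None:
                start = i                    # positive jump: start boundary
            elif -diff > self.threshold and start is not None:
                body.extend(blocks[start:i])  # negative jump: block i - 1 ends the run
                start = None
        return "\n\n".join(body)


class BodyTextRecognizer:
    """Chains the three modules in the order given in the claim."""

    def __init__(self, threshold: int):
        self.tags = TagClearingModule()
        self.blocks = TextBlockModule()
        self.boundary = BoundaryDeterminingModule(threshold)

    def recognize(self, source: str) -> str:
        return self.boundary.run(self.blocks.run(self.tags.run(source)))


if __name__ == "__main__":
    page = ("<div>Home</div><div>News</div><p>" + "a" * 120 + "</p><p>"
            + "b" * 110 + "</p><div>Footer</div>")
    print(BodyTextRecognizer(threshold=50).recognize(page))
```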
8. The device for recognizing and processing webpage body text according to claim 7, wherein the boundary determining module is specifically configured to:
if the difference in the number of characters is positive, determine that the starting position of the current text block is a starting boundary of the webpage body text;
and if the difference in the number of characters is negative, determine that the ending position of the previous text block is an ending boundary of the webpage body text.
9. The device for recognizing and processing webpage body text according to claim 8, wherein the boundary determining module is specifically configured to:
recognize the characters between each starting boundary of the webpage body text and the next ending boundary as a body text part;
and combine all the body text parts to obtain the webpage body text in the webpage to be recognized.
10. The device according to claim 8, wherein the device further comprises:
a boundary judging module, configured to determine that the starting boundaries and the ending boundaries of the webpage body text have been correctly recognized if the starting boundaries and the ending boundaries are equal in number and appear alternately.
11. The device according to claim 7, wherein the threshold is determined according to the average character count of the blank blocks or the average character count of the text blocks;
wherein the average character count of the blank blocks is the average of the number of characters in each blank block;
and the average character count of the text blocks is the average of the number of characters in each text block.
12. The device for recognizing and processing webpage body text according to claim 7, wherein the tag clearing module is specifically configured to:
crawl the webpage source code of the webpage to be recognized by means of a web crawler, identify all webpage tags in the webpage source code according to preset tags, and clear the webpage tags to obtain the webpage text comprising blank lines.
13. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for recognizing and processing webpage body text according to any one of claims 1 to 6.
14. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for recognizing and processing webpage body text according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910945459.1A CN110795933B (en) | 2019-09-30 | 2019-09-30 | Webpage text recognition processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910945459.1A CN110795933B (en) | 2019-09-30 | 2019-09-30 | Webpage text recognition processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110795933A CN110795933A (en) | 2020-02-14 |
CN110795933B CN110795933B (en) | 2023-10-31 |
Family
ID=69438918
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910945459.1A (Active) CN110795933B (en) | 2019-09-30 | 2019-09-30 | Webpage text recognition processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110795933B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112230989B (en) * | 2020-12-14 | 2021-03-12 | 北京智慧星光信息技术有限公司 | Webpage channel navigation bar extraction method, system, electronic equipment and storage medium |
CN113537091B (en) * | 2021-07-20 | 2024-05-03 | 东莞盟大集团有限公司 | Webpage text recognition method and device, electronic equipment and storage medium |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104598577A (en) * | 2015-01-14 | 2015-05-06 | 晶赞广告(上海)有限公司 | Extraction method for webpage text |
CN105022803A (en) * | 2015-07-01 | 2015-11-04 | 广州市万隆证券咨询顾问有限公司 | Method and system for extracting text content of webpage |
CN105183801A (en) * | 2015-08-25 | 2015-12-23 | 北京信息科技大学 | Web page body text extraction method and apparatus |
WO2017080090A1 (en) * | 2015-11-14 | 2017-05-18 | 孙燕群 | Extraction and comparison method for text of webpage |
CN105512225A (en) * | 2015-11-30 | 2016-04-20 | 北大方正集团有限公司 | Method and device extracting main content from webpage |
CN108763591A (en) * | 2018-06-21 | 2018-11-06 | 湖南星汉数智科技有限公司 | A kind of webpage context extraction method, device, computer installation and computer readable storage medium |
CN109086361A (en) * | 2018-07-20 | 2018-12-25 | 北京开普云信息科技有限公司 | A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint |
CN109543126A (en) * | 2018-11-19 | 2019-03-29 | 四川长虹电器股份有限公司 | Web page text information extracting method based on block text accounting |
Non-Patent Citations (2)
Title |
---|
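Liao Jianjun. Automatic extraction of webpage body text based on tag style and density model (基于标签样式和密度模型的网页正文自动抽取). Information Science (情报科学). 2018, (07), full text. * |
Shen Jinzhi; Kou Wenbo; Tian Chengeng. Body text acquisition for Web archives based on feature-location boundary prediction (基于特征定位边界预测的Web档案正文采集). New Technology of Library and Information Service (现代图书情报技术). 2009, (12), full text. * |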
Also Published As
Publication number | Publication date |
---|---|
CN110795933A (en) | 2020-02-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8819028B2 (en) | System and method for web content extraction | |
CN102541874B (en) | Webpage text content extracting method and device | |
CN104598577B (en) | A kind of extracting method of Web page text | |
CN108920434B (en) | Universal webpage theme content extraction method and system | |
CN110390038B (en) | Page blocking method, device and equipment based on DOM tree and storage medium | |
WO2020000717A1 (en) | Web page classification method and device, and computer-readable storage medium | |
CN105320734B (en) | A kind of web page core content extracting method | |
CN105630941A (en) | Statistics and webpage structure based Web body text content extraction method | |
EP3349124A1 (en) | Method and system for generating parsed document from digital document | |
CN110795933B (en) | Webpage text recognition processing method and device | |
CN103605691A (en) | Device and method used for processing issued contents in social network | |
CN111797356B (en) | Webpage form information extraction method and device | |
CN109144513B (en) | Method for automatically extracting list page | |
CN112650910A (en) | Method, device, equipment and storage medium for determining website update information | |
CN106227770A (en) | A kind of intelligentized news web page information extraction method | |
CN104615728B (en) | A kind of webpage context extraction method and device | |
CN108694192B (en) | Webpage type judging method and device | |
US20130167018A1 (en) | Methods and Devices for Extracting Document Structure | |
CN112232075A (en) | Article release time identification method based on time format and webpage element characteristics | |
CN109472020A (en) | A kind of feature alignment Chinese word cutting method | |
CN103455572A (en) | Method and device for acquiring movie and television subjects from web pages | |
CN106897287B (en) | Webpage release time extraction method and device for webpage release time extraction | |
CN115391711B (en) | Webpage text information extraction method, device, equipment and medium | |
CN115238078A (en) | Webpage information extraction method, device, equipment and storage medium | |
CN112559929B (en) | Method, electronic device and medium for extracting webpage target information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088; Applicant after: Qianxin Technology Group Co.,Ltd.; Applicant after: Qianxin Wangshen information technology (Beijing) Co.,Ltd.; Address before: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088; Applicant before: Qianxin Technology Group Co.,Ltd.; Applicant before: LEGENDSEC INFORMATION TECHNOLOGY (BEIJING) Inc. |
GR01 | Patent grant | ||