CN113537091A - Webpage text recognition method and device, electronic equipment and storage medium

Info

Publication number: CN113537091A
Application number: CN202110823007.3A
Authority: CN
Inventors: 余良
Original assignee: Dongguan Mengda Plasticizing Science & Technology Co ltd
Current assignee: Dongguan Mengda Plasticizing Science & Technology Co ltd
Priority date: 2021-07-20
Filing date: 2021-07-20
Publication date: 2021-10-22
Anticipated expiration: 2041-07-20
Also published as: CN113537091B

Abstract

The application discloses a method and an apparatus for identifying the body text of a web page, an electronic device, and a storage medium. The method comprises: acquiring a web page text to be analyzed, wherein the web page text to be analyzed comprises character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval represents the length of the interval; calculating a reference interval length from all intervals of the web page text to be analyzed; filtering all intervals of the web page text to be analyzed by the reference interval length so as to retain the intervals whose length is greater than the reference interval length; and exhaustively searching the area between any two lines for its number of characters and its filtered intervals, and determining the web page body text according to the number of characters and the filtered intervals of each search area. The method and the apparatus have high identification accuracy and are applicable to various types of web pages.

Description

Webpage text recognition method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of web page design technologies, and in particular, to a method and an apparatus for identifying a text of a web page, an electronic device, and a storage medium.

Background

For a web page, the core content is typically the body text. Most web pages today contain not only body text but also tags, advertisements, links, plug-ins, and the like. To obtain the core content of a web page quickly, everything outside the body text must be removed, and the quality of the extracted body text directly affects the information available to a browser.

Existing methods for identifying the body text of a web page mainly analyze the page's HTML (HyperText Markup Language) source code. The body is extracted from the HTML source according to a preset rule, for example by delimiting the area that contains the most characters. However, because every web page is designed differently, such methods have a high error rate and cannot adapt to the various types of web pages.

Disclosure of Invention

The present application aims to solve the above technical problem, and provide a method, an apparatus, an electronic device and a storage medium for identifying a web page text, which not only have a high identification accuracy, but also are suitable for various types of web pages.

In order to achieve the above object, the present application discloses a method for identifying a text of a web page, which includes:

acquiring a web page text to be analyzed, wherein the web page text to be analyzed comprises character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval represents the length of the interval;

calculating to obtain a reference interval length according to all intervals of the webpage text to be analyzed;

filtering all intervals of the webpage text to be analyzed by using the reference interval length so as to reserve the interval with the length greater than the reference interval length;

and exhaustively searching the area between any two lines for its number of characters and its filtered intervals, and determining the web page body text according to the number of characters and the filtered intervals of each search area.

Optionally, the reference interval length is an average interval length of all intervals of the web page text to be analyzed.

Optionally, the determining the text of the web page according to the number of characters corresponding to each search area and all filtered intervals includes:

calculating the ratio of the number of the characters corresponding to each search area to the reference value to obtain the character density corresponding to each search area;

extracting a search area corresponding to the maximum character density as a webpage text;

wherein the reference value of a search area is obtained from all the filtered intervals in that search area, and the reference value takes its minimum when the search area contains no filtered interval.

Optionally, the reference value is the sum of the lengths of all the filtered intervals of each search area.

In order to achieve the above object, the present application also discloses an apparatus for identifying a text of a web page, comprising:

an acquisition module, configured to acquire a web page text to be analyzed, wherein the web page text to be analyzed comprises character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval represents the length of the interval;

a calculation module, configured to calculate a reference interval length according to all intervals of the web page text to be analyzed;

a filtering module, configured to filter all intervals of the web page text to be analyzed by the reference interval length, so as to retain the intervals whose length is greater than the reference interval length;

and a searching and determining module, configured to exhaustively search the area between any two lines for its number of characters and its filtered intervals, and to determine the web page body text according to the number of characters and the filtered intervals of each search area.

In order to achieve the above object, the present application also discloses an electronic device, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the method of identifying a body of a web page as described above via execution of the executable instructions.

In order to achieve the above object, the present application also discloses a computer readable storage medium having a computer program stored thereon, wherein the computer program is configured to implement the method for identifying a text of a web page as described above when executed by a processor.

In the present application, a reference interval length is calculated from all intervals of the web page text to be analyzed, and all intervals are filtered by it. For a web page body bounded by two boundary intervals, any internal intervals within the body can generally be filtered out completely or to a large extent, which eliminates or reduces the gaps between paragraphs and makes the body more concentrated, while the boundary intervals at the start and end of the body are essentially retained because they are longer. The web page body text can therefore be determined easily from the number of characters and the filtered intervals of each search area, which improves identification accuracy and suits various types of web pages.

Drawings

Fig. 1 is a schematic flowchart of a method for identifying a text of a web page according to an embodiment of the present application.

Fig. 2 is a schematic block diagram of an apparatus for recognizing text of a web page according to an embodiment of the present application.

Fig. 3 is a schematic block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to explain technical contents, structural features, implementation principles, and objects and effects of the present application in detail, the following detailed description is given with reference to the accompanying drawings in combination with the embodiments.

Referring to fig. 1, the present application discloses a method for identifying a text of a web page, which includes:

101. Acquire a web page text to be analyzed, wherein the web page text to be analyzed comprises character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval represents the length of the interval.

Specifically, the acquiring of the web page text to be analyzed includes:

acquiring a webpage source code;

eliminating the webpage labels of the webpage source codes;

and obtaining the webpage text to be analyzed.

Generally, web page tags are mainly HTML tags. They are used in web page development to mark up the page, carry no actual content, and occupy some of the lines of the page source. Deleting the tags from the web page source code therefore removes information irrelevant to the body text, which reduces the amount of information to be processed subsequently and improves identification efficiency.
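The tag-removal step can be sketched as follows. This is only an illustration, assuming Python's standard `html.parser` module is an acceptable stand-in for whatever tag stripper an implementation actually uses; the names `_TextOnly` and `strip_tags` are hypothetical, and the contents of script and style elements are not treated specially.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and drops every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags, including bare newlines.
        self.parts.append(data)


def strip_tags(source: str) -> list[str]:
    """Return the tag-free text as a list of lines, keeping blank lines."""
    parser = _TextOnly()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Whitespace-only lines become empty strings so they count as blank lines.
    return [line.strip() for line in text.split("\n")]


lines = strip_tags("<div>\n<p>Hello</p>\n\n\n<p>World</p>\n</div>")
```

The line breaks of the source are preserved on purpose, because the later steps depend on where the blank lines fall.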

Of course, the "web page text to be analyzed" is not limited to being generated in the above manner. Any manner is acceptable as long as the web page body text can be identified from the resulting web page text to be analyzed.

Specifically, a blank line is a line that contains no characters, and a character line is a line that contains characters. The web page text usually has a plurality of intervals, each corresponding to a single blank line or to a plurality of consecutive blank lines. For an interval corresponding to a single blank line, the length of the interval is one blank line. For an interval corresponding to a plurality of consecutive blank lines, the length is the number of those blank lines; for example, the interval corresponding to three consecutive blank lines has a length of three blank lines.
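The definition of an interval can be illustrated with a short sketch. The function name `find_intervals` and the `(start index, length)` representation are assumptions made for illustration; the source only defines an interval as a run of blank lines whose length is the number of blank lines.

```python
def find_intervals(lines: list[str]) -> list[tuple[int, int]]:
    """Return (start_index, length) for each run of consecutive blank lines."""
    intervals = []
    run_start = None
    for idx, line in enumerate(lines):
        if line.strip() == "":
            # A blank line either opens a new run or extends the current one.
            if run_start is None:
                run_start = idx
        elif run_start is not None:
            # A character line closes the current run.
            intervals.append((run_start, idx - run_start))
            run_start = None
    if run_start is not None:
        # The text ended while a run of blank lines was still open.
        intervals.append((run_start, len(lines) - run_start))
    return intervals


example = find_intervals(["a", "", "b", "", "", "", "c"])
```

Here `example` holds one interval of length 1 starting at index 1 and one interval of length 3 starting at index 3.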

102. Calculate a reference interval length according to all intervals of the web page text to be analyzed.

The reference interval length is set mainly so that all intervals can be screened against it. The present invention does not limit how the "reference interval length" is calculated. For example, it may be the average of the lengths of all intervals of the web page text to be analyzed, or their median, or it may take another specific form, as long as the method for identifying the web page body text of the present invention can be carried out with it.

In some embodiments, the reference interval length is an average interval length of all intervals of the web page text to be analyzed.

To illustrate the average interval length, assume that the web page text to be analyzed contains 6 intervals: interval A1 has length 3, interval A2 has length 5, interval A3 has length 1, interval A4 has length 1, interval A5 has length 6, and interval A6 has length 2. The average interval length of the web page text to be analyzed is then (3 + 5 + 1 + 1 + 6 + 2) / 6 = 3. It should be noted that the specific number of intervals and the interval lengths of the web page text to be analyzed are only examples for ease of understanding, and the present invention is not limited thereto.
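The arithmetic of this example can be checked with a few lines of Python. The function name `reference_interval_length` and its `method` parameter are hypothetical; the median branch only reflects the alternative mentioned in the description and is not a preferred choice stated by the source.

```python
from statistics import mean, median


def reference_interval_length(lengths: list[int], method: str = "mean") -> float:
    """Compute a reference interval length from all interval lengths."""
    if method == "mean":
        return mean(lengths)
    if method == "median":
        return median(lengths)
    raise ValueError(f"unsupported method: {method}")


# Interval lengths of the six intervals in the example, in order.
interval_lengths = [3, 5, 1, 1, 6, 2]
reference_mean = reference_interval_length(interval_lengths)              # 18 / 6
reference_median = reference_interval_length(interval_lengths, "median")  # (2 + 3) / 2
```

With the average, the reference interval length is 3 as in the example; with the median it would be 2.5, which would retain interval A1 as well.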

103. Filter all intervals of the web page text to be analyzed by the reference interval length, so as to retain the intervals whose length is greater than the reference interval length.

Generally speaking, the body area of the web page contains the largest total number of characters. The intervals inside the body area are mainly those that may exist between paragraphs, or between a second-level or lower-level heading and a paragraph, and they are usually short. A demarcation interval is usually placed at the start and at the end of the body area, and it is usually longer than any internal interval of the body area. Filtering with the reference interval length retains the intervals longer than the reference interval length and eliminates those whose length is less than or equal to it. The internal intervals of the body (between paragraphs, or between second-level or lower-level headings and paragraphs) are thus eliminated completely or to a large extent, so that the paragraphs become more compact and the characters more concentrated, while the demarcation intervals survive the filtering because they are longer. This facilitates the subsequent identification of the web page body text.

Continuing the above example: interval A1 has length 3, interval A2 (a demarcation interval) has length 5, interval A3 (an internal interval) has length 1, interval A4 (an internal interval) has length 1, interval A5 (a demarcation interval) has length 6, and interval A6 has length 2, so the average interval length of the web page text to be analyzed is 3. Filtering all intervals by the average interval length leaves intervals A2 and A5, of lengths 5 and 6 respectively.
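The filtering step of this example can be sketched as below. The dictionary representation and the name `filter_intervals` are assumptions for illustration; the strict "greater than" comparison follows the description, which is why interval A1 (length 3) is dropped when the reference length is 3.

```python
def filter_intervals(intervals: dict[str, int], reference: float) -> dict[str, int]:
    """Keep only the intervals whose length is strictly greater than the reference."""
    return {name: length for name, length in intervals.items() if length > reference}


# The six intervals of the example, labelled A1 to A6.
intervals = {"A1": 3, "A2": 5, "A3": 1, "A4": 1, "A5": 6, "A6": 2}
kept = filter_intervals(intervals, reference=3)
```

After filtering, `kept` contains only A2 and A5, matching the result stated in the example.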

104. Exhaustively search the area between any two lines for its number of characters and its filtered intervals, and determine the web page body text according to the number of characters and the filtered intervals of each search area (namely, the area between any two lines).

Generally speaking, the body area of the web page contains the most characters. After all intervals are filtered by the reference interval length, the characters of the body area are even more concentrated, while the boundary intervals at the start and end of the body are retained. The web page body text can therefore be determined easily and accurately from the number of characters and the filtered intervals of each search area.

Specifically, the exhaustive search for the number of characters corresponding to any two lines and all the intervals after filtering includes:

taking the total number of lines m of the web page text to be analyzed as the exhaustive search range, and exhaustively searching the number of characters and all filtered intervals between a first line number i and a second line number j, wherein the first line number i is greater than or equal to 0 and less than the second line number j, and the second line number j is less than or equal to the total number of lines m.
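The enumeration of search areas can be sketched as follows. Treating the area between line numbers i and j as the half-open slice `lines[i:j]` is an assumption of this sketch, as is the name `enumerate_regions`; the source only states the constraint that i is at least 0, i is less than j, and j is at most m.

```python
from itertools import combinations


def enumerate_regions(lines: list[str]):
    """Yield (i, j, character_count) for every pair with 0 <= i < j <= m."""
    m = len(lines)
    # combinations() yields each pair once, in order, with i < j.
    for i, j in combinations(range(m + 1), 2):
        yield i, j, sum(len(line) for line in lines[i:j])


regions = list(enumerate_regions(["ab", "", "cde"]))
```

For m lines this visits m(m + 1)/2 search areas, so a text of 30 lines yields 465 candidates. Recounting the characters for every pair is wasteful; a prefix-sum table would avoid it, but is left out to keep the sketch short.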

In some embodiments, the determining the text of the web page according to the number of characters corresponding to each search area and all filtered intervals includes:

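calculating the ratio of the number of characters corresponding to each search area to a reference value to obtain the character density corresponding to each search area;

extracting the search area corresponding to the maximum character density as the web page body text;

wherein the reference value of a search area is obtained from all the filtered intervals in that search area, and the reference value takes its minimum when the search area contains no filtered interval.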

Character density is generally the ratio of the number of characters to the corresponding number of lines. In a web page, however, some lines contain many characters and others contain few; for example, the last line of a paragraph may inevitably contain only a few characters. Character density calculated from the number of lines is therefore easily disturbed by the number of characters in each line, especially by lines with few characters, which lower the calculated density. Introducing a reference value for each search area to calculate the character density essentially avoids the influence of lines with few or many characters. Specifically, the reference value is obtained from the filtered intervals. Generally speaking, the body area of the web page contains the most characters, and any internal intervals in the body area are filtered out completely or to a large extent, so the reference value of the body area is relatively small. When the character density is calculated as the ratio of the number of characters to the reference value, the character density of the body area is therefore essentially the largest, which gives the identification method high accuracy. In addition, when a search area contains no filtered interval, its reference value is set to the minimum, for example 1, so that the character density of the search area equals the number of characters in it.

In some embodiments, the reference value is the sum of the lengths of all the intervals filtered for each search area.

Generally, the boundary intervals at the start and end of the web page body text are relatively long. When the reference value is set to the sum of the lengths of all filtered intervals in each search area, and the character density of each search area is calculated as the ratio of its number of characters to the reference value, the body area is clearly distinguished from other areas, which helps improve the accuracy of identifying the web page body text.

Specifically, when there is no corresponding filtered interval in the search area, the character density of the search area is the number of characters in the search area, i.e., the reference value is set to 1 (minimum reference value). Of course, this is merely the arrangement in the specific example, and is not so limited.
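The density rule described above can be written compactly. The symbols below are introduced here for illustration and do not appear in the source: N(R) is the number of characters in search area R, G(R) is the set of filtered intervals in R, and ℓ(g) is the length of interval g. The minimum reference value of 1 is the value used in the specific example, not a requirement.

```latex
\[
\rho(R) = \frac{N(R)}{V(R)},
\qquad
V(R) = \max\Bigl(1,\ \sum_{g \in G(R)} \ell(g)\Bigr),
\qquad
R^{*} = \arg\max_{R}\ \rho(R)
\]
```

When G(R) is empty the sum is 0, so V(R) = 1 and the density equals the character count; otherwise every interval length is at least 1 and V(R) is simply the sum of the lengths.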

Continuing the above example, assume that the web page text to be analyzed has 30 lines in total and contains 6 intervals: interval A1 has length 3, interval A2 (a demarcation interval) has length 5, interval A3 (an internal interval) has length 1, interval A4 (an internal interval) has length 1, interval A5 (a demarcation interval) has length 6, and interval A6 has length 2. The average interval length of the web page text to be analyzed is 3, and the filtered intervals are A2 (length 5) and A5 (length 6).

With the total number of lines, 30, as the range of the exhaustive search, the number of characters and all filtered intervals between a first line number i and a second line number j are searched exhaustively, and the character density of each search area is calculated as the ratio of its number of characters to the sum of the lengths of its filtered intervals, where the first line number i is greater than or equal to 0 and less than the second line number j, and the second line number j is less than or equal to the total number of lines, 30.

After the exhaustive search, the character densities of the search areas are compared. For convenience of illustration, only some of the search areas are described here. Assume that search area B1 runs from the start to interval A2 and has 10 characters; search area B2 runs from interval A2 to interval A5 (excluding intervals A2 and A5 themselves) and has 300 characters; and search area B3 runs from interval A5 to the end and has 60 characters. Search area B4 runs from the start to interval A5, search area B5 runs from interval A2 to the end, and search area B6 runs from the start to the end. The filtered intervals are interval A2, of length 5, and interval A5, of length 6. Since search area B2 contains no filtered interval, its reference value is 1 and its character density is 300/1 = 300. The character density of search area B4 is 310/5 = 62, and the character density of search area B5 is 360/6 = 60. Generally speaking, the body area of the web page contains the most characters, its internal intervals are filtered out completely or to a large extent, and its reference value is usually small, so that when the character density is calculated as the ratio of the number of characters to the reference value, the character density of the body area is essentially the largest. Search area B2, which has the maximum character density, is therefore extracted as the web page body text. It should be noted that the specific number of intervals, interval lengths, numbers of lines, and numbers of characters of the web page text to be analyzed are only examples for ease of understanding, and the present invention is not limited thereto.
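The comparison in this example can be reproduced with a short sketch. The character counts and filtered-interval lengths are taken from the example; the densities of B1 and B3 are not stated in the source and follow here from the assumption that neither area contains a filtered interval. B6 is left out because the source gives no character count for it. The name `density` is hypothetical.

```python
def density(char_count: int, filtered_lengths: list[int]) -> float:
    """Character count divided by the sum of filtered interval lengths (minimum 1)."""
    reference_value = sum(filtered_lengths) or 1
    return char_count / reference_value


# (character count, lengths of the filtered intervals inside the search area)
search_areas = {
    "B1": (10, []),    # assumed to contain no filtered interval
    "B2": (300, []),   # lies between A2 and A5, so it contains neither
    "B3": (60, []),    # assumed to contain no filtered interval
    "B4": (310, [5]),  # contains A2
    "B5": (360, [6]),  # contains A5
}

densities = {name: density(chars, gaps) for name, (chars, gaps) in search_areas.items()}
best = max(densities, key=densities.get)
```

Under these numbers `best` is B2, with a density of 300 against 62 for B4 and 60 for B5, which agrees with the area extracted in the example.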

In the present application, a reference interval length is calculated from all intervals of the web page text to be analyzed, and all intervals are filtered by it. For a web page body bounded by two boundary intervals, any internal intervals within the body can generally be filtered out completely or to a large extent, which eliminates or reduces the gaps between paragraphs and makes the body more concentrated, while the boundary intervals at the start and end of the body are essentially retained because they are longer. In addition, intervals as long as or longer than the boundary intervals may exist outside the body, so identifying the body from the two boundary intervals alone would affect identification accuracy to some extent. By also using the number of characters and the filtered intervals of each search area, the web page body text is determined easily, identification accuracy is improved, and the method suits various types of web pages.

Referring to fig. 2, an embodiment of the present application further discloses a device for identifying a text of a web page, which includes an obtaining module 10, a calculating module 11, a filtering module 12, and a searching and determining module 13.

The obtaining module 10 is configured to obtain a web page text to be analyzed, where the web page text to be analyzed includes character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval represents the length of the interval.

The calculation module 11 is configured to calculate a reference interval length according to all intervals of the web page text to be analyzed.

The filtering module 12 is configured to filter all intervals of the web page text to be analyzed by using the reference interval length, so as to reserve intervals with a length greater than the reference interval length.

The searching and determining module 13 is configured to exhaustively search the area between any two lines for its number of characters and its filtered intervals, and to determine the web page body text according to the number of characters and the filtered intervals of each search area.

Further, the reference interval length is an average interval length of all intervals of the web page text to be analyzed.

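Further, determining the web page body text according to the number of characters corresponding to each search area and all filtered intervals includes:

calculating the ratio of the number of characters corresponding to each search area to a reference value to obtain the character density corresponding to each search area;

extracting the search area corresponding to the maximum character density as the web page body text;

wherein the reference value of a search area is obtained from all the filtered intervals in that search area, and the reference value takes its minimum when the search area contains no filtered interval.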

Further, the reference value is the sum of the lengths of all the intervals filtered by each search area.

For a specific description of the apparatus for identifying the web page body text, refer to the detailed description of the identification method above; it is not repeated here.
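One way the four modules might fit together is sketched below. This is not the applicant's implementation: the class name, the method names, the rule that a filtered interval counts only when it lies wholly inside a search area, and the dropping of blank lines from the result are all assumptions made for illustration. The input is assumed to be text from which the tags have already been removed.

```python
from statistics import mean


class BodyTextRecognizer:
    """Illustrative stand-in for modules 10 to 13; every name here is hypothetical."""

    def acquire(self, text):
        # Obtaining module 10: split into lines; whitespace-only lines become blank.
        return [line.strip() for line in text.split("\n")]

    def find_intervals(self, lines):
        # Each run of consecutive blank lines is one interval: (start index, length).
        runs, start = [], None
        for idx, line in enumerate(lines):
            if line == "":
                if start is None:
                    start = idx
            elif start is not None:
                runs.append((start, idx - start))
                start = None
        if start is not None:
            runs.append((start, len(lines) - start))
        return runs

    def reference_length(self, runs):
        # Calculating module 11: average interval length (0 if there are no intervals).
        return mean(length for _, length in runs) if runs else 0

    def filter_intervals(self, runs, reference):
        # Filtering module 12: keep intervals strictly longer than the reference.
        return [run for run in runs if run[1] > reference]

    def search(self, lines, kept):
        # Searching and determining module 13: try every area lines[i:j].
        best, best_density = None, -1.0
        for i in range(len(lines)):
            for j in range(i + 1, len(lines) + 1):
                chars = sum(len(line) for line in lines[i:j])
                # Assumption: an interval counts only if it lies wholly inside the area.
                inside = [n for s, n in kept if i <= s and s + n <= j]
                density = chars / (sum(inside) or 1)
                if density > best_density:
                    best, best_density = (i, j), density
        return best

    def recognize(self, text):
        lines = self.acquire(text)
        runs = self.find_intervals(lines)
        kept = self.filter_intervals(runs, self.reference_length(runs))
        i, j = self.search(lines, kept)
        # Assumption: blank lines are dropped from the returned area.
        return [line for line in lines[i:j] if line]


page = "nav\n\n\n\nParagraph one of the body.\n\nParagraph two of the body.\n\n\n\nfooter"
body = BodyTextRecognizer().recognize(page)
```

In this small input the three-line gaps around the two paragraphs survive filtering, the single blank line between the paragraphs does not, and the two paragraphs are returned as the body. Ties between areas of equal density are resolved by taking the first one found, a choice the source does not address.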

Referring to fig. 3, an embodiment of the present application further discloses an electronic device, which includes:

a processor 21;

a memory 20 having stored therein executable instructions of the processor 21;

wherein the processor 21 is configured to execute the above-mentioned identification method of the body of the web page via executing the executable instructions.

The embodiment of the application also discloses a computer readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for identifying the text of the webpage is realized.

The embodiment of the application also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the method for identifying the text of the webpage.

It should be understood that, in the embodiments of the present application, the processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general-purpose processor may be a microprocessor or any conventional processor.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer program instructions, and the programs can be stored in a computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above disclosure is only a preferred embodiment of the present application and should not be taken as limiting the scope of the present application; equivalent changes made in accordance with the claims of the present application therefore still fall within the scope covered by the present application.

Claims

1. A method for identifying a webpage text is characterized by comprising the following steps:

2. The method for identifying the text of a web page of claim 1,

the reference interval length is an average interval length of all intervals of the web page text to be analyzed.

3. The method for identifying the text of a web page of claim 1,

determining the web page text according to the number of characters corresponding to each search area and all filtered intervals, including:

4. A method of identifying the text of a web page as recited in claim 3,

the reference value is the sum of the lengths of all the intervals filtered by each search area.

5. An apparatus for recognizing text of a web page, comprising:

6. The apparatus for recognizing text of web page according to claim 5,

7. The apparatus for recognizing text of web page according to claim 5,

8. The apparatus for recognizing text of web page according to claim 7,

9. An electronic device, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the method of identifying a body of a web page of any one of claims 1-4 via execution of the executable instructions.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method for identifying the text of a web page according to any one of claims 1 to 4.