CN113537091B

CN113537091B - Webpage text recognition method and device, electronic equipment and storage medium

Info

Publication number: CN113537091B
Application number: CN202110823007.3A
Authority: CN
Inventors: 余良
Original assignee: Dongguan Mengda Group Co ltd
Current assignee: Dongguan Mengda Group Co ltd
Priority date: 2021-07-20
Filing date: 2021-07-20
Publication date: 2024-05-03
Anticipated expiration: 2041-07-20
Also published as: CN113537091A

Abstract

The application discloses a method and device for identifying the body text of a web page, an electronic device, and a storage medium. The method comprises: acquiring a web page text to be analyzed, wherein the web page text to be analyzed comprises character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval is the length of that interval; calculating a reference interval length from all intervals of the web page text to be analyzed; filtering all intervals of the web page text to be analyzed using the reference interval length, so as to retain the intervals whose length is greater than the reference interval length; and exhaustively searching the areas between any two lines for the number of characters and the filtered intervals they contain, and determining the body text of the web page from the number of characters and the filtered intervals of each search area. The method has high recognition accuracy and is applicable to many types of web pages.

Description

Webpage text recognition method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of web page design technologies, and in particular to a method and device for identifying the body text of a web page, an electronic device, and a storage medium.

Background

For a web page, the core content is typically the body text. Besides the body text, most web pages today also contain tags, advertisements, links, plug-ins and similar content. To obtain the core content of a web page quickly, everything outside the body text must be removed, and the quality of the extracted body text directly affects the information available to the person browsing the page.

At present, the body text of a web page is mainly identified by analyzing the HTML (HyperText Markup Language) source code of the web page: the body text is extracted from the HTML source code according to preset rules, for example the maximum number of characters or region demarcation. However, web pages differ in design, so this approach has a high error rate and cannot adapt to the various types of web pages.

Disclosure of Invention

The application aims to provide a method and device for identifying the body text of a web page, an electronic device, and a storage medium, which solve the above technical problems, offer high recognition accuracy, and are applicable to many types of web pages.

In order to achieve the above object, the present application discloses a method for identifying a web page text, which includes:

acquiring a web page text to be analyzed, wherein the web page text to be analyzed comprises character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval is the length of that interval;

calculating a reference interval length from all intervals of the web page text to be analyzed;

filtering all intervals of the web page text to be analyzed using the reference interval length, so as to retain the intervals whose length is greater than the reference interval length;

exhaustively searching the areas between any two lines for the number of characters and all the filtered intervals they contain, and determining the body text of the web page from the number of characters and the filtered intervals of each search area.

Optionally, the reference interval length is an average interval length of all intervals of the web page text to be analyzed.

Optionally, determining the body text of the web page from the number of characters and the filtered intervals of each search area includes:

calculating the ratio of the number of characters of each search area to a reference value to obtain the character density of each search area;

extracting the search area with the maximum character density as the body text of the web page;

wherein the reference value is obtained from all the filtered intervals of each search area, and the reference value takes its minimum when the number of filtered intervals of a search area is zero.

Optionally, the reference value is a sum of lengths of all intervals after filtering of each search area.

In order to achieve the above object, the present application also discloses a device for identifying a text of a web page, which includes:

an acquisition module, configured to acquire a web page text to be analyzed, wherein the web page text to be analyzed comprises character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval is the length of that interval;

a calculation module, configured to calculate a reference interval length from all intervals of the web page text to be analyzed;

a filtering module, configured to filter all intervals of the web page text to be analyzed using the reference interval length, so as to retain the intervals whose length is greater than the reference interval length;

a searching and determining module, configured to exhaustively search the areas between any two lines for the number of characters and all the filtered intervals they contain, and to determine the body text of the web page from the number of characters and the filtered intervals of each search area.

In order to achieve the above object, the present application also discloses an electronic device, which includes:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the method of identifying a web page text as described above via execution of the executable instructions.

In order to achieve the above object, the present application also discloses a computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the method for identifying a text of a web page as described above.

According to the application, a reference interval length is calculated from all intervals of the web page text to be analyzed and is used to filter those intervals. Where the body text is bounded by two demarcation intervals, the intervals inside the body text are normally filtered out completely or largely, so the gaps between paragraphs are eliminated or reduced and the body text becomes more concentrated, while the demarcation intervals at the start and end positions of the body text are essentially retained because they are longer. The body text of the web page is therefore easy to determine from the number of characters and the filtered intervals of each search area, which improves recognition accuracy and makes the method applicable to many types of web pages.

Drawings

Fig. 1 is a flow chart of a method for identifying a web page text according to an embodiment of the application.

Fig. 2 is a schematic block diagram of a device for identifying text of a web page according to an embodiment of the present application.

Fig. 3 is a schematic block diagram of an electronic device according to an embodiment of the application.

Detailed Description

In order to describe the technical content, structural features, implementation principles and achieved objects and effects of the present application in detail, the following description is made in connection with the embodiments and the accompanying drawings.

Referring to fig. 1, the application discloses a method for identifying a web page text, which comprises the following steps:

101. Acquire a web page text to be analyzed, wherein the web page text to be analyzed comprises character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval is the length of that interval.

Specifically, acquiring the web page text to be analyzed includes:

acquiring the web page source code;

removing the web page tags from the web page source code;

obtaining the web page text to be analyzed.

Generally, the web page tags are mainly HTML tags. They are used in developing the web page to provide markup, carry no actual content, and occupy some of the lines of the web page text. Deleting the web page tags from the web page source code therefore reduces the information irrelevant to the body text, reduces the amount of information to be processed subsequently, and improves identification efficiency.
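The tag-removal step can be sketched as follows. This is only an illustration under stated assumptions: the name `strip_tags` is hypothetical, Python's standard `html.parser` module stands in for whatever tag remover an implementation actually uses, and the result is returned as a list of lines so that blank lines remain visible for the later steps.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called with the raw text found between tags, newlines included.
        self.parts.append(data)


def strip_tags(source: str) -> list[str]:
    """Hypothetical helper: remove tags and return the remaining text as lines."""
    parser = _TextCollector()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).split("\n")


lines = strip_tags("<p>Hello</p>\n\n<div>World</div>")
print(lines)
```

Note that this sketch keeps everything that is not a tag, including the contents of script or style elements; whether those should also be dropped is not specified in the text above.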

Of course, the "web page text to be analyzed" is not limited to being generated in the manner described above. Any manner is acceptable as long as the method for identifying the body text of a web page can be executed on the resulting web page text to be analyzed.

Specifically, a blank line is a line containing no characters, and a character line is a line containing characters. Web page text typically contains a plurality of intervals, each of which corresponds to one blank line or to several consecutive blank lines. The interval formed by a single blank line has a length of one blank line. For an interval formed by several consecutive blank lines, the interval length is the number of blank lines in it; for example, an interval formed by three consecutive blank lines has a length of three blank lines.
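The definition of an interval can be illustrated with a short sketch. The function name `find_intervals` and the `(start, length)` tuple layout are assumptions made for this example, as is the choice to treat whitespace-only lines as blank.

```python
def find_intervals(lines):
    """Return (start_index, length) for every run of consecutive blank lines."""
    intervals = []
    run_start = None
    for index, line in enumerate(lines):
        if line.strip() == "":          # assumption: whitespace-only counts as blank
            if run_start is None:
                run_start = index       # a new interval begins here
        elif run_start is not None:
            intervals.append((run_start, index - run_start))
            run_start = None
    if run_start is not None:           # the text ends inside an interval
        intervals.append((run_start, len(lines) - run_start))
    return intervals


sample = ["title", "", "paragraph", "", "", "", "footer"]
print(find_intervals(sample))
```

In the sample, the single blank line forms an interval of length 1 and the three consecutive blank lines form an interval of length 3.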

102. Calculate a reference interval length from all intervals of the web page text to be analyzed.

The reference interval length is set mainly so that all intervals can be screened against it. The invention does not limit how the reference interval length is calculated. For example, it may be the average of the lengths of all intervals of the web page text to be analyzed, or their median (intermediate value), or it may take any other specific form, as long as the method for identifying the body text of a web page according to the invention can be executed with the resulting "reference interval length".

In some embodiments, the reference interval length is the average interval length of all intervals of the web page text to be analyzed.

The average interval length is illustrated here. Assume the web page text to be analyzed contains 6 intervals: interval A1 has length 3, interval A2 has length 5, interval A3 has length 1, interval A4 has length 1, interval A5 has length 6, and interval A6 has length 2. The average interval length of the web page text to be analyzed is then (3 + 5 + 1 + 1 + 6 + 2) / 6 = 3. Note that the specific number of intervals and their lengths are merely examples for ease of understanding, and the present invention is not limited thereto.
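The calculation can be reproduced with a few lines of code. The helper name `reference_interval_length` and its `method` parameter are hypothetical; the median branch only illustrates the alternative mentioned above and is not prescribed by the text.

```python
from statistics import mean, median

# Interval lengths taken from the example: A1..A6.
interval_lengths = {"A1": 3, "A2": 5, "A3": 1, "A4": 1, "A5": 6, "A6": 2}


def reference_interval_length(lengths, method="mean"):
    """Hypothetical helper: reduce all interval lengths to one reference length."""
    values = list(lengths)
    if not values:
        return 0  # assumption: no intervals means nothing to filter
    return mean(values) if method == "mean" else median(values)


# Average: (3 + 5 + 1 + 1 + 6 + 2) / 6 = 3
print(reference_interval_length(interval_lengths.values()))
```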

103. Filter all intervals of the web page text to be analyzed using the reference interval length, so as to retain the intervals whose length is greater than the reference interval length.

Typically, the body text region of a web page contains the largest total number of characters. The intervals inside the body text region are mainly intervals between paragraphs, or between second-level or lower-level headings and paragraphs, and these are usually short. The start and end positions of the body text region are usually marked by demarcation intervals, which are usually longer than the internal intervals of the body text region. Filtering with the reference interval length retains the intervals longer than the reference interval length and eliminates those shorter than or equal to it. This removes the internal intervals of the body text completely or to a great extent, so the paragraphs become more compact and the characters more concentrated, while the demarcation intervals survive the filtering because they are longer. This makes the subsequent identification of the body text easier.

Continuing the above example: interval A1 has length 3, interval A2 (a demarcation interval) has length 5, interval A3 (an internal interval) has length 1, interval A4 (an internal interval) has length 1, interval A5 (a demarcation interval) has length 6, interval A6 has length 2, and the average interval length is 3. Filtering all intervals with the average interval length leaves intervals A2 (length 5) and A5 (length 6).
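The filtering step on the example values can be sketched as below. The dictionary layout is an assumption for illustration; the comparison is strictly "greater than", so interval A1, whose length equals the reference length, is eliminated.

```python
# Interval lengths and reference length taken from the example.
intervals = {"A1": 3, "A2": 5, "A3": 1, "A4": 1, "A5": 6, "A6": 2}
reference_length = 3

# Keep only intervals strictly longer than the reference interval length.
retained = {name: length for name, length in intervals.items()
            if length > reference_length}

print(retained)
```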

104. Exhaustively search the areas between any two lines for the number of characters and all the filtered intervals they contain, and determine the body text of the web page from the number of characters and the filtered intervals of each search area (namely, the area between any two lines).

Generally, the body text region contains the most characters. Filtering all intervals with the reference interval length further concentrates its characters while retaining the demarcation intervals at its start and end positions. The body text of the web page is therefore easy to determine, with high accuracy, from the number of characters and the filtered intervals of each search area.

Specifically, exhaustively searching the areas between any two lines for the number of characters and all the filtered intervals includes:

taking the total number of lines m of the web page text to be analyzed as the exhaustive search range, and exhaustively searching for the number of characters and all the filtered intervals between a first line number i and a second line number j, where the first line number i is greater than or equal to 0, the first line number i is less than the second line number j, and the second line number j is less than or equal to the total number of lines m.
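The exhaustive search range can be enumerated as in the sketch below. It assumes that i and j act as line boundaries, so that the search area for a pair (i, j) is the slice `lines[i:j]`; the text above only fixes the constraint 0 <= i < j <= m, and the function name `search_areas` is hypothetical.

```python
from itertools import combinations


def search_areas(total_lines):
    """Yield every pair (i, j) with 0 <= i < j <= total_lines."""
    return combinations(range(total_lines + 1), 2)


pairs = list(search_areas(30))
print(len(pairs))           # number of candidate search areas for m = 30
print(pairs[0], pairs[-1])  # first and last candidate
```

For m lines there are m(m + 1)/2 candidate areas, so the search grows quadratically with the length of the page.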

In some embodiments, determining the body text of the web page from the number of characters and the filtered intervals of each search area includes:

calculating the ratio of the number of characters of each search area to a reference value to obtain the character density of each search area;

extracting the search area with the maximum character density as the body text of the web page;

wherein the reference value is obtained from all the filtered intervals of each search area, and the reference value takes its minimum when the number of filtered intervals of a search area is zero.

Character density is conventionally the ratio of the number of characters to the number of corresponding lines. However, some lines in a web page contain many characters and others few; for example, the last line of a paragraph may hold only a few characters. A density computed from the number of lines is therefore easily disturbed by the number of characters per line, especially by lines with few characters, which make the calculated density too small. Introducing a reference value for each search area to calculate the character density largely avoids the influence of lines with unusually few or many characters. Specifically, the reference value is obtained from the filtered intervals. The body text region generally contains the most characters, and if its internal intervals are completely or largely filtered out, the reference value of the body text region is relatively small. When the character density is calculated as the ratio of the number of characters to the reference value, the character density of the body text region is therefore essentially the largest, which gives the method of the application high accuracy. In addition, when a search area contains no filtered interval, its reference value is set to the minimum, for example 1, so that its character density equals its number of characters.

In some embodiments, the reference value is the sum of the lengths of all intervals after filtering for each search region.

In general, the demarcation intervals at the start and end positions of the body text are long. Setting the reference value to the sum of the lengths of all filtered intervals in each search area therefore makes the character density, calculated as the ratio of the number of characters to the reference value, clearly distinguish the body text region from other regions, which helps improve the accuracy of identifying the body text.

Specifically, when there is no corresponding filtered interval in the search area, the character density of the search area is the number of characters of the search area, i.e., the reference value is set to 1 (minimum reference value). Of course, this is merely an arrangement in a specific example, and is not limited thereto.
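The character density described above can be written compactly. The symbols below are notation introduced here for illustration and do not appear in the original: C(i, j) is the number of characters in the search area between lines i and j, G(i, j) is the set of filtered intervals inside that area, and ℓ_g is the length of interval g. The minimum reference value is taken as 1, as in the specific example.

```latex
\[
R(i,j) =
\begin{cases}
\displaystyle\sum_{g \in G(i,j)} \ell_g, & G(i,j) \neq \emptyset,\\[6pt]
1, & G(i,j) = \emptyset,
\end{cases}
\qquad
D(i,j) = \frac{C(i,j)}{R(i,j)},
\qquad
(i^\ast, j^\ast) = \arg\max_{0 \le i < j \le m} D(i,j).
\]
```

The search area between lines i* and j* is then extracted as the body text.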

Continuing the above example, assume that the web page text to be analyzed has 30 lines in total and contains 6 intervals: interval A1 has length 3, interval A2 (a demarcation interval) has length 5, interval A3 (an internal interval) has length 1, interval A4 (an internal interval) has length 1, interval A5 (a demarcation interval) has length 6, and interval A6 has length 2. The average interval length is 3, so the filtered intervals are A2 (length 5) and A5 (length 6).

Taking the total number of lines, 30, as the exhaustive search range, exhaustively search for the number of characters and all the filtered intervals between the first line number i and the second line number j, and calculate the character density of each search area as the ratio of its number of characters to the sum of the lengths of its filtered intervals, where the first line number i is greater than or equal to 0, the first line number i is less than the second line number j, and the second line number j is less than or equal to the total number of lines, 30.

After the exhaustive search, the character densities of the search areas are compared. Only some of the search areas are used here for illustration. Assume that search area B1 runs from the start to interval A2 and contains 10 characters; search area B2 runs from interval A2 to interval A5 (excluding intervals A2 and A5 themselves) and contains 300 characters; and search area B3 runs from interval A5 to the end and contains 60 characters. Search area B4 runs from the start to interval A5, search area B5 runs from interval A2 to the end, and search area B6 runs from the start to the end. The filtered intervals are A2 (length 5) and A5 (length 6). Search area B2 contains no filtered interval, so its reference value is 1 and its character density is 300/1 = 300. The character density of search area B4 is 310/5 = 62, and the character density of search area B5 is 360/6 = 60. Because the body text region usually contains the most characters and its internal intervals are filtered out completely or largely, its reference value is usually small, so when the character density is calculated as the ratio of the number of characters to the reference value, the character density of the body text region is essentially the largest. Search area B2, which has the largest character density, is therefore extracted as the body text of the web page. Note that the specific number of intervals, interval lengths, number of lines and numbers of characters are merely examples for ease of understanding, and the invention is not limited thereto.
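The comparison in this example can be checked with a short sketch. The character counts for B1 to B5 come from the example; the count of 370 for B6 and the densities of B1, B3 and B6 are not stated in the text and are inferred here (370 = 10 + 300 + 60). The names `search_areas` and `character_density` are hypothetical.

```python
# Each entry: (number of characters, lengths of the filtered intervals inside).
search_areas = {
    "B1": (10, []),        # start to A2: no filtered interval inside
    "B2": (300, []),       # between A2 and A5: no filtered interval inside
    "B3": (60, []),        # A5 to end: no filtered interval inside
    "B4": (310, [5]),      # start to A5: contains A2
    "B5": (360, [6]),      # A2 to end: contains A5
    "B6": (370, [5, 6]),   # start to end: contains A2 and A5 (inferred count)
}

MINIMUM_REFERENCE = 1  # reference value when an area has no filtered interval


def character_density(char_count, filtered_lengths):
    """Ratio of characters to the summed lengths of the filtered intervals."""
    reference = sum(filtered_lengths) or MINIMUM_REFERENCE
    return char_count / reference


densities = {name: character_density(*area) for name, area in search_areas.items()}
body = max(densities, key=densities.get)
print(body, densities)
```

Under these assumptions B2 has the highest density, matching the conclusion of the example.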

According to the application, a reference interval length is calculated from all intervals of the web page text to be analyzed and is used to filter those intervals. Where the body text is bounded by two demarcation intervals, the intervals inside the body text are normally filtered out completely or largely, so the gaps between paragraphs are eliminated or reduced and the body text becomes more concentrated, while the demarcation intervals at the start and end positions of the body text are essentially retained because they are longer. In addition, other intervals as long as or longer than the demarcation intervals may exist outside the body text, so identifying the body text from the two demarcation intervals alone would reduce recognition accuracy to some extent. Determining the body text from the number of characters and the filtered intervals of each search area avoids this, improves recognition accuracy, and makes the method applicable to many types of web pages.
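Steps 101 to 104 can be combined into one minimal end-to-end sketch. It is an illustration under several assumptions, not a definitive implementation: the name `extract_body` is hypothetical, the reference interval length is the average, the minimum reference value is 1, a filtered interval counts with its full length toward a search area's reference value whenever it overlaps that area, and characters are counted after stripping surrounding whitespace. The brute-force loop is cubic in the number of lines and is kept only for clarity.

```python
from itertools import combinations


def extract_body(lines):
    """Return (i, j) such that lines[i:j] has the highest character density."""
    # Step 101: collect intervals as (start, length) runs of blank lines.
    intervals, start = [], None
    for k, line in enumerate(list(lines) + ["x"]):  # sentinel closes a trailing run
        if not line.strip():
            if start is None:
                start = k
        elif start is not None:
            intervals.append((start, k - start))
            start = None
    if not intervals:
        return 0, len(lines)  # assumption: no intervals, keep everything

    # Step 102: reference interval length (average of all interval lengths).
    reference_length = sum(n for _, n in intervals) / len(intervals)

    # Step 103: retain only intervals strictly longer than the reference.
    retained = [(s, n) for s, n in intervals if n > reference_length]

    # Step 104: exhaustive search over all areas lines[i:j], 0 <= i < j <= m.
    chars = [len(line.strip()) for line in lines]
    best, best_density = (0, len(lines)), -1.0
    for i, j in combinations(range(len(lines) + 1), 2):
        count = sum(chars[i:j])
        # Sum the lengths of retained intervals overlapping [i, j); minimum is 1.
        reference = sum(n for s, n in retained if s < j and s + n > i) or 1
        density = count / reference
        if density > best_density:
            best, best_density = (i, j), density
    return best


page = (["Home | About"] + [""] * 4
        + ["First paragraph of the article body.", "",
           "Second paragraph of the article body."]
        + [""] * 4 + ["Copyright"])
i, j = extract_body(page)
print(page[i:j])
```

In this made-up page the two long blank runs act as demarcation intervals and the single blank line between the paragraphs is filtered out, so the selected area spans both paragraphs.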

Referring to fig. 2, the embodiment of the application also discloses a device for identifying the text of the web page, which comprises an acquisition module 10, a calculation module 11, a filtering module 12 and a searching and determining module 13.

The acquisition module 10 is configured to acquire a web page text to be analyzed, where the web page text to be analyzed includes character lines and blank lines, a single blank line or a plurality of consecutive blank lines is regarded as an interval, and the number of blank lines in an interval is the length of that interval.

The calculating module 11 is configured to calculate a reference interval length according to all intervals of the text of the web page to be analyzed.

The filtering module 12 is configured to filter all intervals of the web page text to be analyzed by using the reference interval length, so as to reserve intervals with a length greater than the reference interval length.

The searching and determining module 13 is configured to exhaustively search the areas between any two lines for the number of characters and all the filtered intervals they contain, and to determine the body text of the web page from the number of characters and the filtered intervals of each search area.

Further, the reference interval length is an average interval length of all intervals of the web page text to be analyzed.

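Further, determining the body text of the web page from the number of characters and the filtered intervals of each search area includes: calculating the ratio of the number of characters of each search area to a reference value to obtain the character density of each search area; and extracting the search area with the maximum character density as the body text of the web page; wherein the reference value is obtained from all the filtered intervals of each search area, and takes its minimum when the number of filtered intervals of a search area is zero.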

Further, the reference value is the sum of the lengths of all intervals after filtering of each search area.

The specific description of the device for identifying the text of the web page is detailed in the above method for identifying the text of the web page, and will not be repeated here.

Referring to fig. 3, the embodiment of the application further discloses an electronic device, which includes:

a processor 21;

A memory 20 in which executable instructions of the processor 21 are stored;

Wherein the processor 21 is configured to perform the above-described method of identifying a body of a web page via execution of executable instructions.

The embodiment of the application also discloses a computer readable storage medium, which stores a computer program, and the computer program realizes the method for identifying the webpage text when being executed by a processor.

Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the above-mentioned web page text recognition method.

It should be appreciated that in embodiments of the present application, the processor may be a central processing unit (Central Processing Unit, CPU), or another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Those skilled in the art will appreciate that all or part of the processes in the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware. The program may be stored on a computer readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.

The foregoing disclosure is only illustrative of the preferred embodiments of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims

1. The method for identifying the text of the webpage is characterized by comprising the following steps of:

2. The method for recognizing text in a web page according to claim 1,

The reference interval length is the average interval length of all intervals of the webpage text to be analyzed.

3. The method for recognizing text in a web page according to claim 1,

And determining the text of the webpage according to the character number corresponding to each search area and all the filtered intervals, wherein the method comprises the following steps:

4. The method for recognizing text in a web page according to claim 3,

The reference value is the sum of the lengths of all intervals after filtering of each search area.

5. A web page text recognition device, comprising:

6. The web page text recognition apparatus of claim 5,

7. The web page text recognition apparatus of claim 5,

8. The web page text recognition apparatus of claim 7,

9. An electronic device, comprising:

A processor;

a memory having stored therein executable instructions of the processor;

Wherein the processor is configured to perform the method of identifying a web page body of any one of claims 1-4 via execution of the executable instructions.

10. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements a method of identifying a body of a web page as claimed in any one of claims 1-4.