CN113378088A - Webpage text extraction method, device, equipment and storage medium - Google Patents

Webpage text extraction method, device, equipment and storage medium

Info

Publication number
CN113378088A
Authority
CN
China
Prior art keywords
webpage
text
label
sliding window
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110707708.0A
Other languages
Chinese (zh)
Other versions
CN113378088B (en)
Inventor
刘旭东
张尼
薛继东
苏马婧
宋栋
刘红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
6th Research Institute of China Electronics Corp
Original Assignee
6th Research Institute of China Electronics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 6th Research Institute of China Electronics Corp
Priority to CN202110707708.0A
Publication of CN113378088A
Application granted
Publication of CN113378088B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application provides a webpage text extraction method, device, equipment and storage medium. The method comprises: extracting webpage body paragraphs from webpage content and adding the extracted paragraphs to a text file; calculating, for the content between each start label and each end label in the webpage content, the quotient of the number of punctuation marks and the number of characters, and taking the minimum quotient as a webpage text determination threshold; determining a label sliding window according to the start text information and the end text information in an extraction template; traversing the webpage content with the label sliding window while counting the punctuation marks and characters it contains; and extracting the webpage body paragraphs that meet the webpage text determination threshold. The beneficial effects of the application are that webpage body paragraphs can be accurately extracted from the webpage content according to the webpage text determination threshold, which improves extraction precision; redundancy among the extracted paragraphs is avoided; and the sliding window algorithm effectively improves extraction efficiency.

Description

Webpage text extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting a web page text.
Background
With the rapid development of mobile internet technology, and with big data, cloud computing and other emerging information technologies being applied ever more widely, information overload is increasing and information is becoming markedly more diverse. Technologies such as CSS, JS and TS make web page hierarchies richer, while different web page structures are increasingly diverse and individualized. Users often find it difficult to focus quickly on the core content of a web page, and the large amount of web page noise makes extracting the text of a web page more difficult and complicated. The body text is mainly concentrated in one area of the page, but the tags in that area are numerous and disordered and may contain many symbols, special characters and the like, which poses a great challenge to accurate extraction of the web page text.
Current webpage text extraction generally takes one of three forms. First, the original HTML code of the web page is requested and downloaded, the tags containing text paragraphs are parsed, and the text is extracted according to the meaning of the tags. This approach has technical defects: because the position of the body text and the HTML structure differ between websites, corresponding extraction rules cannot be formulated for every page. Second, the text is extracted based on tag density: exploiting the characteristic that some HTML tags in the body text have lower density, the number of characters in a tag is counted to judge whether it belongs to the body text. This approach still shows large deviations in practical application. Third, the text is extracted by visual and deep learning methods. This approach depends mainly on conditions such as the distinctive semantic features of the web page text and the amount of sample data, so it cannot be widely applied at scale and is difficult to generalize across various types of web page text.
Disclosure of Invention
In view of this, an embodiment of the present application provides a method for extracting a web page body. Web page content is obtained by extracting template tag pairs, so that the body HTML tags of websites with different structures can be unified, which improves general applicability. A web page text determination threshold is calculated, and web page body paragraphs can be accurately extracted from the web page content according to that threshold. The extracted paragraphs are added to a text file for deduplication, which effectively avoids redundancy among them. A sliding window algorithm slides linearly over the start text information array and the end text information array in the extraction template, which effectively improves extraction efficiency.
In a first aspect, an embodiment of the present application provides a method for extracting a text from a web page, where the method includes:
cleaning all noise tags and script code in the webpage source code by using regular expressions, and obtaining webpage content after cleaning;
acquiring an extraction template corresponding to the webpage content, wherein the extraction template comprises at least one piece of start text information and one piece of end text information;
traversing the start text information and the end text information in the extraction template one by one by using a recursive algorithm, extracting webpage body paragraphs from the webpage content according to the start text information and the end text information, and adding the extracted webpage body paragraphs to a text file;
calculating, for the webpage source code between each start label and each end label in the webpage content, the quotient of the number of punctuation marks and the number of characters, and taking the minimum quotient obtained as a webpage text determination threshold;
determining a label sliding window according to the start text information and the end text information in the extraction template by using a sliding window algorithm;
and traversing the webpage content with the label sliding window, and extracting the webpage text fields that meet the webpage text determination threshold according to the number of punctuation marks and the number of characters in the window.
In some embodiments, acquiring an extraction template corresponding to the webpage content, where the extraction template includes at least one piece of start text information and one piece of end text information, includes:
replacing, by means of a regular expression, each start label among the labels of the webpage content with the corresponding start text information of the extraction template;
and replacing, by means of a regular expression, each end label among the labels of the webpage content with the corresponding end text information of the extraction template.
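The replacement of start and end labels by template text can be illustrated with a minimal Python sketch. The marker strings `[[START]]` and `[[END]]`, the tag list in `BLOCK_TAGS`, and the function name `apply_template` are all assumptions made for illustration; the application does not specify which tags are replaced or what the template text looks like.

```python
import re

# Hypothetical template markers and tag set; neither is given in the application.
START_MARK = "[[START]]"
END_MARK = "[[END]]"
BLOCK_TAGS = ("p", "div", "article", "section")


def apply_template(content: str) -> str:
    """Replace opening/closing block tags with uniform start/end markers."""
    names = "|".join(BLOCK_TAGS)
    # Opening tags, with or without attributes, become the start marker.
    content = re.sub(rf"<(?:{names})\b[^>]*>", START_MARK, content, flags=re.IGNORECASE)
    # Closing tags become the end marker.
    content = re.sub(rf"</(?:{names})\s*>", END_MARK, content, flags=re.IGNORECASE)
    return content


print(apply_template('<div class="x"><p>Hi.</p></div>'))
```

After this step, pages that use different tags for their body text all expose the same pair of markers, which is the unification the embodiment describes.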
In some embodiments, traversing the start text information and the end text information in the extraction template one by one by using a recursive algorithm, extracting webpage body paragraphs from the webpage content according to the start text information and the end text information, and adding the extracted webpage body paragraphs to the text file, includes:
traversing the start text information and the end text information in the extraction template one by one by using a recursive algorithm, and marking the label position corresponding to each piece of start text information and each piece of end text information;
extracting from the webpage content, by means of a regular expression and according to the marked label positions, the webpage body paragraph corresponding to each pair of start text information and end text information, and adding the extracted webpage body paragraphs to a text file, wherein there are one or more webpage body paragraphs.
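As a rough illustration of these two sub-steps, the following Python sketch marks the positions of marker pairs recursively and then collects the text between them, skipping repeats before writing to a file. The function names, the marker strings passed in, and the use of plain string slicing in place of the regular expression mentioned above are assumptions of this sketch, not details from the application.

```python
from typing import List, Optional, Tuple


def mark_positions(content: str, start_mark: str, end_mark: str,
                   offset: int = 0,
                   found: Optional[List[Tuple[int, int]]] = None) -> List[Tuple[int, int]]:
    """Recursively record (begin, end) indices of the text between marker pairs."""
    if found is None:
        found = []
    s = content.find(start_mark, offset)
    if s == -1:
        return found
    begin = s + len(start_mark)
    e = content.find(end_mark, begin)
    if e == -1:
        return found
    found.append((begin, e))
    # Continue from just after this end marker (one recursion level per pair).
    return mark_positions(content, start_mark, end_mark, e + len(end_mark), found)


def extract_paragraphs(content: str, start_mark: str, end_mark: str) -> List[str]:
    """Slice out each marked paragraph, dropping empty and repeated ones."""
    seen = set()
    paragraphs = []
    for begin, end in mark_positions(content, start_mark, end_mark):
        text = content[begin:end].strip()
        if text and text not in seen:
            seen.add(text)
            paragraphs.append(text)
    return paragraphs


def append_to_text_file(paragraphs: List[str], path: str) -> None:
    """Append one paragraph per line to the text file."""
    with open(path, "a", encoding="utf-8") as fh:
        for paragraph in paragraphs:
            fh.write(paragraph + "\n")
```

Because the sketch recurses once per marker pair, very long pages could reach Python's default recursion limit; an iterative loop would avoid that at the cost of departing from the recursive formulation.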
In some embodiments, calculating the minimum quotient of the number of punctuation marks and the number of characters of the webpage source code between each start label and each end label in the webpage content, and taking the obtained minimum quotient as the webpage text determination threshold, includes:
calculating, for each start label and its corresponding end label in the webpage content, the quotient obtained by dividing the number of punctuation marks in the webpage source code between them by the number of characters;
and summing the three smallest quotient values, averaging them, and taking the obtained average as the webpage text determination threshold.
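The threshold calculation can be sketched in a few lines of Python. The punctuation set, the decision to count every non-whitespace character (punctuation included) as a character, and the function names are assumptions of this sketch; the application does not define them.

```python
import string
from typing import Iterable

# Assumed punctuation set: ASCII punctuation plus some common CJK marks.
PUNCTUATION = set(string.punctuation) | set(",。、;:?!“”‘’()《》")


def punct_ratio(text: str) -> float:
    """Number of punctuation marks divided by number of non-whitespace characters."""
    chars = sum(1 for c in text if not c.isspace())
    if chars == 0:
        return 0.0
    puncts = sum(1 for c in text if c in PUNCTUATION)
    return puncts / chars


def body_threshold(blocks: Iterable[str]) -> float:
    """Average of the three smallest quotients over all start/end label blocks."""
    quotients = sorted(punct_ratio(b) for b in blocks if b.strip())
    smallest = quotients[:3]
    if not smallest:
        return 0.0
    return sum(smallest) / len(smallest)


blocks = ["ab,.", "abc,", "abcdefgh,.", "abcdefghi,"]
print(body_threshold(blocks))
```

For the four sample blocks the quotients are 0.5, 0.25, 0.2 and 0.1, so the threshold is the mean of the three smallest, (0.1 + 0.2 + 0.25) / 3.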
In some embodiments, determining a label sliding window according to the start text information and the end text information in the extraction template by using a sliding window algorithm includes:
acquiring, in a double-pointer manner, the pointer position of the start text information in the extraction template, and determining that pointer position as a first boundary of the label sliding window;
starting from the first boundary of the label sliding window, continuously advancing the pointer to the pointer position of the end text information in the extraction template;
and determining a second boundary of the label sliding window according to the pointer position of the end text information, that is, taking the index interval of the pointers as the label sliding window, wherein the index interval indicates the distance between the first boundary and the second boundary of the label sliding window.
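A minimal Python sketch of the double-pointer step is given below, assuming the start and end text information appear in the content as literal marker strings. The name `find_label_window` and the convention that the returned interval includes both markers are choices of this sketch rather than of the application.

```python
from typing import Optional, Tuple


def find_label_window(content: str, start_marker: str, end_marker: str,
                      search_from: int = 0) -> Optional[Tuple[int, int]]:
    """Return (first_boundary, second_boundary) of the window, or None if no pair exists."""
    # First pointer: position of the start text information.
    first = content.find(start_marker, search_from)
    if first == -1:
        return None
    # Second pointer: advance one index at a time until the end text information is reached.
    second = first + len(start_marker)
    while second < len(content) and not content.startswith(end_marker, second):
        second += 1
    if second >= len(content):
        return None
    return first, second + len(end_marker)


page = "nav[[START]]Body text.[[END]]footer"
window = find_label_window(page, "[[START]]", "[[END]]")
if window is not None:
    first, second = window
    print(page[first:second])
```

The difference `second - first` is the index interval that the embodiment treats as the label sliding window.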
In some embodiments, extracting the webpage text fields that meet the webpage text determination threshold according to the number of punctuation marks and the number of characters in the webpage content traversed by the label sliding window includes:
starting from the first boundary of the label sliding window, continuously advancing the pointer toward the pointer position of the end text information in the extraction template;
while expanding the label sliding window, calculating the ratio of the number of punctuation marks to the number of characters in the webpage content inside the window, and stopping the expansion when the ratio inside the label sliding window is less than or equal to the webpage text determination threshold;
moving the pointer to shrink the label sliding window, and stopping the shrinking when the ratio of the number of punctuation marks to the number of characters inside the label sliding window is greater than the webpage text determination threshold;
and extracting from the webpage content, according to the label sliding window, the webpage text field that meets the webpage text determination threshold.
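The expand-then-shrink procedure leaves some details open, such as the unit by which the window grows and which pointer moves when it shrinks. The Python sketch below is one possible reading, not the application's definitive procedure: it assumes the content between the boundaries has already been split into segments (for example lines), grows the window one segment at a time, and shrinks it from its trailing edge. The punctuation set and all names are likewise assumptions.

```python
import string
from typing import List

# Assumed punctuation set: ASCII punctuation plus some common CJK marks.
PUNCTUATION = set(string.punctuation) | set(",。、;:?!“”‘’()《》")


def punct_ratio(text: str) -> float:
    """Number of punctuation marks divided by number of non-whitespace characters."""
    chars = sum(1 for c in text if not c.isspace())
    if chars == 0:
        return 0.0
    return sum(1 for c in text if c in PUNCTUATION) / chars


def extract_window(segments: List[str], threshold: float) -> List[str]:
    """Grow the window until its ratio is <= threshold, then shrink until it is > threshold."""
    left, right = 0, 0
    # Expansion: add one segment at a time; stop once the ratio drops to the threshold.
    while right < len(segments):
        right += 1
        if punct_ratio("".join(segments[left:right])) <= threshold:
            break
    # Shrinking (assumed to move the trailing pointer back): stop once the ratio exceeds the threshold.
    while right > left and punct_ratio("".join(segments[left:right])) <= threshold:
        right -= 1
    return segments[left:right]


segments = [
    "Hello, world. This is text, really.",
    "More text, with commas, and stops.",
    "Home About Contact Login Register Search Archive Sitemap",
]
print(extract_window(segments, threshold=0.08))
```

With this sample input the window grows over all three segments, at which point the punctuation-free navigation line pulls the ratio below 0.08; shrinking by one segment restores a ratio above the threshold, so the two sentence-like segments are returned.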
In a second aspect, an embodiment of the present application provides an apparatus for extracting text from a web page, where the apparatus includes:
a cleaning module, configured to clean all noise tags and script code in the webpage source code by using regular expressions and to obtain webpage content after cleaning;
an extraction template module, configured to acquire an extraction template corresponding to the webpage content, the extraction template comprising at least one piece of start text information and one piece of end text information;
a traversal module, configured to traverse the start text information and the end text information in the extraction template one by one by using a recursive algorithm, extract webpage body paragraphs from the webpage content according to the start text information and the end text information, and add the extracted webpage body paragraphs to a text file;
a threshold calculation module, configured to calculate the quotient of the number of punctuation marks and the number of characters of the webpage source code between each start label and each end label in the webpage content, the obtained quotient being used as a webpage text determination threshold;
a window determining module, configured to determine a label sliding window according to the start text information and the end text information in the extraction template by using a sliding window algorithm;
and a webpage text extraction module, configured to extract the webpage text fields that meet the webpage text determination threshold according to the number of characters in the webpage content traversed by the label sliding window.
In some embodiments, the web page text extraction module comprises:
a first extraction unit, configured to continuously advance the pointer from the first boundary of the label sliding window toward the pointer position of the end text information in the extraction template;
a second extraction unit, configured to calculate, while the label sliding window is being expanded, the ratio of the number of punctuation marks to the number of characters in the webpage content inside the window, and to stop expanding the label sliding window when the ratio inside it is less than or equal to the webpage text determination threshold;
a third extraction unit, configured to move the pointer to shrink the label sliding window, and to stop shrinking it when the ratio of the number of punctuation marks to the number of characters inside the label sliding window is greater than the webpage text determination threshold;
and a fourth extraction unit, configured to extract from the webpage content, according to the label sliding window, the webpage text field that meets the webpage text determination threshold.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the web page text extraction method according to any one of claims 1 to 6 when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the web page text extraction method.
The beneficial effects of this application are mainly as follows. The webpage content is obtained by extracting template tag pairs, which unifies the body HTML tags of websites with different structures and improves the general applicability of webpage text extraction. The minimum quotient of the number of punctuation marks and the number of characters of the webpage source code between each start label and each end label in the webpage content is calculated and taken as the webpage text determination threshold, and webpage body paragraphs are extracted according to that threshold, so that they can be extracted accurately. The extracted webpage body paragraphs are added to a text file for deduplication, which effectively avoids redundancy among the extracted paragraphs. The sliding window algorithm slides linearly over the start text information and end text information arrays in the extraction template, which effectively improves extraction efficiency.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from them without inventive effort.
Fig. 1 shows a schematic flow chart of a webpage text extraction method provided by the embodiment of the application.
Fig. 2 is a flow chart illustrating a process of extracting a text paragraph of a web page according to an embodiment of the present application.
Fig. 3 is a schematic flowchart illustrating a process of calculating a web page text determination threshold according to an embodiment of the present application.
Fig. 4 shows a schematic flow chart of determining a label sliding window provided in an embodiment of the present application.
Fig. 5 is a flowchart illustrating a process of extracting a web page text field meeting a web page text determination threshold according to an embodiment of the present application.
Fig. 6 shows a schematic structural diagram of a web page text extraction apparatus according to an embodiment of the present application.
Fig. 7 shows a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations and, thus, the following detailed description of the embodiments of the present application, which is provided in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Current webpage text extraction generally takes one of three forms. First, the original HTML code of the web page is requested and downloaded, the tags containing text paragraphs are parsed, and the text is extracted according to the meaning of the tags. This approach has technical defects: because the position of the body text and the HTML structure differ between websites, corresponding extraction rules cannot be formulated for all web pages. Second, the text is extracted based on tag density: exploiting the characteristic that some HTML tags in the body text have lower density, the number of characters in a tag is counted to judge whether it belongs to the body text. This approach still shows large deviations in practical application. Third, the text is extracted by visual and deep learning methods. This approach depends mainly on conditions such as the distinctive semantic features of the web page text and the amount of sample data, so it cannot be widely applied at scale and is difficult to generalize across various types of web page text.
In the prior art, these approaches can solve the problem of extracting body content from different websites to a certain extent. In one approach, each HTML web page is treated as an individual file; the number of characters is divided by the number of tags line by line, the result is presented as a histogram, and the histogram is finally clustered to distinguish the body text from noise regions. In another, the structural features of the web page's tree are analyzed, and a text density feature is measured by dividing the number of all text characters in a tree node by the number of tags in the subtree rooted at that node. In the third approach, the DOM tree is segmented, the densities of text and punctuation are studied together, and the web page text is extracted based on text feature values. However, these methods consider only the coverage of the web page's body content and ignore redundancy in the body text, and they achieve good results only on some websites.
In addition, in the third approach, the performance of extraction by segmenting the web page DOM tree and studying text and punctuation density depends greatly on the quality of the web page's body text. If the text contained in different paragraphs of the page is repeated, the extracted result still contains repeated paragraphs; text redundancy is not considered.
In view of these defects in the prior art, the present application cleans all noise tags and script code in the webpage source code by using regular expressions and obtains webpage content after cleaning; acquires an extraction template corresponding to the webpage content, the extraction template comprising at least one piece of start text information and one piece of end text information; traverses the start text information and the end text information in the extraction template one by one by using a recursive algorithm, extracts webpage body paragraphs from the webpage content according to the start text information and the end text information, and adds the extracted paragraphs to a text file; calculates, for the webpage source code between each start label and each end label in the webpage content, the quotient of the number of punctuation marks and the number of characters, and takes the minimum quotient obtained as a webpage text determination threshold; determines a label sliding window according to the start text information and the end text information in the extraction template by using a sliding window algorithm; and extracts the webpage text fields that meet the webpage text determination threshold according to the number of punctuation marks and the number of characters in the webpage content traversed by the label sliding window.
Specifically, the start labels and end labels in the webpage content are replaced by the start text information and end text information in the extraction template. Obtaining the webpage content by extracting template tag pairs unifies the body HTML tags of websites with different structures and so improves the general applicability of webpage text extraction. The minimum quotient of the number of punctuation marks and the number of characters of the webpage source code between each start label and each end label is taken as the webpage text determination threshold, and webpage body paragraphs are extracted from the webpage content according to that threshold, so that they can be extracted accurately. The extracted paragraphs are added to a text file for deduplication, which effectively avoids redundancy among them. Through the sliding window algorithm, the ratio of the number of punctuation marks to the number of characters in the webpage content is calculated and checked against the webpage text determination threshold; if the ratio meets the threshold, webpage body paragraphs are extracted from the webpage content.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Fig. 1 is a schematic flowchart illustrating a method for extracting web page text according to an embodiment of the present application. As shown in fig. 1, the method specifically includes the following steps:
Step S10, cleaning all noise tags and script code in the webpage source code by using regular expressions, and obtaining webpage content after cleaning.
In the specific implementation of step S10, regular expressions are used to clean all JavaScript script tags, style tags and script code in the webpage source code, so as to obtain the webpage content.
Step S20, acquiring an extraction template corresponding to the webpage content, where the extraction template includes at least one piece of start text information and one piece of end text information.
In the specific implementation of step S20, a template extraction manner is adopted to acquire the extraction template corresponding to the webpage content, where the extraction template includes at least one piece of start text information and one piece of end text information, and each piece of start text information corresponds to one piece of end text information.
Step S30, traversing the start text information and the end text information in the extraction template one by one by using a recursive algorithm, extracting webpage body paragraphs from the webpage content according to the start text information and the end text information, and adding the extracted webpage body paragraphs to the text file.
In the specific implementation of step S30, a recursive algorithm is used to traverse the start text information and the end text information in the extraction template and to mark their positions. Each marked position is taken as the starting point from which the traversal continues in sequence; multiple webpage body paragraphs are extracted from the webpage content according to the marked positions, and the extracted paragraphs are added to the text file.
Step S40, calculating the minimum quotient of the number of punctuation marks and the number of characters of the webpage source code between each start label and each end label in the webpage content, and taking the obtained minimum quotient as the webpage text determination threshold.
In the specific implementation of step S40, the quotient of the number of punctuation marks and the number of characters is calculated for the webpage source code between each start label and each end label in the webpage content; the three smallest quotients are summed and averaged, and the average is used as the webpage text determination threshold.
Step S50, determining a label sliding window according to the start text information and the end text information in the extraction template by using a sliding window algorithm.
In the specific implementation of step S50, the double-pointer manner of the sliding window algorithm is used. A first boundary of the label sliding window is determined according to the pointer position of the start text information in the extraction template. Starting from that first boundary, the pointer is continuously advanced to the pointer position of the end text information corresponding to the start text information, which is determined as the second boundary of the label sliding window; that is, the index interval of the pointers is used as the label sliding window.
Step S60, extracting the webpage text fields that meet the webpage text determination threshold according to the number of punctuation marks and the number of characters in the webpage content traversed by the label sliding window.
In the specific implementation of step S60, the ratio of the number of punctuation marks to the number of characters in the webpage content inside the label sliding window is calculated as the window traverses the webpage content, and it is determined whether the ratio meets the webpage text determination threshold; if it does, the webpage text field is extracted from the webpage content.
In a feasible implementation, in step S10, cleaning all noise tags and script code in the webpage source code by using regular expressions to obtain the webpage content includes:
Step 101, cleaning all JavaScript script tags, style tags and script code in the webpage source code by using regular expressions, and obtaining webpage content after cleaning.
In the specific implementation of step 101, the regular expression < script \ s (\\ n |)? is used to clean the JavaScript script tags in the webpage source code, and a regular expression is likewise used to clean the style tags in the webpage source code.
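As a hedged illustration of this cleaning step, the Python sketch below removes script and style elements with regular expressions. The two patterns and the function name `clean_source` are assumptions of this sketch and are not the expressions used in the application.

```python
import re

# Illustrative patterns only: match a whole element from its opening tag to its closing tag.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)


def clean_source(html: str) -> str:
    """Strip script and style elements, including their contents, from page source."""
    without_scripts = SCRIPT_RE.sub("", html)
    return STYLE_RE.sub("", without_scripts)


source = '<p>a</p><script type="text/javascript">var x = 1;</script><style>p{}</style><p>b</p>'
print(clean_source(source))
```

Regular expressions are adequate for this kind of coarse removal, although they do not handle every malformed page; a full HTML parser would be more robust at the cost of speed.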
In a possible implementation, in step S20, obtaining the extraction template corresponding to the web page content, where the extraction template includes at least one piece of start text information and one piece of end text information, includes:
Step 201, replacing each start label among the labels of the web page content with the corresponding start text information of the extraction template through a regular expression.
In the specific implementation of step 201, each start label among the labels of the web page content is replaced with start text information of the extraction template by using a regular expression, and each piece of start text information has a corresponding piece of end text information.
Step 202, replacing each end label among the labels of the web page content with the corresponding end text information of the extraction template through a regular expression.
In the specific implementation of step 202, each end label among the labels of the web page content is replaced with end text information of the extraction template by using a regular expression, and each piece of end text information has a corresponding piece of start text information.
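Steps 201 and 202 can be pictured with the Python sketch below. The text does not say what the start and end text information look like, so the marker strings `[[START:name]]` and `[[END:name]]`, the two patterns, and the function name `build_template` are all hypothetical choices made for this example.

```python
import re

# Assumed patterns. Self-closing tags such as <br/> are deliberately left
# alone, and attribute values containing ">" are not handled.
START_TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*?(?<!/)>")
END_TAG_RE = re.compile(r"</([A-Za-z][A-Za-z0-9]*)\s*>")


def build_template(content: str) -> str:
    """Replace start tags and end tags with paired text markers."""
    # Step 201: each start tag becomes a start marker.
    content = START_TAG_RE.sub(lambda m: f"[[START:{m.group(1).lower()}]]", content)
    # Step 202: each end tag becomes the matching end marker.
    content = END_TAG_RE.sub(lambda m: f"[[END:{m.group(1).lower()}]]", content)
    return content
```

Void elements written without a slash (for example `<br>`) would receive a start marker with no end marker in this sketch; a fuller version would need to treat them separately.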
In a possible implementation, fig. 2 shows a flowchart of extracting web page body paragraphs provided by an embodiment of the present application. In step S30, traversing the start text information and the end text information in the extraction template one by one by using a recursive algorithm, extracting the web page body paragraphs from the web page content according to the start text information and the end text information, and adding the extracted web page body paragraphs to the text file specifically includes the following steps:
Step S301, traversing the start text information and the end text information in the extraction template one by one by applying a recursive algorithm, and marking the label position corresponding to each piece of start text information and each piece of end text information.
Step S302, extracting, through a regular expression and according to the marked label positions, the web page body paragraph between each piece of start text information and its corresponding end text information from the web page content, and adding the extracted web page body paragraphs to the text file, where there are one or more web page body paragraphs.
When steps S301 and S302 are specifically implemented, a recursive algorithm is applied to traverse, one by one from the outermost layer of the extraction template, each label pair formed by a piece of start text information and its corresponding end text information, and the start text information and end text information of each label pair are marked in sequence by label position. Through a regular expression, the web page body paragraph between the start text information and the end text information of each label pair is then extracted from the web page content according to the marked positions and added to the text file; there are one or more web page body paragraphs.
For example: a recursive algorithm is applied to traverse, one by one from the outermost layer of the extraction template, the label position corresponding to each piece of start text information. The position of the first such label is recorded as P1. Taking P1 as the starting point, the nearest label position of the end text information corresponding to that start text information is searched for; through a regular expression, the web page body paragraph between the start text information and the end text information of the label pair is extracted from the web page content according to the marked positions and added to the text file. The label position corresponding to the next piece of start text information is then traversed and recorded as P2, and the process repeats: the nearest label position of the corresponding end text information is searched for, and the web page body paragraph of the label pair is extracted and added to the text file. There are one or more web page body paragraphs.
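The recursive traversal of steps S301 and S302 might look roughly like the following Python sketch. It assumes a hypothetical marker format, `[[START:name]]` and `[[END:name]]`, for the start and end text information, and it assumes the markers are well nested; neither detail is specified in the text. It returns the paragraphs as a list, and writing them to a text file is left to the caller.

```python
import re
from typing import List, Optional

# Hypothetical marker format for start/end text information.
MARKER_RE = re.compile(r"\[\[(START|END):(\w+)\]\]")


def extract_paragraphs(template: str) -> List[str]:
    """Collect the text enclosed by each start/end marker pair."""
    paragraphs: List[str] = []

    def walk(pos: int, open_name: Optional[str]) -> int:
        # Consume the template from pos up to the end marker of open_name
        # and return the position just after that end marker.
        own_text: List[str] = []
        while True:
            m = MARKER_RE.search(template, pos)
            if m is None:
                own_text.append(template[pos:])
                pos = len(template)
                break
            own_text.append(template[pos:m.start()])
            pos = m.end()
            if m.group(1) == "START":
                pos = walk(pos, m.group(2))  # recurse into the nested pair
            elif m.group(2) == open_name:
                break  # nearest matching end marker found
            # any other end marker is treated as stray and skipped
        text = "".join(own_text).strip()
        if text:
            paragraphs.append(text)
        return pos

    walk(0, None)
    return paragraphs
```

Because the recursion finishes inner pairs first, text that belongs directly to an enclosing pair is emitted after the paragraphs nested inside it. Very deeply nested templates could also hit Python's recursion limit.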
In a possible implementation, fig. 3 shows a schematic flowchart of calculating the web page text determination threshold provided in an embodiment of the present application. In step S40, calculating the minimum quotient of the number of punctuation marks of the web page source code between each start tag and each end tag in the web page content and the number of contained characters, and using the obtained minimum quotient as the web page text determination threshold, specifically includes the following steps:
Step S401, respectively calculating the quotient of the number of punctuation marks of the web page source code between each start tag and its corresponding end tag in the web page content divided by the number of characters.
Step S402, summing the three smallest quotient values and averaging them, and using the resulting average as the web page text determination threshold.
In the specific implementation of steps S401 and S402, the number of punctuation marks and the number of characters of the web page source code between each start tag and its corresponding end tag in the web page content are obtained, and the quotient of the number of punctuation marks divided by the number of characters is calculated. The smaller the quotient, the greater the text density between the start tag and its corresponding end tag. The three smallest quotient values are summed and averaged, and the resulting average is used as the web page text determination threshold.
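The threshold calculation of steps S401 and S402 can be sketched as follows. The input is assumed to be a list of strings, one per start/end tag pair; the punctuation set and the names `quotient` and `text_threshold` are assumptions for this example.

```python
import heapq
import string
from typing import List

# Assumption: ASCII punctuation plus a few common full-width marks.
PUNCTUATION = set(string.punctuation) | set(",。!?;:、")


def quotient(segment: str) -> float:
    """S401: punctuation count divided by character count for one segment."""
    return sum(ch in PUNCTUATION for ch in segment) / len(segment)


def text_threshold(segments: List[str], k: int = 3) -> float:
    """S402: average of the k smallest quotients (k = 3 in the text)."""
    quotients = [quotient(s) for s in segments if s]  # skip empty segments
    if not quotients:
        raise ValueError("no non-empty segments to compute a threshold from")
    smallest = heapq.nsmallest(k, quotients)
    return sum(smallest) / len(smallest)
```

If fewer than three segments are available, this sketch averages whatever quotients exist; the text does not say how that case is handled.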
In one possible implementation, fig. 4 shows a schematic flowchart of determining a label sliding window provided by an embodiment of the present application. In step S50, determining the label sliding window according to the start text information and the end text information in the extraction template by using a sliding window algorithm specifically includes the following steps:
Step S501, obtaining the pointer position of the start text information in the extraction template in a double-pointer manner, and determining the pointer position of the start text information as the first boundary of the label sliding window.
Step S502, continuously expanding the pointer backward from the first boundary of the label sliding window until it reaches the pointer position of the end text information in the extraction template.
Step S503, determining the second boundary of the label sliding window according to the pointer position of the end text information; that is, the index interval of the pointer is used as the label sliding window, where the index interval of the pointer indicates the distance between the first boundary and the second boundary of the label sliding window.
When steps S501, S502, and S503 are specifically implemented, a double-pointer manner is adopted on the extraction template. The pointer position of the first start text information in the extraction template is used as the index of the initialization pointer; the initial index interval has a left index equal to its right index. The left index of the initialization pointer is determined as the first boundary of the label sliding window. From that first boundary, the pointer is continuously expanded backward to the pointer position of the end text information in the extraction template, and that position is used as the right index, which is determined as the second boundary of the label sliding window. The first boundary and the second boundary of the label sliding window are thus given by the index interval of the pointer.
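A minimal Python sketch of the double-pointer procedure in steps S501 to S503 is given below. The marker strings passed in (for example `[[START:p]]` and `[[END:p]]`) are hypothetical stand-ins for the start and end text information, and the sketch ignores nesting of same-named pairs; both are assumptions made for illustration.

```python
from typing import Optional, Tuple


def label_window(template: str, start_marker: str, end_marker: str,
                 begin: int = 0) -> Optional[Tuple[int, int]]:
    """Return (first boundary, second boundary) of the window, or None."""
    # S501: the start marker position is the first boundary.
    left = template.find(start_marker, begin)
    if left == -1:
        return None
    right = left  # the window starts with left index == right index
    # S502: expand the right pointer until it sits on the end marker.
    while right < len(template) and not template.startswith(end_marker, right):
        right += 1
    if right == len(template):
        return None  # no end marker after the start marker
    # S503: the index interval [left, right] is the label sliding window.
    return left, right
```

Here the second boundary is the index at which the end marker begins; whether the marker itself belongs to the window is a design choice the text leaves open.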
In a possible implementation, fig. 5 is a schematic diagram illustrating a process of extracting a web page text field meeting a web page text determination threshold according to an embodiment of the present application; in the step S60, extracting the web page text field meeting the web page text determination threshold according to the number of the punctuations and the number of the characters included in the web page content traversed by the label sliding window, specifically includes the following steps:
step S601, according to the first boundary of the label sliding window, continuously expanding the pointer to the pointer position of the ending text information in the extracted template.
In the specific implementation of step S601, the first boundary of the label sliding window is taken as the start position, the pointer is continuously expanded backward from the start position to the pointer position of the end text information in the extraction template, and the pointer position of the end text information is taken as the end position for extracting the web page content.
Step S602, while expanding the label sliding window, calculating the ratio of the number of the punctuation marks in the webpage content to the number of the contained characters until the ratio in the label sliding window is less than or equal to the webpage text judgment threshold, and stopping expanding the label sliding window.
In the specific implementation of step S602, while the label sliding window is expanded, the ratio of the number of the punctuation marks in the webpage content to the number of the contained characters in the label sliding window is calculated, whether the ratio of the number of the punctuation marks in the webpage content to the number of the contained characters is less than or equal to the webpage text determination threshold is determined, and if the ratio is less than or equal to the webpage text determination threshold, the pointer is stopped from expanding the label sliding window, so that the second boundary of the label sliding window is obtained.
Step S603, moving the pointer to shrink the label sliding window until the ratio of the number of punctuation marks to the number of contained characters in the label sliding window is greater than the web page text determination threshold, and then stopping shrinking the label sliding window.
In the specific implementation of step S603, the pointer at the first boundary of the label sliding window is moved. While the label sliding window shrinks, the ratio of the number of punctuation marks to the number of contained characters in the web page content within the label sliding window is calculated, and it is determined whether the ratio is greater than the web page text determination threshold; if so, the pointer stops shrinking the label sliding window.
Step S604, according to the label sliding window, extracting the webpage text field which accords with the webpage text determination threshold value from the webpage content.
In the specific implementation of step S604, steps S602 and S603 are repeated with the label sliding window, and the web page text fields that meet the web page text determination threshold are extracted from the text file according to the start text information and the end text information of the web page content; there are one or more web page text fields.
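One possible reading of steps S602 to S604 is the classic expand-then-shrink sliding window, sketched below in Python over a list of paragraphs. Several points are assumptions rather than statements from the text: the window moves over whole paragraphs, the shrinking stops just before the ratio would exceed the threshold (or the window would become empty), the window that remains is emitted as one text field, and the next round starts after it. The punctuation set and the names `count_punct` and `extract_fields` are also invented for this example.

```python
import string
from typing import List

# Assumption: ASCII punctuation plus a few common full-width marks.
PUNCTUATION = set(string.punctuation) | set(",。!?;:、")


def count_punct(text: str) -> int:
    return sum(ch in PUNCTUATION for ch in text)


def extract_fields(paragraphs: List[str], threshold: float) -> List[str]:
    fields: List[str] = []
    n = len(paragraphs)
    left = 0
    while left < n:
        punct = chars = 0
        right = left  # window is paragraphs[left:right]
        met = False
        # S602: expand to the right until the ratio is <= threshold.
        while right < n:
            punct += count_punct(paragraphs[right])
            chars += len(paragraphs[right])
            right += 1
            if chars and punct / chars <= threshold:
                met = True
                break
        if not met:
            break  # the remaining content never reaches the threshold
        # S603: shrink from the left, stopping before the ratio would
        # exceed the threshold or the window would become empty.
        while right - left > 1:
            p = count_punct(paragraphs[left])
            c = len(paragraphs[left])
            if chars - c == 0 or (punct - p) / (chars - c) > threshold:
                break
            punct -= p
            chars -= c
            left += 1
        # S604: emit the window and repeat after it.
        fields.append("\n".join(paragraphs[left:right]))
        left = right
    return fields
```

Under this reading the shrink step mainly discards a punctuation-heavy prefix, such as a navigation bar, that the expansion had to pass over. It always shrinks to the smallest qualifying window, so it can also drop earlier body paragraphs; the text does not say how that case should be treated.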
Fig. 6 is a schematic structural diagram of a web page text extraction apparatus according to an embodiment of the present application, and as shown in fig. 6, the apparatus includes:
the cleaning module 701 is used for cleaning all noise labels and script codes in the webpage source codes by using a regular expression, and obtaining webpage contents after cleaning;
an extraction template module 702, configured to obtain an extraction template corresponding to web page content, where the extraction template includes at least one start text message and one end text message;
a traversal module 703, configured to traverse the start text information and the end text information in the extraction template one by using a recursive algorithm, extract a web page body paragraph from the web page content according to the start text information and the end text information, and add the extracted web page body paragraph to the text file;
a threshold calculation module 704, configured to calculate a quotient between the number of punctuation marks of the web source code and the number of characters included in the web source code between each start tag and each end tag in the web content, where the obtained quotient is used as a web text determination threshold;
a window determining module 705, configured to determine a label sliding window according to the start text information and the end text information in the extracted template by using a sliding window algorithm;
and the web page text extraction module 706 is configured to extract a web page text field meeting a web page text determination threshold according to the number of characters in the web page content traversed by the label sliding window.
In one possible implementation, the window determining module 705 includes:
a first determining unit, configured to obtain the pointer position of the start text information in the extraction template in a double-pointer manner, and determine the pointer position of the start text information as the first boundary of the label sliding window;
a second determining unit, configured to continuously expand the pointer backward from the first boundary of the label sliding window to the pointer position of the end text information in the extraction template;
a third determining unit, configured to determine the second boundary of the label sliding window according to the pointer position of the end text information, that is, to use the index interval of the pointer as the label sliding window, where the index interval of the pointer indicates the distance between the first boundary and the second boundary of the label sliding window.
In one possible implementation, the web page text extraction module 706 includes:
a first extraction unit, configured to continuously expand the pointer backward from the first boundary of the label sliding window to the pointer position of the end text information in the extraction template;
a second extraction unit, configured to calculate, while expanding the label sliding window, the ratio of the number of punctuation marks to the number of contained characters in the web page content, and to stop expanding the label sliding window once the ratio in the label sliding window is less than or equal to the web page text determination threshold;
a third extraction unit, configured to move the pointer to shrink the label sliding window, and to stop shrinking the label sliding window once the ratio of the number of punctuation marks to the number of contained characters in the label sliding window is greater than the web page text determination threshold;
a fourth extraction unit, configured to extract, according to the label sliding window, the web page text field that meets the web page text determination threshold from the web page content.
The apparatus provided in the embodiments of the present application may be specific hardware on a device, or software or firmware installed on a device. The apparatus has the same implementation principle and technical effect as the foregoing method embodiments; for brevity, where the apparatus embodiments do not mention something, reference may be made to the corresponding content in the foregoing method embodiments. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments and are not described here again.
Corresponding to the web page text extraction method in fig. 1, an embodiment of the present application further provides a computer device 80. As shown in fig. 7, the device includes a memory 801, a processor 802, and a computer program stored on the memory 801 and executable on the processor 802, where the processor 802 implements the following steps of the web page text extraction method when executing the computer program:
cleaning all noise labels and script codes in the webpage source codes by using a regular expression, and obtaining webpage contents after cleaning;
acquiring an extraction template corresponding to the webpage content, wherein the extraction template comprises at least one initial text message and one end text message;
traversing the initial text information and the ending text information in the extraction template one by using a recursive algorithm, extracting a webpage body paragraph from the webpage content according to the initial text information and the ending text information, and adding the extracted webpage body paragraph into a text file;
calculating the minimum quotient value of the number of the punctuations of the webpage source code between each starting label and each ending label in the webpage content and the number of the characters, and taking the obtained minimum quotient value as a webpage text judgment threshold value;
determining a label sliding window according to the initial text information and the ending text information in the extracted template by adopting a sliding window algorithm;
and extracting the webpage text fields meeting the webpage text determination threshold according to the number of the punctuations and the number of the contained characters in the label sliding window traversing the webpage content.
Corresponding to the web page text extraction method in fig. 1, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the following steps:
cleaning all noise labels and script codes in the webpage source codes by using a regular expression, and obtaining webpage contents after cleaning;
acquiring an extraction template corresponding to the webpage content, wherein the extraction template comprises at least one initial text message and one end text message;
traversing the initial text information and the ending text information in the extraction template one by using a recursive algorithm, extracting a webpage body paragraph from the webpage content according to the initial text information and the ending text information, and adding the extracted webpage body paragraph into a text file;
calculating the minimum quotient value of the number of the punctuations of the webpage source code between each starting label and each ending label in the webpage content and the number of the characters, and taking the obtained minimum quotient value as a webpage text judgment threshold value;
determining a label sliding window according to the initial text information and the ending text information in the extracted template by adopting a sliding window algorithm;
and extracting the webpage text fields meeting the webpage text determination threshold according to the number of the punctuations and the number of the contained characters in the label sliding window traversing the webpage content.
In the embodiments of the present application, when being executed by a processor, the computer program may further execute other machine-readable instructions to perform other methods described in the present application, and for specific implementation steps and principles, reference is made to the above description, which is not repeated herein in detail.
In the embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A webpage text extraction method is characterized by comprising the following steps:
cleaning all noise labels and script codes in the webpage source codes by using a regular expression, and obtaining webpage contents after cleaning;
acquiring an extraction template corresponding to the webpage content, wherein the extraction template comprises at least one initial text message and one end text message;
traversing the initial text information and the ending text information in the extraction template one by using a recursive algorithm, extracting a webpage body paragraph from the webpage content according to the initial text information and the ending text information, and adding the extracted webpage body paragraph into a text file;
calculating the minimum quotient value of the number of the punctuations of the webpage source code between each starting label and each ending label in the webpage content and the number of the characters, and taking the obtained minimum quotient value as a webpage text judgment threshold value;
determining a label sliding window according to the initial text information and the ending text information in the extracted template by adopting a sliding window algorithm;
and extracting the webpage text fields meeting the webpage text determination threshold according to the number of the punctuations and the number of the contained characters in the label sliding window traversing the webpage content.
2. The method for extracting web page text according to claim 1, wherein obtaining an extraction template corresponding to the web page content, the extraction template including at least a start text message and an end text message, comprises:
replacing each initial label in the labels of the webpage content with initial text information corresponding to the extracted template through a regular expression;
and replacing each end label in the labels of the webpage content with end text information corresponding to the extracted template through a regular expression.
3. The method of claim 1, wherein traversing the initial text information and the end text information in the extraction template one by one using a recursive algorithm, extracting a web page body paragraph from the web page content according to the initial text information and the end text information, and adding the extracted web page body paragraph to the text file comprises:
traversing the initial text information and the ending text information in the extracted template one by applying a recursive algorithm, and marking the label positions corresponding to each initial text information and each ending text information;
extracting each starting text information and each webpage body paragraph corresponding to the ending text information from the webpage content according to the labeling position of the label through a regular expression, and adding the extracted webpage body paragraphs into a text file, wherein the webpage body paragraphs are one or more.
4. The method for extracting web page text according to claim 1, wherein calculating the minimum quotient value of the number of punctuation marks of the web page source code between each start tag and each end tag in the web page content and the number of contained characters, and using the obtained minimum quotient value as the web page text determination threshold, comprises:
respectively calculating a quotient value of dividing the punctuation mark quantity of the webpage source code between each initial label and the corresponding end label in the webpage content by the character quantity;
and summing the three minimum quotient values, then averaging, and taking the obtained average value as a webpage text judgment threshold value.
5. The method for extracting web page text according to claim 1, wherein determining a label sliding window according to the start text information and the end text information in the extraction template by using a sliding window algorithm comprises:
acquiring the pointer position of the initial text information in the extracted template by adopting a double-pointer mode, and determining the pointer position of the initial text information as a first boundary of a label sliding window;
continuously expanding the pointer backwards to the pointer position of the ending text information in the extracted template according to the first boundary of the label sliding window;
and determining a second boundary of the label sliding window according to the pointer position of the ending text message, namely, taking an index section of the pointer as the label sliding window, wherein the index section of the pointer indicates the distance between the first boundary of the label sliding window and the second boundary of the label sliding window.
6. The method for extracting web page text according to claim 1, wherein extracting web page text fields meeting a web page text determination threshold according to the number of the punctuation marks and the number of the contained characters in the label sliding window traversing web page content comprises:
continuously expanding the pointer to the pointer position of the ending text information in the extracted template according to the first boundary of the label sliding window;
while expanding the label sliding window, calculating the ratio of the number of the punctuations in the webpage content to the number of the contained characters until the ratio in the label sliding window is less than or equal to a webpage text judgment threshold, and stopping expanding the label sliding window;
moving the pointer to reduce the label sliding window until the ratio of the number of punctuation marks to the number of contained characters in the label sliding window is greater than the webpage text judgment threshold, and stopping reducing the label sliding window;
and extracting a webpage text field meeting a webpage text determination threshold value from the webpage content according to the label sliding window.
7. An apparatus for extracting text from a web page, the apparatus comprising:
the cleaning module is used for cleaning all noise labels and script codes in the webpage source codes by using a regular expression and obtaining webpage contents after cleaning;
the extraction template module is used for acquiring an extraction template corresponding to the webpage content, and the extraction template comprises at least one piece of starting text information and one piece of ending text information;
the traversal module is used for traversing the initial text information and the ending text information in the extraction template one by applying a recursive algorithm, extracting a webpage body paragraph from the webpage content according to the initial text information and the ending text information, and adding the extracted webpage body paragraph into a text file;
the calculation threshold module is used for calculating a quotient value of the number of the punctuations of the webpage source code between each start label and each end label in the webpage content and the number of the characters, and the obtained quotient value is used as a webpage text judgment threshold value;
a window determining module, configured to determine a label sliding window according to the start text information and the end text information in the extracted template by using a sliding window algorithm;
and the webpage text extraction module is used for extracting the webpage text fields meeting the webpage text determination threshold according to the number of characters in the label sliding window traversing webpage contents.
8. The web page text extraction apparatus according to claim 7, wherein the web page text extraction module comprises:
a first extraction unit, configured to continuously advance the pointer, starting from the first boundary of the label sliding window, toward the pointer position of the end text information in the extraction template;
a second extraction unit, configured to calculate, while the label sliding window is being expanded, the ratio of the number of punctuation marks to the number of characters contained in the window, and to stop expanding the label sliding window once that ratio is less than or equal to the webpage text determination threshold;
a third extraction unit, configured to move the pointer to shrink the label sliding window, and to stop shrinking once the ratio of the number of punctuation marks in the label sliding window to the number of characters it contains is greater than the webpage text determination threshold;
and a fourth extraction unit, configured to extract, according to the label sliding window, the webpage text field that meets the webpage text determination threshold from the webpage content.
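One possible reading of the expand-then-shrink behaviour of the four units in claim 8 is sketched below. It rests on several assumptions that the claim does not state: the window moves over a list of text blocks rather than single characters, shrinking moves the right-hand boundary back, and the names `window_extract` and `punctuation_ratio` as well as the punctuation set are hypothetical.

```python
# Assumed punctuation set (not specified in the claims).
PUNCTUATION = set(",.;:!?\"'") | set(",。;:!?、“”‘’")


def punctuation_ratio(text):
    """Quotient of punctuation count over character count (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch in PUNCTUATION for ch in text) / len(text)


def window_extract(blocks, first, last, threshold):
    """Slide a window over blocks[first:last] and return the blocks it keeps.

    blocks    -- text blocks in page order (assumed granularity)
    first     -- index of the block at the start text position
    last      -- index just past the block at the end text position
    threshold -- webpage text determination threshold
    """
    left, right = first, first
    # Expand the right boundary toward the end position; stop expanding once
    # the ratio inside the window is less than or equal to the threshold.
    while right < last:
        right += 1
        if punctuation_ratio("".join(blocks[left:right])) <= threshold:
            break
    # Shrink the window until its ratio is greater than the threshold again.
    while right > left and punctuation_ratio("".join(blocks[left:right])) <= threshold:
        right -= 1
    return blocks[left:right]


# Hypothetical blocks: two body sentences followed by navigation-like text.
blocks = [
    "Hello, world.",
    "More text, here; yes.",
    "Home About Contact Login Register Sitemap Privacy Terms Help",
    "Copyright",
]
body = window_extract(blocks, 0, len(blocks), 0.1)
```

In this example the window grows over the two punctuated sentences, stops when the navigation block pulls the ratio below the threshold, and then shrinks back by one block, leaving only the body text.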
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 6.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202110707708.0A 2021-06-24 2021-06-24 Webpage text extraction method, device, equipment and storage medium Active CN113378088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707708.0A CN113378088B (en) 2021-06-24 2021-06-24 Webpage text extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707708.0A CN113378088B (en) 2021-06-24 2021-06-24 Webpage text extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113378088A CN113378088A (en) 2021-09-10
CN113378088B CN113378088B (en) 2024-01-19

Family

ID=77579037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707708.0A Active CN113378088B (en) 2021-06-24 2021-06-24 Webpage text extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113378088B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810251A (en) * 2014-01-21 2014-05-21 南京财经大学 Method and device for extracting text
CN105095466A (en) * 2015-07-31 2015-11-25 山东大学 Web text information extraction method
CN105630941A (en) * 2015-12-23 2016-06-01 成都电科心通捷信科技有限公司 Statistics and webpage structure based Web body text content extraction method
CN108491414A (en) * 2018-02-05 2018-09-04 中国科学院信息工程研究所 A kind of online abstracting method of news content and system of fusion topic feature
CN108664522A (en) * 2017-04-01 2018-10-16 优信互联(北京)信息技术有限公司 Web page processing method and device
CN109033282A (en) * 2018-07-11 2018-12-18 山东邦尼信息科技有限公司 A kind of Web page text extracting method and device based on extraction template
US20200175962A1 (en) * 2018-12-04 2020-06-04 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
US20200364252A1 (en) * 2019-05-16 2020-11-19 Microsoft Technology Licensing, Llc Generating electronic summary documents for landing pages

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
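AHMET SELMAN BOZKIR et al.: "Layout-based computation of web page similarity ranks", 《INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES》, vol. 110, pages 95 - 114, XP085266893, DOI: 10.1016/j.ijhcs.2017.10.008 *
刘鹏程 (Liu Pengcheng): "Research on a webpage body text extraction method combining block density and label path features", 《China Excellent Master's Theses Full-text Database, Information Science and Technology》, no. 07, pages 138 - 1903 *
王健 (Wang Jian): "Research on Hadoop-based Web page body text extraction technology", 《China Excellent Master's Theses Full-text Database, Information Science and Technology》, no. 02, pages 138 - 2874 *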

Also Published As

Publication number Publication date
CN113378088B (en) 2024-01-19

Similar Documents

Publication Publication Date Title
KR20170123331A (en) Information extraction method and apparatus
US20120303636A1 (en) System and Method for Web Content Extraction
JP2012532395A (en) Selective content extraction
US8804139B1 (en) Method and system for repurposing a presentation document to save paper and ink
CN102081594A (en) Equipment and method for extracting enclosing rectangles of characters from portable electronic documents
CN108874934B (en) Page text extraction method and device
JP7493937B2 (en) Method, program and system for identifying a sequence of headings in a document
CN110990539B (en) Manuscript internal duplicate checking method and device and electronic equipment
CN102033866A (en) Method and system for checking chemical name
CN112733056B (en) Document processing method, device, equipment and storage medium
Yu et al. Web content information extraction based on DOM tree and statistical information
CN113378088A (en) Webpage text extraction method, device, equipment and storage medium
CN106897287B (en) Webpage release time extraction method and device for webpage release time extraction
KR20190090636A (en) Method for automatically editing pattern of document
CN112949290A (en) Text error correction method and device and communication equipment
CN110674286A (en) Text abstract extraction method and device and storage equipment
CN110826488B (en) Image identification method and device for electronic document and storage equipment
CN112241445B (en) Labeling method and device, electronic equipment and storage medium
CN113792545B (en) News event activity name extraction method based on deep learning
KR100907709B1 (en) Information extraction apparatus and method using block grouping
CN114220113A (en) Paper quality detection method, device and equipment
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN106874346B (en) Method and device for extracting page text in webpage
CN112287742B (en) Method and device for analyzing flow chart in file, computing equipment and storage medium
CN115331247A (en) Document structure identification method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant