CN113378088B - Webpage text extraction method, device, equipment and storage medium - Google Patents


Publication number
CN113378088B
Authority
CN
China
Prior art keywords
text
webpage
sliding window
label
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110707708.0A
Other languages
Chinese (zh)
Other versions
CN113378088A (en)
Inventor
刘旭东
张尼
薛继东
苏马婧
宋栋
刘红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
6th Research Institute of China Electronics Corp
Original Assignee
6th Research Institute of China Electronics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 6th Research Institute of China Electronics Corp
Priority to CN202110707708.0A
Publication of CN113378088A
Application granted
Publication of CN113378088B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents

Abstract

The application provides a web page text extraction method, device, equipment and storage medium. The method comprises: extracting web page text paragraphs from the web page content and adding the extracted paragraphs to a text file; calculating, for the web page content between each start tag and each end tag, the quotient of the number of punctuation marks and the number of characters contained, and taking the minimum quotient obtained as a web page text judgment threshold; determining a label sliding window according to the initial text information and the end text information in an extraction template; and traversing the number of punctuation marks and the number of characters contained in the web page content according to the label sliding window, and extracting the web page text fields that conform to the web page text judgment threshold. The beneficial effects of the application are that web page text paragraphs can be accurately extracted from the web page content according to the web page text judgment threshold, which improves extraction precision and avoids redundant extracted paragraphs, and that the sliding window algorithm effectively improves extraction efficiency.

Description

Webpage text extraction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting a web page text.
Background
With the rapid development of mobile internet technology, and with big data, cloud computing and other emerging information technologies being applied ever more widely, information overload is worsening and information is becoming markedly more diverse. The use of CSS, JS, TS and similar technologies makes web page hierarchies richer, and different web page structures are increasingly diverse and individualized, so users often have difficulty focusing quickly on the core content of a web page. A large amount of web page noise makes extracting web page text information harder and more complex. The text content of a web page is concentrated mainly in certain regions of the page, and the tags in those regions are numerous and disordered and may contain many symbols, special characters and the like, which poses a great challenge to accurate extraction of web page text.
Current web page text extraction typically takes one of the following forms. First, the original HTML code of the web page is requested and downloaded, the tags containing text paragraphs are parsed from the original HTML code, and the text is extracted according to the meaning of the tags. This approach has a technical defect: because the text sits at different positions on different websites and the HTML structures differ, corresponding text extraction rules cannot be set for all pages. Second, web page text is extracted based on tag density: relying on the characteristic that HTML tags are sparse within the web page text, the number of characters within the tags is counted to judge whether the content is web page text. In practical application this approach still shows large deviations. Third, web page text is extracted based on vision and deep learning, which depends mainly on conditions such as distinctive semantic features of the web page text and the amount of sample data. This approach cannot be applied widely at scale, and it is difficult to generalize it across many types of web page text content.
Disclosure of Invention
In view of this, the embodiments of the application provide a web page text extraction method. The method obtains the web page content by means of extraction template tag pairs, which unifies the HTML tags around text on websites of different structures and improves general applicability. By calculating a web page text judgment threshold, web page text paragraphs can be accurately extracted from the web page content according to that threshold, and the extracted paragraphs are added to a text file for deduplication, which effectively avoids redundant extracted paragraphs. A sliding window algorithm slides linearly over the arrays of initial text information and end text information in the extraction template, which effectively improves extraction efficiency.
In a first aspect, an embodiment of the present application provides a web page text extraction method, where the method includes:
cleaning all noise tags and script code in the web page source code by using regular expressions, and obtaining the web page content after cleaning;
obtaining an extraction template corresponding to the web page content, wherein the extraction template comprises at least one piece of initial text information and one piece of end text information;
traversing the initial text information and the end text information in the extraction template one by one by using a recursive algorithm, extracting web page text paragraphs from the web page content according to the initial text information and the end text information, and adding the extracted web page text paragraphs to a text file;
calculating, for the web page source code between each start tag and each end tag in the web page content, the quotient of the number of punctuation marks and the number of characters contained, and taking the minimum quotient obtained as a web page text judgment threshold;
determining a label sliding window according to the initial text information and the end text information in the extraction template by using a sliding window algorithm;
and traversing the number of punctuation marks and the number of characters contained in the web page content according to the label sliding window, and extracting the web page text fields that conform to the web page text judgment threshold.
In some embodiments, obtaining the extraction template corresponding to the web page content, the extraction template comprising at least one piece of initial text information and one piece of end text information, includes:
replacing, through a regular expression, each start tag among the tags of the web page content with the corresponding initial text information of the extraction template;
and replacing, through a regular expression, each end tag among the tags of the web page content with the corresponding end text information of the extraction template.
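The tag replacement described above can be sketched as follows. This is a minimal illustration only: the marker strings `[[START]]` and `[[END]]`, the choice of `<p>` and `<div>` as text-bearing tags, and the function name `apply_template` are all assumptions and do not come from the application.

```python
import re

# Hypothetical initial/end text information; the application does not
# specify the actual strings used in the extraction template.
START_MARK = "[[START]]"
END_MARK = "[[END]]"

# Assumption: only <p> and <div> are treated as text-bearing tags.
START_TAG_RE = re.compile(r"<(?:p|div)\b[^>]*>", re.IGNORECASE)
END_TAG_RE = re.compile(r"</(?:p|div)\s*>", re.IGNORECASE)


def apply_template(content: str) -> str:
    """Replace start tags and end tags with the template's marker strings."""
    content = START_TAG_RE.sub(START_MARK, content)
    return END_TAG_RE.sub(END_MARK, content)
```

After this substitution, pages that wrap their text in different tags share one uniform pair of markers, which is the unification effect the method relies on.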
In some embodiments, traversing the initial text information and the end text information in the extraction template one by one by using a recursive algorithm, extracting web page text paragraphs from the web page content according to the initial text information and the end text information, and adding the extracted web page text paragraphs to the text file, includes:
traversing the initial text information and the end text information in the extraction template one by one by using a recursive algorithm, and marking the tag positions corresponding to each piece of initial text information and each piece of end text information;
and extracting, through a regular expression and according to the marked tag positions, the web page text paragraph corresponding to each piece of initial text information and each piece of end text information from the web page content, and adding the extracted web page text paragraphs to the text file, wherein the web page text paragraphs are one or more paragraphs.
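One way to sketch the regular-expression extraction and the deduplication of paragraphs is shown below. The marker strings and the name `extract_paragraphs` are hypothetical, an in-memory list stands in for the text file, and only innermost marker pairs are matched, which is an assumption about how nested pairs are handled.

```python
import re

START_MARK = "[[START]]"  # hypothetical initial text information
END_MARK = "[[END]]"      # hypothetical end text information

# Match the innermost pair: the captured body may not contain another start marker.
PAIR_RE = re.compile(
    re.escape(START_MARK)
    + r"((?:(?!" + re.escape(START_MARK) + r").)*?)"
    + re.escape(END_MARK),
    re.DOTALL,
)


def extract_paragraphs(content: str) -> list:
    """Return the distinct paragraphs found between marker pairs, in page order."""
    seen = set()
    paragraphs = []
    for match in PAIR_RE.finditer(content):
        text = match.group(1).strip()
        # Skip empty bodies and paragraphs that were already collected.
        if text and text not in seen:
            seen.add(text)
            paragraphs.append(text)
    return paragraphs
```

Writing the returned list to a file, one paragraph per line, would complete the step; that part is omitted here to keep the sketch free of side effects.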
In some embodiments, calculating the quotient of the number of punctuation marks and the number of characters contained in the web page source code between each start tag and each end tag in the web page content, and taking the minimum quotient obtained as the web page text judgment threshold, includes:
calculating, for each start tag and its corresponding end tag in the web page content, the quotient of the number of punctuation marks in the web page source code between them divided by the number of characters;
and summing the three smallest quotients, taking their average, and using the average obtained as the web page text judgment threshold.
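The threshold calculation can be illustrated with a short sketch. The set of characters counted as punctuation, the decision to count every character (including spaces) in the denominator, and the behaviour when fewer than three segments exist are assumptions; the function names are hypothetical.

```python
import string

# Assumption: ASCII punctuation plus a few common full-width marks.
PUNCTUATION = set(string.punctuation) | set("，。！？；：、（）《》")


def punctuation_quotient(segment: str) -> float:
    """Number of punctuation marks divided by number of characters."""
    if not segment:
        return 0.0
    marks = sum(1 for ch in segment if ch in PUNCTUATION)
    return marks / len(segment)


def decision_threshold(segments: list) -> float:
    """Average of the three smallest quotients over all non-empty segments."""
    quotients = sorted(punctuation_quotient(s) for s in segments if s)
    smallest = quotients[:3]  # fewer than three segments: average what exists
    if not smallest:
        raise ValueError("no non-empty segments to compute a threshold from")
    return sum(smallest) / len(smallest)
```

For example, segments with quotients 0.25, 0.2, 0.0 and 0.5 give a threshold of (0.0 + 0.2 + 0.25) / 3 = 0.15.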
In some embodiments, determining the label sliding window according to the initial text information and the end text information in the extraction template by using a sliding window algorithm includes:
acquiring the pointer position of the initial text information in the extraction template by adopting a double-pointer mode, and determining the pointer position of the initial text information as a first boundary of the label sliding window;
starting from the first boundary of the label sliding window, continuously expanding the pointer backwards to the pointer position of the end text information in the extraction template;
and determining a second boundary of the label sliding window according to the pointer position of the end text information, that is, taking the index interval of the pointers as the label sliding window, wherein the index interval of the pointers indicates the distance between the first boundary and the second boundary of the label sliding window.
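A minimal two-pointer sketch of determining the window boundaries follows. It reads "expanding the pointer backwards" as advancing the second pointer towards the end of the content, which is an interpretation; the function name `label_window` and its parameters are hypothetical.

```python
def label_window(content: str, start_mark: str, end_mark: str, begin: int = 0):
    """Return (first_boundary, second_boundary) as a half-open index interval,
    or None when no complete marker pair is found at or after `begin`."""
    # First pointer: position of the initial text information.
    left = content.find(start_mark, begin)
    if left == -1:
        return None
    # Second pointer: advance one index at a time until it reaches
    # the end text information.
    right = left + len(start_mark)
    while right < len(content) and not content.startswith(end_mark, right):
        right += 1
    if right >= len(content):
        return None
    return left, right + len(end_mark)
```

With `content = "xx[[START]]Hello, world.[[END]]yy"`, the slice `content[left:right]` of the returned interval is the full marker pair together with its enclosed text.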
In some embodiments, traversing the number of punctuation marks and the number of characters contained in the web page content according to the label sliding window, and extracting the web page text fields that conform to the web page text judgment threshold, includes:
starting from the first boundary of the label sliding window, continuously expanding the pointer backwards to the pointer position of the end text information in the extraction template;
calculating the ratio of the number of punctuation marks to the number of characters contained in the web page content while expanding the label sliding window, and stopping the expansion once the ratio within the label sliding window is less than or equal to the web page text judgment threshold;
moving the pointer to shrink the label sliding window, and stopping the shrinking once the ratio of the number of punctuation marks to the number of characters contained within the label sliding window is greater than the web page text judgment threshold;
and extracting the web page text fields that conform to the web page text judgment threshold from the web page content according to the label sliding window.
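The expand-then-shrink procedure can be sketched over a list of segments, one per tag pair. This is one possible reading of the steps, not a definitive implementation: shrinking is done by pulling the second pointer back, the window is measured in whole segments, and the names `extract_text_fields` and `count_marks` are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)  # assumption: ASCII punctuation only


def count_marks(segment: str) -> int:
    return sum(1 for ch in segment if ch in PUNCTUATION)


def extract_text_fields(segments: list, threshold: float) -> list:
    """Slide a window over segments and keep runs whose ratio exceeds threshold."""
    fields = []
    left = 0
    while left < len(segments):
        right = left  # window is segments[left:right]
        marks = chars = 0
        # Expand until the ratio falls to or below the threshold.
        while right < len(segments):
            marks += count_marks(segments[right])
            chars += len(segments[right])
            right += 1
            if chars and marks / chars <= threshold:
                break
        # Shrink (assumed: from the expanding side) until the ratio exceeds it.
        while right > left and (chars == 0 or marks / chars <= threshold):
            right -= 1
            marks -= count_marks(segments[right])
            chars -= len(segments[right])
        if right > left:
            fields.append("".join(segments[left:right]))
            left = right      # continue after the extracted field
        else:
            left += 1         # nothing qualified; move the first boundary on
    return fields
```

With a threshold of 0.1, navigation-like segments such as "Home" or "Contact" have a ratio of 0 and are dropped, while adjacent punctuated sentences are merged into one field.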
In a second aspect, an embodiment of the present application provides a web page text extraction device, where the device includes:
the cleaning module is used for cleaning all noise labels and script codes in the webpage source codes by using the regular expression, and obtaining webpage content after cleaning;
the extraction template module is used for obtaining an extraction template corresponding to the webpage content, and the extraction template comprises at least one initial text message and one end text message;
the traversal module is used for traversing the initial text information and the end text information in the extraction template one by using a recursion algorithm, extracting webpage text paragraphs from webpage contents according to the initial text information and the end text information, and adding the extracted webpage text paragraphs into a text file;
the threshold calculating module is used for calculating the minimum quotient of the number of punctuation marks and the number of characters contained in the web page source code between each start tag and each end tag in the web page content, and taking the minimum quotient obtained as the web page text judgment threshold;
the window determining module is used for determining a label sliding window according to the initial text information and the end text information in the extraction template by adopting a sliding window algorithm;
and the web page text extraction module is used for traversing the number of punctuation marks and the number of characters contained in the web page content according to the label sliding window and extracting the web page text fields that conform to the web page text judgment threshold.
In some embodiments, the web page text extraction module includes:
the first extraction unit is used for expanding the pointer continuously backwards to the pointer position for ending the text information in the extraction template according to the first boundary of the label sliding window;
the second extraction unit is used for calculating the ratio of the number of punctuation marks to the number of characters contained in the webpage content while expanding the label sliding window until the ratio in the label sliding window is smaller than or equal to the webpage text judgment threshold value, and stopping expanding the label sliding window;
the third extraction unit is used for moving the pointer to shrink the label sliding window until the ratio of the number of punctuation marks and the number of characters contained in the label sliding window is greater than the webpage text judgment threshold value, and stopping shrinking the label sliding window;
and the fourth extraction unit is used for extracting the webpage text field which accords with the webpage text judgment threshold value from the webpage content according to the label sliding window.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the steps of the method for extracting a web page text according to any one of the first aspect are implemented when the processor executes the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, performs the steps of the web page text extraction method according to any one of the first aspect.
The beneficial effects of the application mainly lie in the following. The web page content is obtained by means of extraction template tag pairs, which unifies the HTML tags around text on websites of different structures and improves the general applicability of web page text extraction. The quotient of the number of punctuation marks and the number of characters contained in the web page source code between each start tag and each end tag in the web page content is calculated, and the minimum quotient obtained is used as the web page text judgment threshold; web page text paragraphs are extracted from the web page content according to this threshold, so that they can be extracted accurately. The extracted web page text paragraphs are added to a text file for deduplication, which effectively avoids redundant extracted paragraphs. A sliding window algorithm slides linearly over the arrays of initial text information and end text information in the extraction template, so that the web page text paragraphs in the web page content can be extracted accurately and extraction efficiency is effectively improved.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a flow chart of a web page text extraction method according to an embodiment of the present application.
Fig. 2 shows a flowchart of extracting text paragraphs of a web page according to an embodiment of the present application.
Fig. 3 is a schematic flow chart of calculating a web page text decision threshold according to an embodiment of the present application.
Fig. 4 shows a schematic flow chart of determining a label sliding window according to an embodiment of the present application.
Fig. 5 is a schematic flow chart of extracting a text field of a web page that meets a text decision threshold according to an embodiment of the present application.
Fig. 6 shows a schematic structural diagram of a web page text extraction device according to an embodiment of the present application.
Fig. 7 shows a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations, and thus the following detailed description of the embodiments of the present application, as provided in the figures, is not intended to limit the scope of the application as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Current web page text extraction typically takes one of the following forms. First, the original HTML code of the web page is requested and downloaded, the tags containing text paragraphs are parsed from the original HTML code, and the text is extracted according to the meaning of the tags. This approach has a technical defect: because the text sits at different positions on different websites and the HTML structures differ, corresponding text extraction rules cannot be set for all web pages. Second, web page text is extracted based on tag density: relying on the characteristic that HTML tags are sparse within the web page text, the number of characters within the tags is counted to judge whether the content is web page text. In practical application this approach still shows large deviations. Third, web page text is extracted based on vision and deep learning, which depends mainly on conditions such as distinctive semantic features of the web page text and the amount of sample data. This approach cannot be applied widely at scale, and it is difficult to generalize it across many types of web page text content.
These prior-art approaches can, to a certain extent, solve the problem of extracting text content from different websites. In the first approach, each HTML web page is treated as a separate file; the amount of text is divided by the number of tags, computed line by line, the result is presented as a histogram, and the histogram is finally clustered to distinguish text regions from noise regions. In the second approach, the structural features of the web page parse tree are used, and a text density feature is measured by dividing the number of all text characters in a tree node by the number of tags in the subtree rooted at that node. In the third approach, the DOM tree is partitioned, text density and punctuation density are studied together, and the web page text is extracted based on text feature values. However, this approach considers only the coverage of the web page text content and ignores the redundancy of the web page text, although it achieves good results on some websites.
The above methods for extracting web page text from general websites rely mainly on subjective analysis, experience and knowledge of web page structure. However, the text is located differently in different web pages and the HTML structures differ, so these extraction methods lack general applicability and their extraction results show large deviations.
In view of the defects in the prior art, the application cleans all noise tags and script code in the web page source code by using regular expressions, and obtains the web page content after cleaning; obtains an extraction template corresponding to the web page content, the extraction template comprising at least one piece of initial text information and one piece of end text information; traverses the initial text information and the end text information in the extraction template one by one by using a recursive algorithm, extracts web page text paragraphs from the web page content according to the initial text information and the end text information, and adds the extracted web page text paragraphs to a text file; calculates, for the web page source code between each start tag and each end tag in the web page content, the quotient of the number of punctuation marks and the number of characters contained, and takes the minimum quotient obtained as the web page text judgment threshold; determines a label sliding window according to the initial text information and the end text information in the extraction template by using a sliding window algorithm; and traverses the number of punctuation marks and the number of characters contained in the web page content according to the label sliding window, extracting the web page text fields that conform to the web page text judgment threshold. Specifically, the start tags and end tags in the web page content are replaced with the initial text information and end text information of the extraction template, and the web page content is obtained by means of extraction template tag pairs, which unifies the HTML tags around text on websites of different structures and improves the general applicability of web page text extraction. The minimum quotient of the number of punctuation marks and the number of characters contained in the web page source code between each start tag and each end tag in the web page content is used as the web page text judgment threshold; web page text paragraphs are extracted from the web page content according to this threshold, so that they can be extracted accurately, and the extracted paragraphs are added to a text file for deduplication, which effectively avoids redundant extracted paragraphs. The ratio of the number of punctuation marks to the number of characters in the web page content is calculated through the sliding window algorithm, and whether the ratio conforms to the web page text judgment threshold is judged; if so, the web page text paragraph is extracted from the web page content.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Fig. 1 shows a flow chart of the web page text extraction method provided in an embodiment of the present application. As shown in Fig. 1, the method specifically includes the following steps:
Step S10, cleaning all noise tags and script code in the web page source code by using regular expressions, and obtaining the web page content after cleaning.
In the specific implementation of step S10, all javascript tags, style tags and script code in the web page source code are cleaned by using regular expressions to obtain the web page content.
Step S20, obtaining an extraction template corresponding to the web page content, wherein the extraction template comprises at least one piece of initial text information and one piece of end text information.
In the specific implementation of step S20, a template extraction mode is adopted to obtain the extraction template corresponding to the web page content. The extraction template comprises at least one piece of initial text information and one piece of end text information, and each piece of initial text information corresponds to one piece of end text information.
Step S30, traversing the initial text information and the end text information in the extraction template one by one by using a recursive algorithm, extracting web page text paragraphs from the web page content according to the initial text information and the end text information, and adding the extracted web page text paragraphs to the text file.
In the specific implementation of step S30, a recursive algorithm is used to traverse the initial text information and the end text information in the extraction template, and their positions are marked. Taking each marked position as a starting point, the traversal continues in sequence; multiple web page text paragraphs are extracted from the web page content according to the marked positions, and the extracted web page text paragraphs are added to the text file.
Step S40, calculating the minimum quotient of the number of punctuation marks and the number of characters contained in the web page source code between each start tag and each end tag in the web page content, and taking the minimum quotient obtained as the web page text judgment threshold.
In the specific implementation of step S40, the quotient of the number of punctuation marks and the number of characters contained is calculated for the web page source code between each start tag and each end tag in the web page content; the three smallest quotients are summed and averaged, and the average is used as the web page text judgment threshold.
Step S50, determining a label sliding window according to the initial text information and the end text information in the extraction template by using a sliding window algorithm.
In the specific implementation of step S50, the double-pointer mode of the sliding window algorithm is used. The first boundary of the label sliding window is determined according to the pointer position of the initial text information in the extraction template. Starting from the first boundary, the pointer is continuously expanded backwards to the pointer position of the end text information corresponding to that initial text information, which determines the second boundary of the label sliding window; that is, the index interval of the pointers is taken as the label sliding window.
Step S60, traversing the number of punctuation marks and the number of characters contained in the web page content according to the label sliding window, and extracting the web page text fields that conform to the web page text judgment threshold.
In the specific implementation of step S60, the number of punctuation marks and the number of characters contained in the web page content are traversed according to the label sliding window, the ratio of the number of punctuation marks to the number of characters contained within the label sliding window is calculated, and whether the ratio conforms to the web page text judgment threshold is judged; if so, the web page text field is extracted from the web page content.
In a possible implementation, cleaning all noise tags and script code in the web page source code by using regular expressions in step S10, and obtaining the web page content after cleaning, includes:
Step 101, cleaning all javascript script tags, style tags and script code in the web page source code by using regular expressions, and obtaining the web page content after cleaning.
In the specific implementation, step 101 uses a regular expression < script \s (\n|.).
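The cleaning step can be sketched as below. The expression quoted in the text is only partially shown, so the two patterns here are assumptions written for illustration rather than the application's own expressions; the function name `clean_source` is hypothetical.

```python
import re

# Assumed patterns: a whole <script>...</script> or <style>...</style> block,
# matched lazily so that adjacent blocks are removed one at a time.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)


def clean_source(html: str) -> str:
    """Remove script and style blocks from the web page source code."""
    for pattern in (SCRIPT_RE, STYLE_RE):
        html = pattern.sub("", html)
    return html
```

The `re.DOTALL` flag plays the role of the `(\n|.)` alternation in the quoted fragment: it lets `.` match newlines so that multi-line script bodies are removed in one match.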
In a possible implementation, obtaining the extraction template corresponding to the web page content in step S20, the extraction template comprising at least one piece of initial text information and one piece of end text information, includes:
Step 201, replacing, through a regular expression, each start tag among the tags of the web page content with the corresponding initial text information of the extraction template.
In the specific implementation of step 201, each start tag among the tags of the web page content is replaced with initial text information of the extraction template by using a regular expression, and each piece of initial text information has a corresponding piece of end text information.
Step 202, replacing, through a regular expression, each end tag among the tags of the web page content with the corresponding end text information of the extraction template.
In the specific implementation of step 202, each end tag among the tags of the web page content is replaced with end text information of the extraction template by using a regular expression, and each piece of end text information has a corresponding piece of initial text information.
In one possible implementation, Fig. 2 shows a flow chart of extracting web page text paragraphs provided in an embodiment of the present application. In step S30, traversing the start text information and the end text information in the extraction template one by one by using a recursive algorithm, extracting web page text paragraphs from the web page content according to the start text information and the end text information, and adding the extracted web page text paragraphs to the text file specifically includes the following steps:
Step S301: traversing the start text information and the end text information in the extraction template one by one by using a recursive algorithm, and marking the tag position corresponding to each piece of start text information and each piece of end text information.
Step S302: extracting, by using a regular expression and according to the marked tag positions, the web page text paragraph corresponding to each piece of start text information and each piece of end text information from the web page content, and adding the extracted web page text paragraphs to the text file, where there are one or more web page text paragraphs.
In a specific implementation of steps S301 and S302, a recursive algorithm traverses, one by one and starting from the outermost layer of the extraction template, each tag pair formed by a piece of start text information and its corresponding end text information, and marks the tag positions of both. A regular expression then extracts, according to the marked positions, the web page text paragraph enclosed by each tag pair from the web page content, and the paragraph is added to the text file. There are one or more web page text paragraphs.
For example: starting from the outermost layer of the extraction template, the recursive algorithm visits the tag position of the first piece of start text information and records it as P1. Taking P1 as the starting point, it searches for the nearest tag position of the end text information corresponding to that start text information. The web page text paragraph enclosed by this tag pair is extracted from the web page content by a regular expression according to the marked positions and added to the text file. The algorithm then moves on to the tag position of the next piece of start text information, records it as P2, and repeats the search and extraction in the same way until all tag pairs have been processed. There are one or more web page text paragraphs.
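The traversal in the example can be sketched in Python as follows. The sketch assumes hypothetical markers of the form `[[START:name]]` and `[[END:name]]` for the start and end text information, collects paragraphs in a list instead of writing a text file, and records only the innermost text of each pair; these choices, and the name `extract_paragraphs`, are illustrative assumptions rather than details taken from the application.

```python
import re
from typing import List, Optional

START_RE = re.compile(r"\[\[START:(\w+)\]\]")


def extract_paragraphs(template: str, out: Optional[List[str]] = None) -> List[str]:
    """Recursively collect the text enclosed by start/end marker pairs."""
    out = [] if out is None else out
    pos = 0
    while True:
        start = START_RE.search(template, pos)      # position P1, P2, ...
        if start is None:
            break
        end_marker = f"[[END:{start.group(1)}]]"
        end = template.find(end_marker, start.end())  # nearest matching end
        if end == -1:
            pos = start.end()                        # unpaired start: skip it
            continue
        inner = template[start.end():end]
        if START_RE.search(inner):
            extract_paragraphs(inner, out)           # descend one layer
        elif inner.strip():
            out.append(inner.strip())                # innermost text paragraph
        pos = end + len(end_marker)
    return out
```

As in the description, the search takes the nearest end marker after each start marker, so a tag nested inside another tag of the same name would be paired incorrectly by this sketch.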
In one possible implementation, Fig. 3 shows a schematic flow chart of calculating the web page text judgment threshold provided in an embodiment of the present application. In step S40, calculating the minimum quotient of the number of punctuation marks and the number of characters of the web page source code between each start tag and each end tag in the web page content, and using the obtained minimum quotient as the web page text judgment threshold, specifically includes the following steps:
Step S401: for each start tag and its corresponding end tag in the web page content, calculating the quotient of the number of punctuation marks of the web page source code between them divided by the number of characters.
Step S402: summing the three smallest quotients, taking their average, and using the obtained average as the web page text judgment threshold.
In a specific implementation of steps S401 and S402, the number of punctuation marks and the number of characters of the web page source code between each start tag and its corresponding end tag are obtained, and the quotient of the two is calculated. The smaller the quotient, the greater the text density between the start tag and the corresponding end tag. The three smallest quotients are summed and averaged, and the obtained average is used as the web page text judgment threshold.
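A minimal Python sketch of steps S401 and S402 is given below. The punctuation set, the decision to ignore whitespace when counting characters, and the names `PUNCTUATION`, `punctuation_ratio` and `decision_threshold` are assumptions made for illustration; the application does not define them.

```python
from typing import Iterable, Optional

# Assumed punctuation set (ASCII and common CJK marks).
PUNCTUATION = set(",.;:!?，。；：！？、")


def punctuation_ratio(text: str) -> Optional[float]:
    """Quotient of punctuation count and character count for one tag pair."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return None  # empty block: no quotient defined
    return sum(c in PUNCTUATION for c in chars) / len(chars)


def decision_threshold(blocks: Iterable[str], k: int = 3) -> float:
    """Average of the k smallest quotients (k = 3 in step S402)."""
    ratios = sorted(r for r in map(punctuation_ratio, blocks) if r is not None)
    if not ratios:
        raise ValueError("no non-empty blocks to compute a threshold from")
    smallest = ratios[:k]
    return sum(smallest) / len(smallest)
```

For the blocks `"ab,d"`, `"abcdefgh.j"`, `"a,b,"` and `"abcde"` the quotients are 0.25, 0.1, 0.5 and 0.0, so the threshold is the average of 0.0, 0.1 and 0.25. When fewer than three blocks are available, this sketch averages whatever quotients exist.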
In one possible implementation, Fig. 4 shows a schematic flow chart of determining the label sliding window provided in an embodiment of the present application. In step S50, determining the label sliding window according to the start text information and the end text information in the extraction template by using a sliding window algorithm further includes the following steps:
Step S501: obtaining the pointer position of the start text information in the extraction template in a double-pointer manner, and determining the pointer position of the start text information as the first boundary of the label sliding window.
Step S502: starting from the first boundary of the label sliding window, continuously moving the pointer backwards to the pointer position of the end text information in the extraction template.
Step S503: determining the second boundary of the label sliding window according to the pointer position of the end text information; that is, the index interval of the pointers is used as one label sliding window, and the index interval indicates the distance between the first boundary and the second boundary of the label sliding window.
In a specific implementation of steps S501, S502 and S503, a double-pointer manner is applied to the extraction template. The pointer position of the first piece of start text information in the extraction template is used as the initial index of both pointers, so the initial index interval has its left index equal to its right index. The left index is determined as the first boundary of the label sliding window. From this first boundary, the right pointer is continuously moved backwards to the pointer position of the end text information in the extraction template, and that position becomes the right index, which is determined as the second boundary of the label sliding window. The first boundary and the second boundary of the label sliding window are thus given by the index interval of the pointers.
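The double-pointer initialization in steps S501 to S503 can be sketched in Python as below. The function name `initial_window` and the marker strings passed to it are hypothetical; the sketch only assumes that the start and end text information can be located as substrings of the extraction template.

```python
from typing import Optional, Tuple


def initial_window(template: str, start_marker: str,
                   end_marker: str) -> Optional[Tuple[int, int]]:
    """Return (first boundary, second boundary) of the label sliding window."""
    left = template.find(start_marker)         # S501: first boundary
    if left == -1:
        return None
    right = left                                # left index == right index
    stop = template.find(end_marker, left + len(start_marker))
    if stop == -1:
        return None
    while right < stop:                         # S502: expand backwards
        right += 1
    return left, right                          # S503: index interval
```

The loop moves the right pointer one position at a time to mirror the description; jumping directly to `stop` would give the same interval.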
In a possible implementation, Fig. 5 shows a flow chart of extracting a web page text field that meets the web page text judgment threshold provided in an embodiment of the present application. In step S60, traversing the number of punctuation marks and the number of characters contained in the web page content according to the label sliding window, and extracting the web page text field that meets the web page text judgment threshold, specifically includes the following steps:
Step S601: starting from the first boundary of the label sliding window, continuously moving the pointer backwards to the pointer position of the end text information in the extraction template.
In a specific implementation of step S601, the first boundary of the label sliding window is set as the starting position, and the pointer is continuously moved backwards from the starting position to the pointer position of the end text information in the extraction template. The pointer position of the end text information is set as the end position for extracting web page content.
Step S602: while expanding the label sliding window, calculating the ratio of the number of punctuation marks to the number of characters contained in the web page content, and stopping the expansion when the ratio in the label sliding window is smaller than or equal to the web page text judgment threshold.
In a specific implementation of step S602, the ratio of the number of punctuation marks to the number of characters contained in the web page content in the label sliding window is calculated while the window is expanded, and it is determined whether the ratio is smaller than or equal to the web page text judgment threshold. If so, the pointer stops expanding the label sliding window, and the second boundary of the label sliding window is obtained.
Step S603: moving the pointer to narrow the label sliding window, and stopping the narrowing when the ratio of the number of punctuation marks to the number of characters contained in the label sliding window is larger than the web page text judgment threshold.
In a specific implementation of step S603, the pointer at the first boundary of the label sliding window is moved. While the window is narrowed, the ratio of the number of punctuation marks to the number of characters contained in the web page content in the label sliding window is calculated, and it is determined whether the ratio is larger than the web page text judgment threshold. If so, the pointer stops narrowing the label sliding window.
Step S604: extracting the web page text field that meets the web page text judgment threshold from the web page content according to the label sliding window.
In a specific implementation of step S604, steps S602 and S603 are repeated with the label sliding window, and the web page text fields that meet the web page text judgment threshold are extracted from the text file according to the start text information and the end text information of the web page content. There are one or more web page text fields.
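The expand-then-narrow loop of steps S601 to S604 can be sketched in Python as follows. This is one possible reading of the description, not the application's implementation: the window moves over a list of text segments (one per tag pair), narrowing stops just before the ratio would exceed the threshold, and the punctuation set and the names `counts` and `extract_fields` are assumptions.

```python
from typing import List, Tuple

PUNCTUATION = set(",.;:!?，。；：！？、")  # assumed punctuation set


def counts(text: str) -> Tuple[int, int]:
    """Return (punctuation count, non-whitespace character count)."""
    chars = [c for c in text if not c.isspace()]
    return sum(c in PUNCTUATION for c in chars), len(chars)


def extract_fields(segments: List[str], threshold: float) -> List[str]:
    fields, left, n = [], 0, len(segments)
    while left < n:
        right, punct, chars = left, 0, 0
        # S602: expand until the ratio in the window is <= threshold.
        while right < n:
            p, c = counts(segments[right])
            punct, chars, right = punct + p, chars + c, right + 1
            if chars and punct / chars <= threshold:
                break
        else:
            break  # the threshold was never met: nothing more to extract
        # S603: narrow from the first boundary; stop before the ratio
        # of the narrowed window would become larger than the threshold.
        while left < right - 1:
            p, c = counts(segments[left])
            if chars - c == 0 or (punct - p) / (chars - c) > threshold:
                break
            punct, chars, left = punct - p, chars - c, left + 1
        fields.append("".join(segments[left:right]))  # S604
        left = right
    return fields
```

With the segments `["a,b,c,", "abcdefghijklmnopqrst.", ".,;"]` and a threshold of 0.2, the window first grows over the first two segments, then drops the punctuation-heavy first segment, and the remaining second segment is extracted as the only field.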
Fig. 6 shows a schematic structural diagram of a web page text extraction device provided in an embodiment of the present application, where, as shown in fig. 6, the device includes:
The cleaning module 701 is configured to clean all noise tags and script codes in the web page source code by using a regular expression, and obtain web page content after cleaning;
the extraction template module 702 is configured to obtain an extraction template corresponding to the web page content, where the extraction template includes at least one initial text message and one end text message;
the traversing module 703 is configured to traverse the start text information and the end text information in the extraction template one by using a recursive algorithm, extract a web page text paragraph from the web page content according to the start text information and the end text information, and add the extracted web page text paragraph to the text file;
the calculating threshold module 704 is configured to calculate a quotient of a punctuation number of a web page source code between each start tag and each end tag in the web page content and a number of characters contained, where the obtained quotient is used as a web page text judging threshold;
the determining window module 705 is configured to determine a tag sliding window according to the initial text information and the end text information in the extraction template by using a sliding window algorithm;
the web page text extraction module 706 is configured to extract a web page text field that meets a web page text decision threshold according to the number of characters in the web page content traversed by the tag sliding window.
In one possible implementation, the web page text extraction module 706 includes:
a first determining unit, configured to obtain the pointer position of the start text information in the extraction template in a double-pointer manner, and to determine the pointer position of the start text information as the first boundary of the label sliding window;
a second determining unit, configured to continuously move the pointer backwards, starting from the first boundary of the label sliding window, to the pointer position of the end text information in the extraction template;
a third determining unit, configured to determine the second boundary of the label sliding window according to the pointer position of the end text information; that is, the index interval of the pointers is used as one label sliding window, and the index interval indicates the distance between the first boundary and the second boundary of the label sliding window.
In one possible implementation, the web page text extraction module 706 includes:
a first extraction unit, configured to continuously move the pointer backwards, starting from the first boundary of the label sliding window, to the pointer position of the end text information in the extraction template;
a second extraction unit, configured to calculate the ratio of the number of punctuation marks to the number of characters contained in the web page content while expanding the label sliding window, and to stop expanding the label sliding window when the ratio in the window is smaller than or equal to the web page text judgment threshold;
a third extraction unit, configured to move the pointer to narrow the label sliding window, and to stop narrowing it when the ratio of the number of punctuation marks to the number of characters contained in the label sliding window is larger than the web page text judgment threshold;
a fourth extraction unit, configured to extract the web page text field that meets the web page text judgment threshold from the web page content according to the label sliding window.
The device provided by the embodiments of the present application may be specific hardware on an apparatus, or software or firmware installed on an apparatus. The device has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the device embodiments do not mention a point, reference may be made to the corresponding content in the foregoing method embodiments. It will be clear to those skilled in the art that, for convenience and brevity, the specific operation of the system, device and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not described in detail here.
Corresponding to the web page text extraction method in Fig. 1, an embodiment of the present application further provides a computer device 80. As shown in Fig. 7, the device includes a memory 801, a processor 802, and a computer program stored in the memory 801 and executable on the processor 802, where the processor 802 implements the web page text extraction method, including the following steps, when executing the computer program:
cleaning all noise tags and script code in the web page source code by using a regular expression, and obtaining the web page content after cleaning;
obtaining an extraction template corresponding to the webpage content, wherein the extraction template comprises at least one initial text message and one end text message;
traversing the initial text information and the end text information in the extraction template one by using a recursion algorithm, extracting webpage text paragraphs from webpage contents according to the initial text information and the end text information, and adding the extracted webpage text paragraphs into a text file;
calculating the minimum quotient of the punctuation mark number of the webpage source code and the contained character number between each starting label and each ending label in the webpage content, and taking the obtained minimum quotient as a webpage text judgment threshold;
adopting a sliding window algorithm, and determining a label sliding window according to the initial text information and the end text information in the extraction template;
and traversing the punctuation mark number and the character containing number in the webpage content according to the label sliding window, and extracting the webpage text field which accords with the webpage text judgment threshold.
Corresponding to the web page text extraction method in fig. 1, the embodiment of the present application further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor performs the following steps:
cleaning all noise tags and script code in the web page source code by using a regular expression, and obtaining the web page content after cleaning;
obtaining an extraction template corresponding to the webpage content, wherein the extraction template comprises at least one initial text message and one end text message;
traversing the initial text information and the end text information in the extraction template one by using a recursion algorithm, extracting webpage text paragraphs from webpage contents according to the initial text information and the end text information, and adding the extracted webpage text paragraphs into a text file;
calculating the minimum quotient of the punctuation mark number of the webpage source code and the contained character number between each starting label and each ending label in the webpage content, and taking the obtained minimum quotient as a webpage text judgment threshold;
adopting a sliding window algorithm, and determining a label sliding window according to the initial text information and the end text information in the extraction template;
and traversing the punctuation mark number and the character containing number in the webpage content according to the label sliding window, and extracting the webpage text field which accords with the webpage text judgment threshold.
In the embodiments of the present application, the computer program, when executed by the processor, may further execute other machine-readable instructions to perform the other methods described in the present application. For the specific method steps and principles, refer to the foregoing description; they are not described in detail here.
In the embodiments provided in the present application, it should be understood that the disclosed methods and devices may be implemented in other manners. The device embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; as a further example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection through some communication interface, device or unit, and may be electrical, mechanical or in another form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments provided in the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
It should be noted that: like reference numerals and letters in the following figures denote like items, and thus once an item is defined in one figure, no further definition or explanation of it is required in the following figures, and furthermore, the terms "first," "second," "third," etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the foregoing examples are merely specific embodiments of the present application, intended to illustrate its technical solution rather than to limit it, and the protection scope of the present application is not limited to them. Although the present application is described in detail with reference to the foregoing examples, those skilled in the art will appreciate that, within the technical scope disclosed in the present application, any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments, and are intended to be encompassed within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A web page text extraction method, characterized by comprising the following steps:
cleaning all noise labels and script codes in the webpage source codes by using a regular expression, and obtaining webpage content after cleaning;
obtaining an extraction template corresponding to the webpage content, wherein the extraction template comprises at least one initial text message and one end text message;
Traversing the initial text information and the end text information in the extraction template one by using a recursion algorithm, extracting webpage text paragraphs from webpage contents according to the initial text information and the end text information, and adding the extracted webpage text paragraphs into a text file;
calculating the minimum quotient of the punctuation mark number of the webpage source code and the contained character number between each starting label and each ending label in the webpage content, and taking the obtained minimum quotient as a webpage text judgment threshold;
adopting a sliding window algorithm, and determining a label sliding window according to the initial text information and the end text information in the extraction template;
traversing the punctuation mark number and the character containing number in the webpage content according to the label sliding window, and extracting a webpage text field which accords with a webpage text judgment threshold;
wherein traversing the number of punctuation marks and the number of characters contained in the web page content according to the label sliding window, and extracting the web page text field that meets the web page text judgment threshold, comprises:
according to the first boundary of the label sliding window, continuously expanding the pointer backwards to the pointer position of the ending text information in the extraction template; the first boundary of the label sliding window is the pointer position of the initial text information in the extraction template;
Calculating the ratio of the number of punctuation marks to the number of characters contained in the webpage content while expanding the label sliding window until the ratio in the label sliding window is smaller than or equal to the webpage text judgment threshold value, and stopping expanding the label sliding window;
moving the pointer of the first boundary to shrink the label sliding window until the ratio of the number of punctuation marks and the number of characters contained in the label sliding window is larger than the webpage text judging threshold value, and stopping shrinking the label sliding window;
and extracting the webpage text field which accords with the webpage text judgment threshold value from the webpage content according to the label sliding window.
2. The method for extracting text from a web page according to claim 1, wherein obtaining an extraction template corresponding to the web page content, the extraction template including at least one start text message and one end text message, comprises:
replacing each initial tag in the tags of the webpage content with initial text information corresponding to the extraction template through a regular expression;
and replacing each ending label in the labels of the webpage content with ending text information corresponding to the extraction template through a regular expression.
3. The web page text extraction method of claim 1, wherein traversing the start text information and the end text information in the extraction template one by one using a recursive algorithm, extracting web page text paragraphs from web page content according to the start text information and the end text information, and adding the extracted web page text paragraphs to a text file comprises:
Traversing the initial text information and the end text information in the extraction template one by using a recursion algorithm, and marking the label positions corresponding to each initial text information and each end text information;
and extracting each piece of initial text information and a corresponding piece of webpage text paragraph of each piece of end text information from the webpage content according to the labeling position of the label through a regular expression, and adding the extracted webpage text paragraphs into a text file, wherein the webpage text paragraphs are one or more sections.
4. The web page text extraction method according to claim 1, wherein calculating a minimum quotient of a punctuation number of web page source codes and a contained character number between each start tag and each end tag in web page content, the obtained minimum quotient serving as a web page text judgment threshold value, comprises:
calculating the quotient of the number of punctuation marks of the web page source codes between each starting tag and the corresponding ending tag in the web page content divided by the number of characters respectively;
and summing the three minimum quotient values, taking an average value, and taking the obtained average value as a webpage text judgment threshold value.
5. The web page text extraction method of claim 1, wherein determining a tag sliding window from the start text information and the end text information in the extraction template using a sliding window algorithm comprises:
Acquiring the pointer position of the initial text information in the extraction template by adopting a double pointer mode, and determining the pointer position of the initial text information as a first boundary of a label sliding window;
according to the first boundary of the label sliding window, continuously expanding the pointer backwards to the pointer position of ending text information in the extraction template;
and determining a second boundary of the label sliding window according to the pointer position of the ending text information, namely, taking an index interval of the pointer as a label sliding window, wherein the index interval of the pointer indicates the distance between the first boundary of the label sliding window and the second boundary of the label sliding window.
6. A web page text extraction device, the device comprising:
the cleaning module is used for cleaning all noise labels and script codes in the webpage source codes by using the regular expression, and obtaining webpage content after cleaning;
the extraction template module is used for obtaining an extraction template corresponding to the webpage content, and the extraction template comprises at least one initial text message and one end text message;
the traversal module is used for traversing the initial text information and the end text information in the extraction template one by using a recursion algorithm, extracting webpage text paragraphs from webpage contents according to the initial text information and the end text information, and adding the extracted webpage text paragraphs into a text file;
The calculating threshold module is used for calculating the quotient of the punctuation mark number of the webpage source code and the contained character number between each starting tag and each ending tag in the webpage content, and the obtained quotient is used as a webpage text judging threshold;
the window determining module is used for determining a label sliding window according to the initial text information and the end text information in the extraction template by adopting a sliding window algorithm;
the webpage text extraction module is used for traversing the number of punctuation marks and the number of contained characters in the webpage content according to the label sliding window and extracting a webpage text field which accords with a webpage text judgment threshold;
the webpage text extraction module comprises:
the first extraction unit is used for expanding the pointer continuously backwards to the pointer position for ending the text information in the extraction template according to the first boundary of the label sliding window; the first boundary of the label sliding window is the pointer position of the initial text information in the extraction template;
the second extraction unit is used for calculating the ratio of the number of punctuation marks to the number of characters contained in the webpage content while expanding the label sliding window until the ratio in the label sliding window is smaller than or equal to the webpage text judgment threshold value, and stopping expanding the label sliding window;
The third extraction unit is used for moving the pointer of the first boundary to narrow the label sliding window until the ratio of the number of punctuation marks and the number of characters contained in the label sliding window is greater than the webpage text judgment threshold value, and stopping narrowing the label sliding window;
and the fourth extraction unit is used for extracting the webpage text field which accords with the webpage text judgment threshold value from the webpage content according to the label sliding window.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of the preceding claims 1 to 5 when the computer program is executed.
8. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1 to 5.
CN202110707708.0A 2021-06-24 2021-06-24 Webpage text extraction method, device, equipment and storage medium Active CN113378088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707708.0A CN113378088B (en) 2021-06-24 2021-06-24 Webpage text extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707708.0A CN113378088B (en) 2021-06-24 2021-06-24 Webpage text extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113378088A (en) 2021-09-10
CN113378088B (en) 2024-01-19

Family

ID=77579037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707708.0A Active CN113378088B (en) 2021-06-24 2021-06-24 Webpage text extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113378088B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810251A (en) * 2014-01-21 2014-05-21 南京财经大学 Method and device for extracting text
CN105095466A (en) * 2015-07-31 2015-11-25 山东大学 Web text information extraction method
CN105630941A (en) * 2015-12-23 2016-06-01 成都电科心通捷信科技有限公司 Statistics and webpage structure based Wen body text content extraction method
CN108491414A (en) * 2018-02-05 2018-09-04 中国科学院信息工程研究所 A kind of online abstracting method of news content and system of fusion topic feature
CN108664522A (en) * 2017-04-01 2018-10-16 优信互联(北京)信息技术有限公司 Web page processing method and device
CN109033282A (en) * 2018-07-11 2018-12-18 山东邦尼信息科技有限公司 A kind of Web page text extracting method and device based on extraction template

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10388272B1 (en) * 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
US11449536B2 (en) * 2019-05-16 2022-09-20 Microsoft Technology Licensing, Llc Generating electronic summary documents for landing pages

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
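Layout-based computation of web page similarity ranks; Ahmet Selman Bozkir et al.; International Journal of Human-Computer Studies; Vol. 110; 95-114 *
Research on Hadoop-based web page body text extraction technology; Wang Jian (王健); China Master's Theses Full-text Database, Information Science and Technology (No. 02); I138-2874 *
Research on a web page body text extraction method combining block density and tag path features; Liu Pengcheng (刘鹏程); China Master's Theses Full-text Database, Information Science and Technology (No. 07); I138-1903 *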

Also Published As

Publication number Publication date
CN113378088A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
Hill et al. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study
US20110078562A1 (en) Method and system for tracking authorship of content in data
WO2011072434A1 (en) System and method for web content extraction
CN112685671A (en) Page display method, device, equipment and storage medium
CN110704719B (en) Enterprise search text word segmentation method and device
CN109815243B (en) Structured storage method and device during document interface modification
Yu et al. Web content information extraction based on DOM tree and statistical information
CN113378088B (en) Webpage text extraction method, device, equipment and storage medium
Kreuzer et al. A quantitative comparison of semantic web page segmentation approaches
KR20190090636A (en) Method for automatically editing pattern of document
CN105512335B (en) abstract searching method and device
CN110990539A (en) Manuscript internal duplicate checking method and device, storage medium and electronic equipment
CN114970502B (en) Text error correction method applied to digital government
CN110674286A (en) Text abstract extraction method and device and storage equipment
CN110826488B (en) Image identification method and device for electronic document and storage equipment
CN114220113A (en) Paper quality detection method, device and equipment
CN108255866B (en) Method and device for checking links in website
Cheers et al. Identifying plagiarised programming assignments based on source code similarity scores
CN115331247A (en) Document structure identification method and device, electronic equipment and readable storage medium
CN113190644B (en) Method and device for hot updating word segmentation dictionary of search engine
CN115238078A (en) Webpage information extraction method, device, equipment and storage medium
CN115270723A (en) PDF document splitting method, device, equipment and storage medium
KR100907709B1 (en) Information extraction apparatus and method using block grouping
CN112861481A (en) Paging processing method and device, electronic equipment and computer readable storage medium
CN110765079B (en) Table information searching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant