CN114201700A

CN114201700A - Webpage text acquisition method and device, storage medium and electronic equipment

Info

Publication number: CN114201700A
Application number: CN202111509751.2A
Authority: CN
Inventors: 薛秋雨; 陈祖德; 潘仕江; 李天与; 柳超
Original assignee: Beijing Jindi Technology Co Ltd
Current assignee: Beijing Jindi Technology Co Ltd
Priority date: 2021-12-10
Filing date: 2021-12-10
Publication date: 2022-03-18

Abstract

The disclosure provides a webpage text acquisition method, a webpage text acquisition device, a storage medium and electronic equipment, and relates to the technical field of the internet. The method comprises the following steps: acquiring a webpage source code of a target webpage; constructing a corresponding DOM tree according to the webpage source code; generating a corresponding node list according to the text density of each child node in the DOM tree; and, for each of at least one title contained in the target webpage, when a text node matching the current title exists in the node list, locating the position of the text content in the target webpage and acquiring the text content at least according to the positional relationship, in the target webpage, between the text content corresponding to the text node and the text title corresponding to the text node.

Description

Webpage text acquisition method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of internet technologies, and in particular, to a method and an apparatus for acquiring a web page text, a storage medium, and an electronic device.

Background

With the advent of the data era, more and more information needs to be acquired from web pages. Complicated web page layouts make it difficult to browse or acquire the important information in a web page. For data such as bulletins and news, for example, the problem is how to quickly locate the required information in the web page so as to acquire it.

Disclosure of Invention

The disclosure provides a webpage text acquisition method, a webpage text acquisition device, a storage medium and electronic equipment, which are used for quickly positioning and acquiring needed information in a webpage.

In a first aspect of the embodiments of the present disclosure, a method for acquiring a text of a webpage is provided, including:

acquiring a webpage source code of a target webpage;

constructing a corresponding DOM tree according to the webpage source code;

generating a corresponding node list according to the text density of each child node in the DOM tree;

and, for each of at least one title contained in the target webpage, when a text node matching the current title exists in the node list, locating the position of the text content in the target webpage and acquiring the text content at least according to the positional relationship, in the target webpage, between the text content corresponding to the text node and the text title corresponding to the text node.

Optionally, locating the position of the text content in the target webpage and acquiring the text content at least according to the positional relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage includes:

locating the position of the text content in the target webpage and acquiring the text content according to at least one ancestor node of the text node and the positional relationship, in the target webpage, between the text content corresponding to the text node and the text title corresponding to the text node.

Optionally, the text density of each child node in the DOM tree is obtained by:

for each child node in the DOM tree, performing a logarithm operation on the number of titles contained in the child node and the number of characters contained in the text corresponding to each title, to obtain the text density of the child node.

Optionally, the text node in the node list matching the current title is determined by:

determining, in the node list, a first target node whose degree of correlation with the current title reaches a preset value and which is not a webpage anchor;

and taking the first target node as a text node matched with the current title.

Optionally, the method further comprises:

when no text node matching the current title exists in the node list, cyclically traversing the node list until the number of traversals reaches a preset number, so as to determine a second target node in the node list that has the highest relevance to the current title;

taking the second target node as a text node corresponding to the current title;

and positioning the position of the text content in the target webpage and acquiring the text content at least according to the position relation between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage.

Optionally, the method further includes obtaining each title of the at least one title included in the target webpage by at least one of the following methods:

acquiring each title in the at least one title contained in a target webpage according to the label characteristics of the titles in the target webpage;

and acquiring each title of the at least one title contained in the target webpage from a title input along with the URL of the target webpage or along with the webpage source code.

Optionally, the obtaining each title of the at least one title included in the target webpage according to the tag feature of the title in the target webpage includes:

determining a target label in the target webpage as a title label;

acquiring each title in the at least one title contained in the target webpage according to the title label;

wherein the target tag comprises at least one of: an H1 to H5 heading tag, a tag whose properties are set by Style, and a tag that introduces a CSS style by Class.

Optionally, acquiring each title of the at least one title contained in the target webpage from a title input along with the URL of the target webpage or along with the webpage source code includes:

determining a plurality of candidate titles from the title input along with the URL of the target webpage or the source code of the webpage;

and determining each title in the at least one title contained in the target webpage from the plurality of candidate titles according to the text similarity between each candidate title and the target webpage.

Optionally, the method further comprises: and preprocessing the webpage source code before constructing a corresponding DOM tree according to the webpage source code.

Optionally, the web page source code is preprocessed, which includes at least one of:

carrying out normalization processing on the webpage source codes to obtain the webpage source codes in a target format;

and cleaning the webpage source code in the target format through a regular expression or an XML path language.

Optionally, the method further comprises:

determining whether the target webpage has an associated attachment outside the text content;

and under the condition that the target webpage has the associated attachment outside the text content, acquiring the text content of the associated attachment.

Optionally, determining whether the target webpage has an associated attachment outside the body content includes:

determining whether the target webpage contains target information outside the text content;

determining that the target webpage has an associated attachment outside the body content under the condition that the target webpage is determined to contain the target information outside the body content;

wherein the target information comprises at least one of: an anchor link whose suffix includes at least one of PDF, XLSX, XLS, DOC and DOCX; an anchor link in folder form; and anchor text whose text format includes at least one of PDF, XLSX, XLS, DOC and DOCX.
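As an illustration only, the suffix-based part of this check could be sketched as follows. The names `AttachmentFinder` and `find_attachments` are hypothetical; the sketch scans every anchor in the source rather than only those outside the text content, and it does not attempt the "folder form" case, which the disclosure does not define further.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Suffixes listed in the disclosure, in lowercase for comparison.
ATTACHMENT_SUFFIXES = (".pdf", ".xlsx", ".xls", ".doc", ".docx")


class AttachmentFinder(HTMLParser):
    """Collects hrefs of anchors whose link path or anchor text ends in a listed suffix."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # anchor text collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            path = urlparse(self._href).path.lower()
            label = "".join(self._text).strip().lower()
            if path.endswith(ATTACHMENT_SUFFIXES) or label.endswith(ATTACHMENT_SUFFIXES):
                self.links.append(self._href)
            self._href = None


def find_attachments(source):
    """Return the hrefs that look like associated attachments."""
    finder = AttachmentFinder()
    finder.feed(source)
    finder.close()
    return finder.links
```

A page is then treated as having an associated attachment when `find_attachments` returns a non-empty list.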

Optionally, the method further comprises:

after the text content is obtained, filtering impurity content from the text content;

wherein the impurity content comprises: content whose text, written in a single tag, is shorter than a preset character length and contains no punctuation marks.
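A minimal sketch of this filter, assuming the text of each single tag has already been extracted into a list of strings. The function name, the default length of 10 and the punctuation set are assumptions made for illustration, not values given in the disclosure.

```python
import re

# Assumed punctuation set covering common Western and Chinese marks.
PUNCTUATION = re.compile(r"[.,;:!?。，；：！？、]")


def filter_impurities(fragments, min_length=10):
    """Drop fragments that are both shorter than min_length and free of punctuation."""
    kept = []
    for text in fragments:
        is_short = len(text.strip()) < min_length
        has_punctuation = PUNCTUATION.search(text) is not None
        if not (is_short and not has_punctuation):
            kept.append(text)
    return kept
```

Short fragments such as button labels are removed, while short sentences that carry punctuation are retained.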

In a second aspect of the embodiments of the present disclosure, there is provided a device for acquiring a text of a web page, including:

the acquisition module is configured to acquire a webpage source code of a target webpage;

the building module is configured to build a corresponding DOM tree according to the webpage source code;

the execution module is configured to generate a corresponding node list according to the text density of each child node in the DOM tree;

and the positioning module is configured to, for each title in at least one title included in the target webpage, position the text content in the target webpage and acquire the text content at least according to a position relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage when the text node matching the current title exists in the node list.

In a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is further provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method for acquiring a text of a web page.

In a fourth aspect of the embodiments of the present disclosure, there is also provided an electronic device, including:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to implement the methods of the embodiments of the present disclosure.

According to the method and the device, a corresponding DOM tree is constructed from the acquired webpage source code of the target webpage, and a corresponding node list is generated according to the text density of each child node of the DOM tree. When the node list contains a text node matching the current title, the position of the text content in the target webpage is located and the text content is acquired according to the positional relationship, in the target webpage, between the text content corresponding to the text node and the text title corresponding to the text node.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

FIG. 1 is a flow diagram illustrating a method for web page text retrieval in accordance with an exemplary embodiment;

FIG. 2 is a diagram illustrating building a DOM tree in accordance with an illustrative embodiment;

FIG. 3 is a flow diagram illustrating another method of web page text retrieval in accordance with an illustrative embodiment;

FIG. 4 is a block diagram illustrating a web page text acquisition apparatus in accordance with an illustrative embodiment;

FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure.

It should be noted that in the present disclosure, the terms "S101", "S102" and the like in the description and claims and the drawings are used for distinguishing the steps, and are not necessarily to be construed as performing the method steps in a specific order or sequence.

It should be understood that, in the related art, the body content of a web page is mainly determined and acquired according to text density. However, a method that considers text density alone may be interfered with by information such as announcement content, advertisement content and recommended extension content, so that the acquired body content is inaccurate.

In view of this, the embodiments of the present disclosure provide a method, an apparatus, a storage medium, and an electronic device for acquiring a text content of a web page, which can locate and acquire the text content of the web page based on a text title and a text density, so that the text content of the web page can be acquired more accurately and efficiently, the problem that the text content of the web page acquired from the web page is inaccurate due to interference of information such as similar announcement content, advertisement content, recommended extension content, and the like is avoided, time consumed by manually locating the text content of the web page can be greatly reduced, and further, the acquisition efficiency and accuracy of the text content of the web page can be improved.

The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.

Fig. 1 is a flowchart illustrating a method for acquiring a text of a web page according to an exemplary embodiment, the method including:

in step S101, a web page source code of the target web page is acquired.

In step S102, a DOM tree is built according to the web page source code.

In step S103, a corresponding node list is generated according to the text density of each child node in the DOM tree.

In step S104, for each of at least one title included in the target web page, when a text node matching the current title exists in the node list, at least according to a positional relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target web page, a position of the text content in the target web page is located and the text content is acquired.

In step S101, the web page source code of the target web page may be acquired in two ways. Firstly, a user can directly input a webpage source code of a target webpage; secondly, the user can input the URL of the target webpage, and then the system can load the corresponding target webpage through the URL input by the user and then acquire the corresponding webpage source code according to the loaded target webpage.
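The two acquisition paths of step S101 can be sketched as below. This is a minimal illustration, not the disclosed implementation: `get_page_source`, `fetch_url` and the `loader` parameter are hypothetical names, and a real system that "loads" the page may use a browser engine rather than a plain HTTP request.

```python
from typing import Callable, Optional
from urllib.request import urlopen


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Load a page over HTTP and return its source as text (assumed loader)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def get_page_source(source: Optional[str] = None,
                    url: Optional[str] = None,
                    loader: Callable[[str], str] = fetch_url) -> str:
    """Return the source code given directly, or load it from the URL otherwise."""
    if source:
        return source          # first way: the user supplied the source code
    if url:
        return loader(url)     # second way: load the page from its URL
    raise ValueError("either source or url must be provided")
```

Passing the loader as a parameter keeps the sketch testable without network access.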

In step S102, the process of building the DOM tree from the web page source code may include: converting the HTML file (the web page source code) into a DOM tree structure by the HTML parser inside the rendering engine. Referring to fig. 2, the conversion process includes: calling the HTMLTokenizer class to parse the byte stream (Bytes) of the HTML file into a character stream (Characters); calling the XSSAuditor class to parse the character stream into tokens (Tokens); calling the HTMLDocumentParser class or the HTMLTreeBuilder class to parse the tokens into a plurality of nodes; and calling the HTMLConstructionSite class to assemble the nodes into a DOM tree. Each stage of the process is carried out by calling the corresponding class.

The parsing process of the HTML parser may include: after the network process receives the response header, the type of the file is determined according to the Content-Type field in the response header. If the value of Content-Type is "text/html", the browser determines that the file is an HTML file, selects a corresponding parsing engine according to this result, and then selects or creates a rendering process corresponding to the parsing engine. After the rendering process is ready, a data sharing channel is established between the network process and the rendering process; the network process places the data it receives into the channel, and the rendering process continuously reads the data from the other end of the channel and passes the read data to the HTML parser.
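Outside a browser, the same source-to-tree conversion can be approximated with a standard-library tokenizer. The sketch below uses Python's `html.parser` in place of the rendering-engine classes named above; the `Node`, `TreeBuilder` and `build_dom` names are hypothetical, and the tree is far simpler than a real DOM.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A simplified DOM node holding its tag, attributes, parent, children and own text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""
        if parent is not None:
            parent.children.append(self)

    def full_text(self):
        """Own text followed by the text of all descendants."""
        return self.text + "".join(child.full_text() for child in self.children)


class TreeBuilder(HTMLParser):
    """Turns the token events of HTMLParser into a tree of Node objects."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        probe = self.current
        while probe is not self.root and probe.tag != tag:
            probe = probe.parent
        if probe is not self.root:
            self.current = probe.parent

    def handle_data(self, data):
        self.current.text += data


def build_dom(source):
    """Parse web page source code and return the root of the simplified tree."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```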

For example, in step S103, generating a corresponding node list according to the text density of each child node of the DOM tree may include: and sequencing the child nodes according to the sequence of the text density of the child nodes of the DOM tree from large to small to generate a node list.

Or, for example, in step S103, the child nodes may also be sorted according to the order from small to large of the text density of the child nodes of the DOM tree, so as to generate a node list.
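Both orderings amount to a sort keyed on text density. A minimal sketch, assuming each child node is represented by a hypothetical `(node_id, density)` pair whose density has already been computed:

```python
from typing import List, Tuple


def build_node_list(densities: List[Tuple[str, float]],
                    descending: bool = True) -> List[str]:
    """Order node identifiers by text density, largest first by default."""
    ranked = sorted(densities, key=lambda item: item[1], reverse=descending)
    return [node_id for node_id, _ in ranked]
```

With the descending order, the densest nodes are examined first when the node list is later searched for a text node.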

In step S104, for each of the at least one title contained in the target web page, when a text node matching the current title (which may be any one of the at least one title) exists in the node list, this indicates that the current title is a text title of the target web page. For that text node, the position of the text content in the target web page can therefore be located, and the text content acquired, according to the positional relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target web page.

It should be understood that the matched body node only corresponds to a small portion of the text content in the target web page, and the text content included in the portion of the text content is normally body content.

It should also be understood that for all the text titles in the target web page, the corresponding text content can be located and obtained according to the method provided in step S104.

In contrast to obtaining the body content of a web page based solely on text density, embodiments of the present disclosure construct a DOM tree from the source code of the web page and generate a node list according to the text density of each child node of the DOM tree. When the node list contains a text node matching the current title, the position of the text content in the target webpage is located and the text content is acquired according to the positional relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage. The text content of the webpage can therefore be acquired more accurately, and the inaccuracy caused by interference from information such as announcement content, advertisement content and recommended extension content is avoided. The time consumed by manually locating the webpage text can be greatly reduced, and the efficiency and accuracy of acquiring the webpage text can be improved. The method is also unaffected by website redesigns: after the website structure is updated, the text content of the webpage can still be obtained accurately without secondary development, which reduces labor cost.

The method shown in fig. 1 is illustrated in detail below with reference to specific examples.

It should be understood that when the position of the text content in the target web page is located only according to the positional relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target web page, mislocation is possible. For example, if an annotation in the web page contains both text content and a title, then with the method shown in fig. 1 the content of the annotation is likely to be mistaken for the text content of the web page and extracted. To solve this problem, embodiments of the disclosure can additionally use the ancestor nodes of the text node to locate the text content of the webpage more accurately.

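In an optional embodiment, locating the position of the text content in the target webpage and acquiring the text content at least according to the positional relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage may include:

locating the position of the text content in the target webpage and acquiring the text content according to at least one ancestor node of the text node and the positional relationship, in the target webpage, between the text content corresponding to the text node and the text title corresponding to the text node.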

Because whether a node is a text node or an annotation node can be identified from the attributes of its ancestor nodes, combining the ancestor nodes of the text node with the text content and the text title corresponding to the text node makes it possible to exclude content other than the text content, such as annotation content.

Further, according to at least one ancestor node of the text node, combined with the positional relationship in the target webpage between the text content corresponding to the text node and the text title corresponding to the text node, the smallest block that contains both the text title and the text content in the target webpage is determined, and the position of this smallest block is used as the position of the text content, thereby locating the text content in the target webpage. In the embodiment of the disclosure, the position of the text content in the webpage is determined from the ancestor nodes of the text node and the node positions of the text content and the text title, so that the text content can be conveniently obtained. Combining at least one ancestor node of the text node with the positional relationship prevents the located position from containing other content alongside the text content, which improves the accuracy of the obtained text content.
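One way to read "the smallest block that contains both the text title and the text content" is as the lowest common ancestor of the title node and the text node in the DOM tree. The sketch below makes that assumption; the `Node` class with a `parent` pointer and the function names are hypothetical, and the attribute check used to rule out annotation nodes is omitted.

```python
class Node:
    """Minimal tree node with a parent pointer."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


def ancestors(node):
    """Return the node followed by its ancestors, nearest first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def minimal_common_block(title_node, text_node):
    """Return the deepest node that contains both the title node and the text node."""
    title_chain = {id(n) for n in ancestors(title_node)}
    for candidate in ancestors(text_node):
        if id(candidate) in title_chain:
            return candidate
    return None  # the two nodes belong to different trees
```

Because the ancestors of the text node are visited nearest first, the first shared ancestor found is the smallest enclosing block.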

In an alternative embodiment, the text density of each child node in the DOM tree may be obtained by:

and aiming at each child node in the DOM tree, carrying out logarithm operation on the number of the titles contained in the child node and the number of the characters contained in the text corresponding to each title to obtain the text density of the child node.

In the embodiment of the disclosure, the text densities of the child nodes of the DOM tree are obtained, and the node list is generated according to the order of the text densities of the child nodes from large to small, so that whether the text node matched with the current title exists in the node list or not can be conveniently judged subsequently, and the position of the text content in the target webpage is positioned according to the judgment result, so that the time for obtaining the text content can be shortened, and the obtaining efficiency of the text content is improved.
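The disclosure states only that a logarithm operation is applied to the number of titles and the number of characters under each title; it does not give the exact formula. The sketch below therefore assumes one possible reading, the logarithm of one plus the average number of characters per title, purely for illustration. The function name and the formula are assumptions.

```python
import math
from typing import Sequence


def text_density(chars_per_title: Sequence[int]) -> float:
    """Assumed density: log(1 + average characters per title) for one child node."""
    title_count = len(chars_per_title)
    if title_count == 0:
        return 0.0  # a node without titles contributes no density in this reading
    total_chars = sum(chars_per_title)
    return math.log(1 + total_chars / title_count)
```

Under this reading, the logarithm compresses very long nodes so that a single oversized block does not dominate the ordering of the node list.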

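In an alternative embodiment, the text node in the node list that matches the current title may be determined by:

determining, in the node list, a first target node whose degree of correlation with the current title reaches a preset value and which is not a webpage anchor;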

and taking the first target node as a text node matched with the current title.

Further, whether a node is a webpage anchor is determined according to the features of anchors in the target webpage; a node that has none of these features is determined to be a non-webpage anchor. The features of an anchor include: it is a hyperlink within the page that allows quick positioning.

Further, according to the sequence from large to small of the text density of each child node in the DOM tree, NLP (Natural Language Processing) is used to determine whether a node exists in the node list, which has a correlation degree with the current title reaching a preset value and is a non-web anchor.

The preset value may be preset according to the acquisition precision of the text content, which is not specifically limited by the present disclosure.

According to the method, NLP is used to determine, in descending order of the text density of each child node in the DOM tree, the nodes in the node list whose degree of correlation with the current title reaches the preset value, and those of these nodes that are not webpage anchors are determined to be text nodes. The position of the text content in the target webpage is then located and the text content acquired according to the positional relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage, which improves the efficiency of acquiring the webpage text of the target webpage.
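The selection of the first target node can be sketched as follows. The disclosure does not specify the NLP model, so a simple word-overlap score stands in for the degree of correlation; the dictionary layout of the nodes, the function names and the threshold of 0.5 are assumptions, as is the rule that an `a` element with an `href` attribute counts as a webpage anchor.

```python
import re


def relevance(title, text):
    """Share of the title's words that also occur in the node text (stand-in score)."""
    title_words = set(re.findall(r"\w+", title.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    if not title_words:
        return 0.0
    return len(title_words & text_words) / len(title_words)


def find_text_node(node_list, title, threshold=0.5):
    """Return the first non-anchor node whose relevance reaches the threshold.

    node_list holds dicts with 'tag', 'attrs' and 'text', ordered by descending density.
    """
    for node in node_list:
        is_anchor = node["tag"] == "a" and "href" in node["attrs"]
        if not is_anchor and relevance(title, node["text"]) >= threshold:
            return node
    return None
```

Because the list is already ordered by text density, the first qualifying node is also the densest one.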

The above is the processing applied when a text node matching the current title exists in the node list. When no such text node exists, a candidate node needs to be determined and used as the text node. The candidate node corresponds to the bulk of the text content, but not all of the content it covers is body content. For this case, the text content can be obtained in the following way.

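In an optional embodiment, the method further includes:

when no text node matching the current title exists in the node list, cyclically traversing the node list until the number of traversals reaches a preset number, so as to determine a second target node in the node list that has the highest relevance to the current title;

taking the second target node as the text node corresponding to the current title;

and locating the position of the text content in the target webpage and acquiring the text content at least according to the positional relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage.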

The preset times may be preset according to the requirement of the user for obtaining precision of the text content, which is not specifically limited in the embodiment of the present disclosure.

In the embodiment of the disclosure, when no text node exists in the node list, the node list is traversed multiple times and the text node is determined according to the degree of correlation between the content corresponding to each node and the current title; that is, the text node is determined from the fact that the current title is positively correlated with the text content. The position of the text content in the target webpage is then located and the text content obtained, so that acquisition does not fail merely because the node list contains no matching text node. Because external factors can affect the relevance measured while the node list is traversed, a text node determined from a single traversal may yield text content with a large error or low precision; determining the text node from multiple traversals is more reliable and improves the precision of the obtained text content. Each traversal proceeds in descending order of the text density of the child nodes in the node list, which shortens the time needed to determine the text node and, in turn, the time needed to acquire the text content.
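The fallback can be sketched as repeated scoring passes whose results are accumulated before the best node is chosen. The function names, the default of three passes and the word-overlap score are assumptions; with a deterministic score such as the one used here every pass agrees, so the repetition matters only when the score varies between passes.

```python
import re


def word_overlap(title, text):
    """Share of the title's words found in the text (stand-in relevance score)."""
    title_words = set(re.findall(r"\w+", title.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(title_words & text_words) / len(title_words) if title_words else 0.0


def pick_best_node(node_texts, title, score=word_overlap, passes=3):
    """Return the index of the node with the highest accumulated relevance, or None."""
    totals = [0.0] * len(node_texts)
    for _ in range(passes):                     # preset number of traversals
        for index, text in enumerate(node_texts):
            totals[index] += score(title, text)
    if not totals:
        return None
    return max(range(len(totals)), key=totals.__getitem__)
```

When `node_texts` is ordered by descending text density, ties are resolved in favour of the denser node, since `max` returns the first maximal index.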

Because the body title of a webpage is positively correlated with the body content, the body content has a higher text density than the other text in the webpage, and the body title generally precedes the body content in the webpage, the body title needs to be determined before the body content is acquired. The body title in the target web page generally has prominent tag features, but in some cases the body title is also used as the web page title of the target web page. The body title can therefore be obtained in at least two ways.

In an embodiment, each of the at least one title included in the target webpage may be obtained by at least one of the following methods:

acquiring each title of at least one title contained in a target webpage according to the label characteristics of the titles in the target webpage;

acquiring each title of the at least one title contained in the target webpage from a title input along with the URL of the target webpage or along with the webpage source code.

Further, the web page title is the title of the home page of the website and is a high-level summary of the web page. The text titles are the titles of the article content of each part of the webpage.

In the embodiment of the disclosure, the text title in the target webpage can be obtained according to the tag feature of the text title, and the target title can be determined as the text title from a plurality of titles input along with the URL or the webpage source code of the target webpage.

In an embodiment, obtaining each title of at least one title included in the target web page according to the tag feature of the title in the target web page includes:

determining a target label in the target webpage as a title label;

acquiring each title in at least one title contained in the target webpage according to the title label;

wherein the target tag comprises at least one of: an H1 to H5 heading tag, a tag whose properties are set by Style, and a tag that introduces a CSS (Cascading Style Sheets) style by Class.

The H1 to H5 heading tags can be understood as indicating, in part, the importance of each piece of text in the web page: for example, H1 marks the most important text content in the web page, and H5 marks the relatively least important. A tag whose properties are set by Style can be understood as one for which style information is defined for the HTML (Hyper Text Markup Language) document of the web page through Style.

Because a text title in a webpage typically carries one or more of the three tag features above, the text title in the target webpage is obtained according to these general features, so that the text title is obtained accurately and the subsequent positioning of the text content is made easier.
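The tag-feature approach can be sketched with a standard-library parser. The class-name and style hints below are assumptions chosen for illustration, as are the names `TitleCollector` and `extract_titles`; the sketch treats "Style" as an inline `style` attribute, which is only one possible reading of the disclosure.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5"}
CONTAINER_TAGS = {"div", "span", "p", "td", "font", "strong"}
TITLE_CLASS_HINTS = ("title", "headline")   # assumed class-name hints
TITLE_STYLE_HINTS = ("font-weight:bold",)   # assumed inline-style hints


def is_title_tag(tag, attrs):
    """True for H1 to H5, or for a container whose style or class suggests a title."""
    if tag in HEADING_TAGS:
        return True
    if tag not in CONTAINER_TAGS:
        return False
    attributes = dict(attrs)
    style = (attributes.get("style") or "").replace(" ", "").lower()
    css_class = (attributes.get("class") or "").lower()
    return (any(hint in style for hint in TITLE_STYLE_HINTS)
            or any(hint in css_class for hint in TITLE_CLASS_HINTS))


class TitleCollector(HTMLParser):
    """Collects the text of every outermost element that looks like a title."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._tag = None     # tag name of the title element being captured
        self._depth = 0      # nesting depth of same-named tags inside it
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None:
            if is_title_tag(tag, attrs):
                self._tag, self._depth, self._buffer = tag, 1, []
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.titles.append(text)
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)


def extract_titles(source):
    """Return candidate titles in document order."""
    collector = TitleCollector()
    collector.feed(source)
    collector.close()
    return collector.titles
```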

When the body title is used as the web page title of the target web page, a plurality of titles are obtained, and only part of the titles are the body title, so that the body title needs to be determined from the plurality of titles.

In an optional embodiment, acquiring each title of the at least one title contained in the target webpage from a title input along with the URL of the target webpage or along with the webpage source code includes:

determining a plurality of candidate titles from the titles input along with the URL of the target webpage or along with the webpage source code;

and determining each title in at least one title contained in the target webpage from a plurality of candidate titles according to the text similarity between each candidate title and the target webpage.

Among the candidate titles, the title with the highest text similarity to the body content may be used as the body title.

Further, the method for determining the text similarity adopted in the embodiment of the present disclosure may be the Jaccard algorithm, the shingling algorithm, the MinHash algorithm, and the like, which is not specifically limited in the present disclosure.

In the embodiment of the disclosure, when the text title is used as the webpage title of the target webpage, the titles input along with the URL of the target webpage or along with the webpage source code are used as candidate titles, and the text similarity between each candidate title and the text content is determined. The candidate title with the highest text similarity to the target webpage is then used as the text title, so that the text content in the target webpage can subsequently be located and obtained according to the text title.
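As an example of the first of the listed options, the candidate selection can be sketched with Jaccard similarity over character shingles. The shingle length of 2 and the function names are assumptions; character shingles are used here because they also work for text without spaces between words.

```python
def shingles(text, k=2):
    """Set of overlapping k-character substrings, ignoring case and whitespace."""
    chars = "".join(text.lower().split())
    if len(chars) < k:
        return {chars} if chars else set()
    return {chars[i:i + k] for i in range(len(chars) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two strings."""
    set_a, set_b = shingles(a), shingles(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def choose_title(candidates, page_text):
    """Return the candidate title most similar to the page text, or None."""
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: jaccard(candidate, page_text))
```

For long pages, MinHash can approximate the same similarity at lower cost, which is presumably why it appears among the listed alternatives.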

Because different web page developers have different coding habits and styles, web page formats and writing conventions may differ, and some formats may not even meet browser requirements, so the web page structure may be altered when the DOM tree is built. In addition, noise in the target web page may make the subsequently acquired text content inaccurate. The web page source code may therefore be preprocessed before the DOM tree is built from it.

Therefore, in an embodiment, the method further includes: and preprocessing the webpage source code before constructing a corresponding DOM tree according to the webpage source code.

In one embodiment, the web page source code is preprocessed, including at least one of:

carrying out normalization processing on the webpage source code to obtain a webpage source code in a target format;

and cleaning the webpage source code in the target format through a regular expression or an XML path language.

The target format may be a format that can be parsed by regular expressions or by the XML Path Language (XPath). The algorithms involved in normalization, regular expressions, and the XML Path Language are well known to those skilled in the art and are not described herein.

For example, the web page source code is normalized to implement preprocessing of the web page source code. By carrying out the normalization processing on the webpage source codes, the DOM tree construction failure caused by different webpage formats, writing methods and the like can be avoided.

For example, the webpage source code is cleaned through a regular expression or an XML path language, so as to implement preprocessing of the webpage source code. By cleaning the webpage source code, information irrelevant to the text content, such as a webpage navigation bar, a recommendation bar, a header, a footer and the like in the target webpage, is eliminated, and inaccuracy of the acquired text content caused by noise in the target webpage is avoided.

For example, the web page source code is first normalized to obtain web page source code in a target format parsable by regular expressions or XPath, and the web page source code in the target format is then cleaned through a regular expression or the XML path language.

According to the embodiment of the present disclosure, the webpage source code is normalized into a target format parsable by regular expressions or XPath. This avoids failures in acquiring the text content or in building the DOM tree that would otherwise arise because web page developers write pages differently, web page formats and writing conventions differ, some formats do not even meet browser requirements, and the web page structure is altered when the DOM tree is built. The webpage source code in the target format is then cleaned through regular expressions or XPath: HTML writing errors are checked and corrected through regular expressions, and information irrelevant to the text content, such as the navigation bar, recommendation bar, header, and footer of the target webpage, is eliminated. This prevents noise in the target webpage from making the acquired text content inaccurate and improves the accuracy of the acquired text content.
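
The normalize-then-clean sequence can be sketched with regular expressions as follows. This is only an illustrative sketch: the list of noise tags, the specific normalization steps, and the function names are assumptions chosen for the example, and the disclosure does not fix any of them.

```python
import re

# Tags assumed to hold page chrome rather than body text (illustrative list).
NOISE_TAGS = ("script", "style", "nav", "header", "footer", "aside")


def normalize(source: str) -> str:
    """Light normalization: unify line endings, drop comments, lowercase tag names."""
    source = source.replace("\r\n", "\n").replace("\r", "\n")
    source = re.sub(r"<!--.*?-->", "", source, flags=re.DOTALL)
    # Lowercase tag names so that later patterns need only one spelling.
    return re.sub(
        r"<(/?)([A-Za-z][A-Za-z0-9]*)",
        lambda m: f"<{m.group(1)}{m.group(2).lower()}",
        source,
    )


def clean(source: str) -> str:
    """Remove whole elements whose tag is listed in NOISE_TAGS."""
    for tag in NOISE_TAGS:
        pattern = rf"<{tag}\b[^>]*>.*?</{tag}\s*>"
        source = re.sub(pattern, "", source, flags=re.DOTALL | re.IGNORECASE)
    return source


if __name__ == "__main__":
    page = ("<HTML><BODY><NAV><a href='/'>Home</a></NAV><!-- ad -->"
            "<DIV><H1>Notice</H1><P>Body text.</P></DIV>"
            "<FOOTER>Copyright</FOOTER></BODY></HTML>")
    print(clean(normalize(page)))
```

Regular expressions cannot track nested elements of the same tag, so a production cleaner would more likely remove such nodes through XPath on a parsed tree; the regex form is shown only because it needs nothing beyond the standard library.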

When the attachment information of a web page lies outside the text content, the technical means disclosed above may miss the associated attachments located outside the text content.

In one embodiment, the method further comprises:

determining whether the target webpage has an associated attachment outside the text content;

and acquiring the text content of the associated attachment under the condition that the target webpage has the associated attachment outside the text content.

According to the embodiment of the disclosure, whether the associated attachment exists outside the text content of the target webpage can be determined in any step of acquiring the text content of the target webpage, for example, in the step of acquiring the webpage source code of the target webpage or the step of constructing the corresponding DOM tree according to the webpage source code, so that the text content of the associated attachment of the target webpage outside the text content is prevented from being missed. The method for acquiring the text content of the associated attachment is the same as or similar to the method for acquiring the text content in the target webpage.

Because an associated attachment can exist in a webpage in various forms, such as an anchor link to a file, an anchor link to a folder, or anchor text containing a file-type keyword, the embodiment of the present disclosure determines whether the target webpage has an associated attachment outside the body content according to the form in which associated attachments exist in webpages.

In one embodiment, determining whether the target web page has an associated attachment outside the body content includes:

determining that the target webpage has the associated attachment outside the text content under the condition that the target webpage contains the target information outside the text content;

wherein the target information comprises at least one of: an anchor link whose suffix includes at least one of PDF, XLSX, XLS, DOC and DOCX; an anchor link in folder form; and anchor text whose text includes at least one of PDF, XLSX, XLS, DOC and DOCX.

According to the embodiment of the disclosure, whether the target webpage has the associated attachment or not is determined according to the existence form of the associated attachment in the webpage, so that the content of the associated attachment of the target webpage outside the text content is prevented from being missed.
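
A minimal sketch of this check, using only the Python standard library, is given below. It covers two of the three forms of target information (file suffix in the link and file-type keyword in the anchor text); the folder-form link is omitted because the disclosure does not define how such a link is recognized. The class and function names are hypothetical, and the caller is assumed to pass in only the portion of the page that lies outside the text content.

```python
import re
from html.parser import HTMLParser

ATTACHMENT_SUFFIXES = (".pdf", ".xlsx", ".xls", ".doc", ".docx")
# Keyword must not be embedded in a longer Latin word (e.g. "Documents").
_KEYWORD = re.compile(r"(?<![A-Za-z])(?:PDF|XLSX?|DOCX?)(?![A-Za-z])", re.IGNORECASE)


def is_attachment(href: str, anchor_text: str) -> bool:
    """True if the link suffix or the anchor text names an attachment type."""
    path = href.split("?", 1)[0].split("#", 1)[0].lower()
    return path.endswith(ATTACHMENT_SUFFIXES) or bool(_KEYWORD.search(anchor_text))


class AttachmentFinder(HTMLParser):
    """Collects (href, anchor text) pairs for anchors that look like attachments."""

    def __init__(self):
        super().__init__()
        self.attachments = []
        self._href = None      # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if is_attachment(self._href, text):
                self.attachments.append((self._href, text))
            self._href = None


def find_attachments(html_outside_body: str):
    finder = AttachmentFinder()
    finder.feed(html_outside_body)
    finder.close()
    return finder.attachments
```

A non-empty result from `find_attachments` corresponds to the case in which the target webpage is determined to have an associated attachment outside the text content.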

Because other impurity information except the associated attachments may exist in the webpage, the other impurity information needs to be removed before the data is put into a database.

Therefore, according to the embodiment of the disclosure, only the content of the associated attachment of the target webpage except the text content is acquired, and other information except the webpage text and the associated attachment in the target webpage can be removed, so that interference of other information on acquiring the text content and the associated attachment is avoided, and the accuracy of acquiring the text content can be improved.

Because some webpages inject private domain information irrelevant to the text content into the text content, such as advertisements, friend circle risks, and printing traces, impurities can be removed from the text content after it is obtained, which improves the purity of the extracted text content.

In one embodiment, the method further comprises: after the text content is obtained, filtering impurity content from the text content; wherein the impurity content comprises: content whose text, written in a single tag, is shorter than a preset character length and contains no punctuation marks.

According to the embodiment of the disclosure, impurity information irrelevant to the text content, such as advertisements, friend circle risks, and printing traces, is removed from the text content according to the preset conditions, which improves the precision of text content acquisition and prevents the extracted text content from being affected by advertisements and similar information.
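
The filtering condition can be sketched as below. The preset character length of 15 and the punctuation set are assumptions chosen for the example, since the disclosure does not specify their values; each input string is assumed to be the text written in one tag.

```python
# Punctuation marks checked by the filter (assumed set, Chinese and Latin).
PUNCTUATION = set("，。！？；：、,.!?;:")


def is_impurity(text: str, min_length: int = 15) -> bool:
    """True if the text is shorter than min_length and has no punctuation."""
    stripped = text.strip()
    has_punctuation = any(ch in PUNCTUATION for ch in stripped)
    return len(stripped) < min_length and not has_punctuation


def filter_impurities(fragments: list[str], min_length: int = 15) -> list[str]:
    """Keep only the fragments that are not classified as impurity content."""
    return [t for t in fragments if not is_impurity(t, min_length)]


if __name__ == "__main__":
    fragments = [
        "Print",
        "Advertisement",
        "The board approved the annual report.",
        "Agreed.",
    ]
    print(filter_impurities(fragments))
```

Both conditions must hold for a fragment to be dropped, so a short sentence that ends with punctuation, such as "Agreed.", is kept.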

Fig. 3 is another flowchart illustrating a method for acquiring text of a web page according to an exemplary embodiment. Referring to fig. 3, the method includes:

in step S201, a web page source code of the target web page is acquired.

The web page source code may be collected from the web page through a web page URL (Uniform Resource Locator), or may be directly input by the user.

In step S202, the web page source code is preprocessed.

Preprocessing the webpage source code includes: normalizing the web page source code to obtain web page source code in a target format; or cleaning the web page source code through a regular expression or XPath; or normalizing the web page source code to obtain a target format parsable by regular expressions or XPath and then cleaning the web page source code in the target format through a regular expression or XPath.

In step S203, a corresponding DOM tree is built according to the web page source code.

In step S204, a corresponding node list is generated according to the text density of each child node in the DOM tree.

Specifically, for each child node in the DOM tree, performing logarithm operation based on the number of titles contained in the child node and the number of characters contained in the text corresponding to each title to obtain the text density of the child node; and sequencing according to the sequence of the text density of each child node from large to small to generate a corresponding node list.
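
The density computation and sorting in step S204 can be sketched as follows. The passage above states only that a logarithm operation is applied to the number of titles and the character counts of their texts, so the concrete formula used here, ln(1 + total characters / number of titles), is an assumption made for illustration, as are the `Node` type and its field names.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    path: str                                    # hypothetical identifier of the DOM child node
    title_text_lengths: list[int] = field(default_factory=list)
    # one entry per title in the node: character count of that title's text


def text_density(node: Node) -> float:
    """Assumed density: log of the average text length per title."""
    count = len(node.title_text_lengths)
    if count == 0:
        return 0.0
    return math.log(1 + sum(node.title_text_lengths) / count)


def build_node_list(nodes: list[Node]) -> list[Node]:
    """Sort the child nodes by text density, from large to small."""
    return sorted(nodes, key=text_density, reverse=True)


if __name__ == "__main__":
    nodes = [
        Node("nav", [4, 5, 3, 6]),
        Node("article", [1200]),
        Node("sidebar", [40, 60]),
    ]
    print([n.path for n in build_node_list(nodes)])
```

Under this assumed formula, a node with one title and a long text ranks ahead of a navigation block that holds many titles with very short texts.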

In step S205, in the case that a text node matching the current title exists in the node list, the first target node in the node list, namely a node that is not a web page anchor and whose relevance to the current title reaches a preset value, is taken as the text node.
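
The selection in step S205 can be sketched as a single pass over the density-sorted node list. The relevance measure (fraction of the title's words found in the node text), the threshold of 0.8, and the `Candidate` type are assumptions for this example; the disclosure does not state how relevance is computed, and a word-based measure would need replacing for text without spaces between words.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    tag: str    # tag name of the DOM node, e.g. "div" or "a"
    text: str   # visible text of the node


def relevance(title: str, text: str) -> float:
    """Assumed relevance: fraction of the title's words that occur in the node text."""
    words = title.lower().split()
    if not words:
        return 0.0
    haystack = text.lower()
    return sum(word in haystack for word in words) / len(words)


def first_target_node(node_list: list[Candidate], title: str,
                      threshold: float = 0.8) -> Optional[Candidate]:
    """Return the first non-anchor node whose relevance reaches the threshold."""
    for node in node_list:
        if node.tag != "a" and relevance(title, node.text) >= threshold:
            return node
    return None
```

Anchors are skipped because a link elsewhere on the page often repeats the title text verbatim without being the place where the text content sits.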

In step S206, the position of the text content in the target webpage is located and the text content is obtained at least according to the position relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage.

In step S207, impurity content is filtered out of the text content.

Specifically, text written in a single tag can be filtered out when its length is smaller than the preset character length and it contains no punctuation marks, so that impurities irrelevant to the text content, such as advertisements, friend circle risks, and printing traces, can be eliminated from the text content.

In step S208, it is determined whether the target web page has an associated attachment outside the body content.

Specifically, whether the target webpage contains target information outside the text content is determined, and under the condition that the target webpage contains the target information outside the text content, the target webpage is determined to have the associated attachment outside the text content; wherein the target information comprises at least one of: an anchor link whose suffix includes at least one of PDF, XLSX, XLS, DOC and DOCX; an anchor link in folder form; and anchor text whose text includes at least one of PDF, XLSX, XLS, DOC and DOCX.

In step S209, the body content of the associated attachment is acquired.

The method for acquiring the text content of the associated attachment is the same as or similar to the method for acquiring the text content in the target webpage, and the method is not described in the embodiment of the disclosure.

In step S210, the data is stored in a database.

The text content after the impurities are removed and the text content of the associated attachment are stored in a database for later use.

The embodiment of the disclosure can analyze a single webpage, taking the webpage source code as input and outputting the text content. Within a single website, however, the position of the text content is fixed. A method for acquiring the position of the text content can therefore be generated from the technical scheme above, which runs from receiving the webpage information to obtaining the position of the text content. With this method, the specific position of the text content in a given webpage of the website can be determined and the text content acquired from that position, and the same method can be applied to other webpages of the website that are of the same kind as that webpage.
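
One way to reuse a located position across pages of the same website is a per-site cache, sketched below. The class name, the idea of representing the position as a string (for example an XPath-like path), and the `derive_locator` callback are all assumptions made for this example; the disclosure does not prescribe how the position-acquisition method is stored or keyed.

```python
from typing import Callable
from urllib.parse import urlparse


class SiteLocatorCache:
    """Derives a text-content locator once per website and reuses it afterwards."""

    def __init__(self, derive_locator: Callable[[str], str]):
        # derive_locator stands in for the full pipeline (DOM tree, node list,
        # title matching) and returns the position of the text content.
        self._derive = derive_locator
        self._by_site: dict[str, str] = {}

    def locator_for(self, url: str, page_source: str) -> str:
        site = urlparse(url).netloc.lower()
        if site not in self._by_site:
            self._by_site[site] = self._derive(page_source)
        return self._by_site[site]
```

With such a cache, the full analysis runs only for the first page seen on a website, and later pages of the same kind are read directly at the stored position.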

Based on the same inventive concept, the present disclosure provides a web page text obtaining apparatus, referring to fig. 4, the apparatus 400 includes an obtaining module 401, a constructing module 402, an executing module 403, and a positioning module 404.

The obtaining module 401 is configured to obtain a web page source code of a target web page.

The constructing module 402 is configured to build a corresponding DOM tree from the web page source code.

The execution module 403 is configured to generate a corresponding node list according to the text density of each child node in the DOM tree.

The positioning module 404 is configured to, for each of at least one title included in the target web page, in a case that a text node matching the current title exists in the node list, position the text content in the target web page and acquire the text content at least according to a positional relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target web page.

According to the method and the device, the DOM tree is built from the webpage source code, and the node list is generated according to the text density of each child node of the DOM tree. Under the condition that a text node matched with the current title exists in the node list, the position of the text content in the target webpage is located and the text content is obtained according to the position relation between the text content corresponding to the corrected text node and the text title corresponding to the text node in the target webpage. The text content of the webpage can thus be obtained more accurately, inaccuracy caused by interference from similar announcement content, advertisement content, recommended extension content, and the like is avoided, the time consumed by manually locating the webpage text is greatly reduced, and both the efficiency and the accuracy of webpage text acquisition are improved. The method is also unaffected by website version changes: after the website structure is updated, the text content of the webpage can still be obtained accurately without secondary development, which reduces labor cost.

Further, the positioning module 404 is configured to position the text content in the target webpage and obtain the text content according to at least one ancestor node of the text node and a position relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage.

Further, the execution module 403 is configured to, for each child node in the DOM tree, perform a logarithm operation based on the number of titles included in the corresponding child node and the number of characters included in the text corresponding to each title to obtain the text density of the corresponding child node.

Further, the positioning module 404 is configured to determine a first target node in the node list, the first target node being a node that is not a web page anchor and whose relevance to the current title reaches a preset value; and to take the first target node as the text node matched with the current title.

Further, the apparatus 400 further includes a control module configured to cycle through the node list until the number of traversal times reaches a preset number of times, so as to determine a second target node in the node list that has a highest correlation with the current title; taking the second target node as a text node corresponding to the current title; and positioning the position of the text content in the target webpage and acquiring the text content at least according to the position relation between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage.

Further, the positioning module 404 is configured to obtain each title of the at least one title contained in the target webpage according to the tag feature of the title in the target webpage; and/or to obtain each title of the at least one title contained in the target webpage by obtaining a title input along with the URL of the target webpage or along with the webpage source code.

Further, the positioning module 404 is configured to determine a target tag in the target webpage as a title tag; and to acquire each title of the at least one title contained in the target webpage according to the title tag; wherein the target tag comprises at least one of: a tag denoted by H1 to H5, a tag whose properties are set through the Style attribute, and a tag that introduces a CSS style through Class.

Further, the positioning module 404 is configured to determine a plurality of candidate titles from the titles input along with the URL of the target webpage or along with the webpage source code; and to determine each title of the at least one title contained in the target webpage from the plurality of candidate titles according to the text similarity between each candidate title and the target webpage.

Further, the apparatus 400 further includes a preprocessing module configured to preprocess the source code of the web page.

Further, the preprocessing module is configured to perform normalization processing on the web page source code to obtain a web page source code in a target format; and/or cleaning the webpage source code of the target format through a regular expression or an XML path language.

Further, the apparatus 400 further includes a determining module configured to determine whether an associated attachment exists outside the text content of the target webpage; and under the condition that the target webpage has the associated attachment outside the text content, acquiring the text content of the associated attachment.

Further, the determining module is configured to determine whether the target webpage contains target information outside the text content; and to determine that the associated attachment exists in the target webpage under the condition that the target webpage contains the target information outside the text content; wherein the target information comprises at least one of: an anchor link whose suffix includes at least one of PDF, XLSX, XLS, DOC and DOCX; an anchor link in folder form; and anchor text whose text includes at least one of PDF, XLSX, XLS, DOC and DOCX.

Further, the apparatus 400 further comprises a filtering module configured to filter impurity content from the text content; wherein the impurity content comprises: content whose text, written in a single tag, is shorter than the preset character length and contains no punctuation marks.

Furthermore, it should be noted that, for convenience and brevity of description, the embodiments described in the specification all belong to the preferred embodiments, and the related parts are not necessarily essential to the present invention, for example, the obtaining module and the constructing module may be independent devices or may be the same device when being implemented specifically, and the disclosure is not limited thereto.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Based on the same inventive concept, the present disclosure also provides a computer-readable storage medium on which computer program instructions are stored, which, when executed by a processor, implement the steps of the web page text acquisition method provided by the present disclosure.

Specifically, the computer-readable storage medium may be a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, a public cloud server, etc.

With respect to the computer-readable storage medium in the above embodiments, the steps of the webpage text acquisition method implemented when the computer program stored thereon is executed have been described in detail in the embodiments of the method and will not be elaborated herein.

Based on the same inventive concept, the present disclosure also provides an electronic device, including:

a memory having a computer program stored thereon;

and the processor is used for executing the computer program in the memory so as to realize the steps of the webpage text acquisition method.

The method includes the steps of constructing a DOM tree through a webpage source code, generating a node list according to the text density of each child node of the DOM tree, and positioning the position of the text content in a target webpage and acquiring the text content according to the position relation between the text content corresponding to a corrected text node and the text title corresponding to the text node in the target webpage under the condition that the text node matched with the current title exists in the node list. The time for manually positioning the webpage text is greatly reduced, the interference of information such as similar announcement content, recommended extended content and the like is avoided, and the acquisition efficiency of the webpage text is improved.

Fig. 5 is a block diagram illustrating an electronic device 500 in accordance with an example embodiment. As shown in fig. 5, the electronic device 500 may include: a processor 501 and a memory 502. The electronic device 500 may also include one or more of a multimedia component 503, an input/output (I/O) interface 504, and a communication component 505.

The processor 501 is configured to control the overall operation of the electronic device 500, so as to complete all or part of the steps in the webpage text obtaining method. The memory 502 is used to store various types of data to support operations at the electronic device 500, such as instructions for any application or method operating on the electronic device 500, and application-related data, such as web page source code, body titles, body nodes, body content, and so forth. The memory 502 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 503 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 502 or transmitted through the communication component 505. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 504 provides an interface between the processor 501 and other interface modules, such as a keyboard, a mouse, or buttons. These buttons may be virtual buttons or physical buttons. The communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or a combination of one or more of them, which is not limited herein. Accordingly, the communication component 505 may comprise a Wi-Fi module, a Bluetooth module, an NFC module, and the like.

In an exemplary embodiment, the electronic Device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the web page text acquisition method.

In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the web page text acquisition method described above when executed by the programmable apparatus.

The preferred embodiments of the present disclosure are described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments; various simple modifications may be made to the technical solution of the present disclosure within its technical idea, and these simple modifications all belong to the protection scope of the present disclosure.

It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.

In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims

1. A webpage text acquisition method comprises the following steps:

acquiring a webpage source code of a target webpage;

constructing a corresponding DOM tree according to the webpage source code;

generating a corresponding node list according to the text density of each child node in the DOM tree;

and for each title of at least one title contained in the target webpage, under the condition that a text node matched with the current title exists in the node list, positioning the position of the text content in the target webpage and acquiring the text content at least according to the position relation between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage.

2. The method of claim 1, wherein positioning the position of the text content in the target webpage and acquiring the text content at least according to the position relationship between the text content corresponding to the text node and the text title corresponding to the text node in the target webpage comprises:

3. The method of claim 1, wherein the text density of each child node in the DOM tree is obtained by:

4. The method of claim 1, wherein the body node in the node list that matches the current title is determined by:

and taking the first target node as a text node matched with the current title.

5. The method of claim 1, further comprising:

6. The method of claim 1, further comprising obtaining each of the at least one title contained in the target web page by at least one of:

7. The method of claim 6, wherein the obtaining each title of the at least one title included in the target webpage according to the tag feature of the title in the target webpage comprises:

determining a target label in the target webpage as a title label;

8. The method of claim 6, wherein the obtaining each title of the at least one title contained in the target webpage by obtaining a title entered with the URL of the target webpage or with the webpage source code comprises:

9. The method of claim 1, further comprising: and preprocessing the webpage source code before constructing a corresponding DOM tree according to the webpage source code.

10. The method of claim 9, wherein preprocessing the web page source code comprises at least one of:

11. The method of claim 1, further comprising:

12. The method of claim 11, wherein determining whether the target web page has an associated attachment outside of the body content comprises:

wherein the target information comprises at least one of:

an anchor link whose suffix comprises at least one of PDF, XLSX, XLS, DOC and DOCX;

an anchor link in folder form;

anchor text whose text includes at least one of PDF, XLSX, XLS, DOC and DOCX.

13. The method of claim 1, further comprising:

14. A web page text acquisition apparatus comprising:

15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1-13.

16. An electronic device, comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to implement the method of any one of claims 1-13.

17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-13.