CN115658993A - Intelligent extraction method and system for core content of webpage - Google Patents

Intelligent extraction method and system for core content of webpage

Info

Publication number
CN115658993A
Authority
CN
China
Prior art keywords
rule template
text
webpage
core
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211183811.0A
Other languages
Chinese (zh)
Other versions
CN115658993B (en
Inventor
吕璞鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Canglang Health Management Co ltd
Hangzhou Clove Health Management Co ltd
Hangzhou Lianke Meixun Biomedical Technology Co ltd
Guanlan Networks Hangzhou Co ltd
Original Assignee
Hangzhou Canglang Health Management Co ltd
Hangzhou Clove Health Management Co ltd
Hangzhou Lianke Meixun Biomedical Technology Co ltd
Guanlan Networks Hangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Canglang Health Management Co ltd, Hangzhou Clove Health Management Co ltd, Hangzhou Lianke Meixun Biomedical Technology Co ltd, Guanlan Networks Hangzhou Co ltd
Priority to CN202211183811.0A
Publication of CN115658993A
Application granted
Publication of CN115658993B
Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method and a system for intelligently extracting core content of a webpage, wherein the method comprises the following steps: acquiring corresponding webpage structure information according to preset webpage address information; acquiring core text information from the webpage structure information based on a preset extraction mode; respectively generating a text rule template, a title rule template and a release time rule template according to the core text information and the webpage structure information; determining a universal rule template according to the text rule template, the title rule template and the release time rule template; and extracting the core content of the current webpage based on the universal rule template. The intelligent extraction can be realized according to the extraction algorithm based on the structural information of the web pages, a universal rule template can be generated, and the core content can be directly and quickly extracted from the same type of web pages through the rule template, so that the large-scale structured data extraction and monitoring can be efficiently carried out.

Description

Intelligent extraction method and system for core content of webpage
Technical Field
The application relates to the field of computer vision, in particular to an intelligent extraction method and system for core contents of a webpage.
Background
With the rapid development of the internet, both enterprises and individuals have a growing need to monitor webpage data. For example, a company that tracks public opinion wants to extract core content from articles published by major media outlets so that it can follow changes in specific public opinion in time; an enterprise interested in hospital recruitment announcements wants to extract core content from the announcements on each hospital's official website so that it can follow specific job changes in time.
Core content is acquired on the basis of large-scale webpage monitoring. Different webpages have different structures and therefore different data extraction rules, so before data extraction the extraction script must be adjusted to each webpage structure.
The prior art still requires manual extraction configuration for each different webpage. When the number of webpages is large, the workload is too heavy to complete the configuration in a short time, so large-scale data cannot be rapidly extracted and monitored.
Disclosure of Invention
The application aims to provide an automatic intelligent extraction and monitoring service without human participation, and large-scale rapid extraction and monitoring are carried out on the title, the release time and the core content of a webpage.
In a first aspect, the present application provides a method and a system for intelligently extracting core content of a web page, which adopt the following technical solutions:
an intelligent extraction method of core contents of a webpage comprises the following steps:
acquiring corresponding webpage structure information according to preset webpage address information;
acquiring core text information from the webpage structure information based on a preset extraction mode;
respectively generating a text rule template, a title rule template and a release time rule template according to the core text information and the webpage structure information;
determining a universal rule template according to the text rule template, the title rule template and the release time rule template;
and extracting the core content of the current webpage based on the universal rule template.
By the technical scheme, the text, the title and the release time of the webpage can be intelligently extracted according to the webpage address information, a universal rule template can be generated, and the core content of the same type of webpage can be quickly acquired through the universal rule template so as to realize large-scale monitoring of webpage data.
Optionally, the obtaining core text information from the webpage structure information based on the preset extraction manner includes:
constructing a DOM tree according to the webpage structure information;
calculating and obtaining the text density and the symbol density of each node in the DOM tree;
calculating scores of all nodes of the DOM tree through a preset algorithm according to the text density and the symbol density;
and sequencing all the nodes from high to low according to the corresponding scores, and taking the content contained in the node with the top ranking as core text information.
Optionally, the generating a text rule template, a title rule template, and a release time rule template according to the core text information and the web page structure information includes:
generating a label tree according to the webpage structure information, wherein the label tree comprises a plurality of leaf nodes, and each leaf node corresponds to an absolute path and text information;
acquiring corresponding text information by traversing leaf nodes of the label tree;
judging whether the text information exists in the core text information or not;
if the node exists, recording the absolute path of the current node as an effective path;
all the effective paths are combined to obtain the longest common path, namely the rule template of the text.
Optionally, the merging all the effective paths to obtain the longest common path includes:
carrying out pairwise similarity calculation on the effective paths to obtain corresponding similarity path values and form a similarity set;
calculating the mode and the mean of the similarity set to obtain a threshold point for the effective paths;
selecting the effective paths whose similarity path values are greater than or equal to the threshold point as a most similar effective path set;
acquiring, according to the most similar effective path set, the ratio of the most frequent identical node at each position to the total number of paths at that position;
selecting the nodes whose ratio is larger than a preset threshold value to form an effective public path;
and acquiring the longest public path, namely the rule template of the text, from all the effective public paths.
Optionally, the generating a text rule template, a title rule template, and a release time rule template according to the core text information and the web page structure information includes:
acquiring text content of the title label according to the webpage structure information, recording the text content as an experimental group,
acquiring text contents of the h label and a class div label containing the title, and recording the text contents as a comparison group;
and calculating the longest common substring for the experimental group and the comparison group, comparing the text content lengths of the experimental group and the comparison group containing the longest common substring, and selecting one group with longer text content as a title and a rule template of the title.
Optionally, the generating a text rule template, a title rule template, and a release time rule template according to the core text information and the web page structure information respectively includes:
acquiring all character strings corresponding to date and time according to the webpage structure information;
sequentially carrying out time standardization processing on all the obtained character strings, and respectively obtaining corresponding standardization time;
selecting the standard time with the highest matching degree as the release time through a preset standard time template;
and reversely calculating an absolute path of the release time by using the label tree according to the text information corresponding to the release time, namely a release time rule template.
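The time standardization steps above can be sketched as follows. This is only an illustration under stated assumptions: the candidate pattern, the format list, and the function names `normalize` and `release_time` are hypothetical, and the "highest matching degree" is approximated here by preferring the most detailed date string, which the source does not specify.

```python
import re
from datetime import datetime
from typing import Optional

# Illustrative candidate pattern (assumption): a date with optional clock time.
_CANDIDATE = re.compile(
    r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)
# Standard time templates tried in order, most detailed first (assumption).
_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d")


def normalize(raw: str) -> Optional[datetime]:
    """Convert one date/time string to a datetime, or None if no template fits."""
    cleaned = re.sub(r"[/年月]", "-", raw).replace("日", "")
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


def release_time(page_text: str) -> Optional[datetime]:
    """Pick one standardized time from all date-like strings in the text."""
    parsed = [(normalize(raw), raw) for raw in _CANDIDATE.findall(page_text)]
    parsed = [(dt, raw) for dt, raw in parsed if dt is not None]
    if not parsed:
        return None
    # Assumption: the longest raw string carries the most detail.
    return max(parsed, key=lambda item: len(item[1]))[0]


print(release_time("Published: 2022年9月27日 10:30  Updated 2022-09-28"))
```

Once the release time string is chosen, its absolute path would be looked up in the label tree from the matching leaf text; that lookup is not shown here.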
Optionally, determining a universal rule template according to the text rule template, the title rule template and the release time rule template, including:
generating a special rule template corresponding to the preset webpage address information according to the text rule template, the title rule template and the release time rule template, storing the special rule template in a preset database corresponding to the webpage category, counting the number of the special rule templates in the database, judging whether the number of the special rule templates reaches a preset value,
if not, using the current special rule template as a temporary general rule template;
if yes, screening out the universal rule template according to the special rule template in the database.
Optionally, before obtaining the core text information from the web page structure information, the method further includes:
judging whether a universal rule template corresponding to the preset webpage information exists or not,
if so, extracting the core content of the webpage by using a universal rule template;
if not, obtaining the core text information from the webpage structure information.
In a second aspect, the present application provides an intelligent extraction system for core content of a web page, including:
the webpage data acquisition module is used for acquiring corresponding webpage structure information according to preset webpage address information and acquiring core text information from the webpage structure information based on a preset extraction mode;
the rule template generating module is used for respectively generating a text rule template, a title rule template and a release time rule template according to the core text information and the webpage structure information, and then determining a universal rule template according to the text rule template, the title rule template and the release time rule template;
and the core content extraction module is used for extracting the core content of the current webpage based on the universal rule template.
In a third aspect, the present application provides a computer-readable storage medium storing a computer program that can be loaded by a processor and execute the above-mentioned method for intelligently extracting the core content of a web page.
To sum up, the application obtains the core content of a webpage from the webpage address information through a preset algorithm and generates the corresponding rule templates. From these rule templates, a universal rule template can be obtained for the category to which the webpage belongs. Once the universal rule template exists, the core content of webpages of the same type can be extracted directly and quickly with it, so as to realize large-scale webpage data monitoring.
Drawings
Fig. 1 is a flowchart of an intelligent extraction method for core content of a web page according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a process for acquiring core text information from web page structure information according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating the generation of a text rule template provided by an embodiment of the present application;
FIG. 4 is a flow chart illustrating the generation of a release time rule template according to an embodiment of the present application;
fig. 5 is a schematic diagram of an intelligent extraction system for core content of a web page according to an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to figures 1-5.
The application provides an intelligent extraction method of core content of a webpage, and referring to fig. 1, the method comprises the following steps:
S100, acquiring corresponding webpage structure information according to preset webpage address information.
The preset webpage address information comprises the webpage type and the URL of the webpage. Webpage types are subdivided into news, science and education, entertainment, military and the like, and the webpage from which data is to be extracted can be located by its URL. The webpage structure information is equivalent to the webpage source code and comprises elements such as html, head and body, the corresponding text and picture information, CSS styles, JavaScript scripts and the like.
In the embodiment of the application, the webpage structure information is acquired from the webpage address information by using the scrapy-redis framework, a distributed crawler framework generally used for large-scale website data acquisition.
S200, acquiring core text information from the webpage structure information based on a preset extraction mode.
In the embodiment of the application, the preset extraction mode models the webpage based on text density and symbol density to obtain the core text region; the content contained in the core text region is the core text information.
The main component of the core text information is the body text. In a webpage, the text density and the symbol density are largest in the region where the body text is located, whereas other regions, such as the top navigation bar, the side advertisement bar and the bottom information bar, contain little text and almost no text symbols. The core text region can therefore be determined by using the text density and the symbol density as the judgment basis.
In the embodiment of the present application, based on a preset extraction manner, core text information is obtained from web page structure information, which is shown in fig. 2, and the method specifically includes the following steps:
S210, constructing a DOM tree according to the webpage structure information.
In the embodiment of the present application, the region where the body text lies is located through the webpage tags and their positions in the webpage structure, so the webpage structure is parsed first. For example, it can be parsed with the parse function provided by the jsoup library to obtain a document object, i.e., a DOM tree. The DOM tree arranges the elements or tags contained in the current webpage structure, such as <body>, <div>, <p>, <h> and <a>, by hierarchy. Like a tree, it branches from the root node into child nodes, and each child node branches into further child nodes until a node can no longer be divided; such a node is a leaf node. In a webpage, plain text, pictures, hyperlinks and the like are generally leaf nodes.
Before the DOM tree is built, the webpage structure information is preprocessed, mainly by removing JS scripts, CSS styles, iframes and the like, because this information is not needed to build the DOM tree and would introduce interference. In addition, because the content of a webpage is usually inside the body tag, the body tag is taken as the root node of the DOM tree, which simplifies the DOM tree and reduces the subsequent computation for modeling the text density and the symbol density.
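The preprocessing described above can be illustrated with a short sketch. It assumes a regular-expression approach, which is fragile on malformed markup and is not the method named in the source; the function name `preprocess` is hypothetical.

```python
import re

# Paired script/style/iframe elements, matched lazily and case-insensitively.
_NOISE = re.compile(
    r"<(script|style|iframe)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# The body element, which becomes the root of the simplified tree.
_BODY = re.compile(r"<body\b[^>]*>.*?</body\s*>", re.IGNORECASE | re.DOTALL)


def preprocess(html: str) -> str:
    """Drop script, style and iframe elements, then keep only the body element."""
    cleaned = _NOISE.sub("", html)
    match = _BODY.search(cleaned)
    # Fall back to the cleaned markup when no body element is present.
    return match.group(0) if match else cleaned


sample = (
    "<html><head><style>p{}</style></head>"
    "<body><script>var a=1;</script><p>Hi</p><iframe src='x'></iframe></body></html>"
)
print(preprocess(sample))
```

A real pipeline would more likely remove these elements with the HTML parser itself; the regular expressions only keep the sketch self-contained.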
S220, calculating and obtaining the text density and the symbol density of each node in the DOM tree.
In the embodiment of the application, the text density and the symbol density of each node can be calculated by traversing each node in the DOM tree.
In particular, the text density $T_D$ of a node is expressed as: $T_D = (S - S_a)/(L - L_a)$.
Wherein $S$ is the number of plain-text words of the node, $S_a$ is the number of words with a-links in the node, $L$ is the number of tags of the node, and $L_a$ is the number of tags with a-links in the node. From this formula it can be seen that when the number of plain-text words of a node is significantly larger than the number of linked words, the text density of the node is high. An a-link can take the form of text, a webpage URL, a picture and the like; when a-links exist in text form, considering only the amount of text could cause a node containing a large number of a-link tags to be taken as the node where the body text is located. The node density is therefore computed mainly from the plain text: the larger the amount of plain text and the larger its proportion, the higher the probability that the node belongs to the body content.
The symbol density $C_D$ of a node is expressed as: $C_D = M/(L - L_a)$.
Wherein $M$ is the number of text symbols in the text of the node, and $L - L_a$ is the number of plain-text tags of the node. From this formula it can be seen that the more text symbols a node has and the fewer plain-text tags it has, the larger the symbol density of the node.
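As a minimal sketch of the two density formulas, assume each node has already been reduced to five counts. The `NodeStats` class, its field names, the sample numbers, and the zero-denominator guard are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node counts; field names are illustrative only."""
    words: int       # S: number of plain-text words of the node
    link_words: int  # S_a: number of words with a-links
    tags: int        # L: number of tags of the node
    link_tags: int   # L_a: number of tags with a-links
    symbols: int     # M: number of text symbols in the node's text


def text_density(n: NodeStats) -> float:
    """T_D = (S - S_a) / (L - L_a); 0.0 when L - L_a is not positive (assumed guard)."""
    denom = n.tags - n.link_tags
    return (n.words - n.link_words) / denom if denom > 0 else 0.0


def symbol_density(n: NodeStats) -> float:
    """C_D = M / (L - L_a); 0.0 when L - L_a is not positive (assumed guard)."""
    denom = n.tags - n.link_tags
    return n.symbols / denom if denom > 0 else 0.0


# Made-up counts for an article body and a navigation bar.
body = NodeStats(words=600, link_words=20, tags=12, link_tags=2, symbols=45)
nav = NodeStats(words=40, link_words=36, tags=22, link_tags=10, symbols=0)
print(text_density(body), symbol_density(body))  # 58.0 4.5
print(text_density(nav), symbol_density(nav))
```

With these made-up counts the body node scores far higher on both densities than the link-heavy navigation bar, which is the behaviour the formulas are meant to produce.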
S230, calculating scores of all nodes of the DOM tree through a preset algorithm according to the text density and the symbol density.
After the text density and the symbol density of each node in the DOM tree are calculated, some nodes have a higher text density and others a higher symbol density, so a judgment rule is needed to determine which node holds the required core text information.
In one embodiment, the product of the text density and the symbol density is used as the score $W$ of the corresponding node, and the required node is finally determined according to the score, wherein $W$ is expressed as: $W = T_D \cdot C_D$.
In another embodiment, it is considered that under certain conditions the density can deviate strongly when a node has few tags; for example, if the last tag of the webpage contains many words, its text density may be 100/1, where 100 is the number of words and 1 is the number of tags. The number of all tags in the node is therefore added as a further dimension, and the final score $W$ is expressed as: $W = L\,T_D \cdot C_D$.
In the embodiment of the application, different types of webpages have different layouts. When the text density and the symbol density are tested separately, the accuracy of the text density is relatively higher, but the accuracy of both varies with the type of webpage content. The preset algorithm therefore gives different weights to the text density and the symbol density, and the weights can be adjusted for different types of webpages; that is, the final node score $W$ is expressed as: $W = \theta_1 T_D + \theta_2 C_D$.
Wherein $\theta_1$ and $\theta_2$ represent the weights assigned to the text density and the symbol density respectively, and $\theta_1 + \theta_2 = 1$.
S240, sorting all the nodes from high to low according to the corresponding scores, and taking the content contained in the node with the top ranking as core text information.
In the embodiment of the application, after the scores of all nodes in the DOM tree are calculated according to the text density and the symbol density, the scores are sorted, and the text density and the symbol density are comprehensively considered, so that when the score of a node is larger, the probability that the node contains the text content is higher, and the content contained in the node with the highest ranking is used as the core text information.
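The weighted scoring and ranking can be sketched as below. The default weight of 0.7 for the text density, the node labels, and the density values are assumptions chosen for illustration; the source only requires that the two weights sum to 1.

```python
def node_score(text_density: float, symbol_density: float, theta1: float = 0.7) -> float:
    """W = theta1 * T_D + theta2 * C_D, with theta2 = 1 - theta1."""
    if not 0.0 <= theta1 <= 1.0:
        raise ValueError("theta1 must lie in [0, 1]")
    theta2 = 1.0 - theta1
    return theta1 * text_density + theta2 * symbol_density


# Hypothetical (T_D, C_D) pairs keyed by an illustrative node label.
densities = {
    "div.nav": (0.3, 0.0),
    "div.article": (58.0, 4.5),
    "div.footer": (6.0, 0.5),
}

# Sort node labels from highest to lowest score; the first one is the core node.
ranked = sorted(densities, key=lambda name: node_score(*densities[name]), reverse=True)
core_node = ranked[0]
print(ranked)  # ['div.article', 'div.footer', 'div.nav']
```

Tuning `theta1` per webpage category is where the adjustable weights mentioned above would come in.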
S300, respectively generating a text rule template, a title rule template and a release time rule template according to the core text information and the webpage structure information.
The text rule template, the title rule template and the release time template all refer to an xpath rule for extracting webpage data, for example, a text generally exists in a p label in a webpage structure, a title is contained by an h label, that is, the existence form of the data has a certain rule, and corresponding data can be directly extracted according to the rule.
In the embodiment of the application, after the core text information is obtained, a text rule template, a title rule template and a release time rule template are respectively generated according to the core text information and the webpage structure information, and after the templates are obtained, the text, the title and the release time of the webpage can be directly obtained according to the templates.
In the embodiment of the present application, a text rule template is generated according to the core text information and the web page structure information, and referring to fig. 3, the method specifically includes the following steps:
S310, generating a label tree according to the webpage structure information.
The tag tree is similar to the DOM tree in structure, except that the data and data storage forms contained in the nodes are different, the root node of the tag tree starts with an html tag, child nodes are continuously generated to form a branch path, and the final leaf node does not generate a new node any more and is the terminal point.
The contents of all leaf nodes are triples, which are respectively the current node element object, the absolute path and the contained text information, and because some leaf nodes only contain pictures, a labels and the like, the nodes with the text information are certainly leaf nodes, but the leaf nodes do not necessarily all have the text information.
The label tree also has a forward/reverse characteristic. Forward means that the information contained in a leaf node of the label tree can be obtained as long as its path is known; reverse means that the path in the label tree corresponding to a piece of text information can be obtained as long as the text information is known.
In the embodiment of the present application, in order to ensure the uniqueness of each absolute path in the current webpage structure, the absolute path is also specially processed: if tags are siblings at the same level, an index is appended to the tag in the absolute path to keep the rule accurate. For example, for <html><div>content level 1</div><div>content level 2</div></html>, the absolute paths are /html/div[1] and /html/div[2].
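The indexed absolute paths can be illustrated with a small sketch. It assumes well-formed markup so that the standard-library XML parser can be used, returns (path, text) pairs rather than the full triples with element objects, and ignores text that is mixed with child tags; the function name `leaf_paths` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def leaf_paths(markup: str) -> list[tuple[str, str]]:
    """Return (absolute_path, text) for every leaf element of well-formed markup."""
    root = ET.fromstring(markup)
    leaves: list[tuple[str, str]] = []

    def walk(node: ET.Element, path: str) -> None:
        children = list(node)
        if not children:
            leaves.append((path, (node.text or "").strip()))
            return
        totals = Counter(child.tag for child in children)
        seen: Counter = Counter()
        for child in children:
            seen[child.tag] += 1
            # Index a tag only when same-named siblings would otherwise collide.
            suffix = f"[{seen[child.tag]}]" if totals[child.tag] > 1 else ""
            walk(child, f"{path}/{child.tag}{suffix}")

    walk(root, f"/{root.tag}")
    return leaves


print(leaf_paths("<html><div>content level 1</div><div>content level 2</div></html>"))
# [('/html/div[1]', 'content level 1'), ('/html/div[2]', 'content level 2')]
```

Building a dictionary from text to path over these pairs would give the reverse lookup described above.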
S311, corresponding text information is obtained by traversing the leaf nodes of the label tree.
In the embodiment of the application, because various tags are embedded in the text of the web page, the text part of the web page does not exist in only one leaf node but is dispersed in different leaf nodes, so that corresponding text information needs to be acquired by traversing all the leaf nodes of the tag tree, and the purpose of acquiring the core text information is to acquire the text and find a path capable of acquiring complete text information.
S312, judging whether the text information exists in the core text information.
In the embodiment of the present application, the core text information is obtained by modeling the text density and the symbol density, and the content main body is a text. For the text information obtained by traversing all the leaf nodes in the label tree, the core text information can be used as a reference, because if the text information contained in the leaf nodes belongs to the text, the text information is also present in the core text information, and therefore, whether the text information is a part of the text is determined by judging whether the text information contained in the leaf nodes is present in the core text information.
In one embodiment, the core text information is split into multiple pieces of text information by paragraph and by whole sentence. A one-to-many text similarity calculation is then performed between the text information contained in a leaf node and the pieces of core text information, and when the similarity reaches a preset threshold value, the text information is considered to exist in the core text information.
In another embodiment, the text information contained in the leaf node is directly subjected to character string matching in the core text information, and whether the text information exists in the core text information is determined according to the matching degree.
S313, if the text information exists, recording the absolute path of the current node as an effective path.
In this embodiment, if the text information does not exist in the core text information, the absolute path of the corresponding node is recorded as an invalid path; if it does exist, the absolute path of the corresponding node is recorded as an effective path.
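Steps S311 to S313 can be sketched with the two matching variants described above. The function names, the sentence-splitting pattern, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions; the source does not name a specific similarity algorithm.

```python
import re
from difflib import SequenceMatcher


def in_core_exact(text: str, core_text: str) -> bool:
    """Direct string matching: the leaf text must occur verbatim in the core text."""
    return bool(text) and text in core_text


def in_core_fuzzy(text: str, core_text: str, threshold: float = 0.8) -> bool:
    """One-to-many similarity against sentence-level segments of the core text."""
    segments = [s.strip() for s in re.split(r"[\n.!?。！？]+", core_text) if s.strip()]
    return any(SequenceMatcher(None, text, seg).ratio() >= threshold for seg in segments)


core = "The hospital is hiring nurses. Applications close on Friday."
# Hypothetical (absolute path, text) pairs taken from label-tree leaf nodes.
leaves = [
    ("/html/body/div[1]/a", "Home | About | Contact"),
    ("/html/body/div[2]/p[1]", "The hospital is hiring nurses"),
    ("/html/body/div[2]/p[2]", "Applications close on Friday"),
]
effective = [path for path, text in leaves if in_core_exact(text, core)]
print(effective)  # ['/html/body/div[2]/p[1]', '/html/body/div[2]/p[2]']
```

The fuzzy variant tolerates small differences such as whitespace or a trimmed character, at the cost of choosing a threshold.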
S314, combining all the effective paths to obtain the longest common path, namely the rule template of the text.
In the embodiment of the present application, the effective path represents an obtaining path of text part information, that is, a part of text content information can be obtained according to the effective path, and it is now necessary to find a path to obtain all text contents, so that all effective paths are combined to obtain a longest common path, that is, a path to obtain all text contents, that is, a rule template of a text.
In the embodiment of the present application, merging all the effective paths to obtain the longest common path specifically includes the following steps:
S3141, carrying out pairwise similarity calculation on the effective paths, obtaining corresponding similarity path values, and forming a similarity set.
In the embodiment of the present application, since a longest common path needs to be determined according to all effective paths, the common path needs to be found first, and the common path is determined by similarity calculation, and two-by-two similarity calculation is performed on the effective paths, that is, each effective path needs to be subjected to similarity calculation with other effective paths, and after all similarity values are obtained by calculation, a similarity set is formed, where the set includes the effective paths and similarity values related to the effective paths.
S3142, calculating the mode and the mean of the similarity set to obtain the threshold point of the effective paths.
S3143, selecting the effective paths whose similarity path values are greater than or equal to the threshold point as the most similar effective path set.
In this embodiment of the present application, after the similarity set is obtained, the mode and the mean of all similarity values are calculated, where the mode is the similarity value that appears most frequently in the similarity set. If the mean is smaller than the mode, the mode is taken as the threshold point of the effective paths; otherwise the mean is taken as the threshold point. All effective paths whose similarity path values are greater than or equal to the threshold point are then retained as the most similar effective path set.
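Steps S3141 to S3143 can be sketched as follows. The segment-wise `SequenceMatcher` ratio, the rounding to two decimals, the tie-break that takes the largest mode, and the function names are assumptions; the input paths are assumed to be unique.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, multimode


def path_similarity(a: str, b: str) -> float:
    """Similarity of two absolute paths compared segment by segment (assumed measure)."""
    seg_a, seg_b = a.strip("/").split("/"), b.strip("/").split("/")
    return round(SequenceMatcher(None, seg_a, seg_b).ratio(), 2)


def most_similar_paths(paths: list[str]) -> tuple[float, list[str]]:
    """Return the threshold point and the paths taking part in a pair at or above it."""
    if len(paths) < 2:
        return 0.0, list(paths)
    sims = {pair: path_similarity(*pair) for pair in combinations(paths, 2)}
    values = list(sims.values())
    mode_value = max(multimode(values))  # assumed tie-break: largest of several modes
    avg = mean(values)
    # Use the mode when the mean is smaller than it, otherwise the mean.
    threshold = mode_value if avg < mode_value else avg
    kept: list[str] = []
    for (a, b), score in sims.items():
        if score >= threshold:
            for path in (a, b):
                if path not in kept:
                    kept.append(path)
    return threshold, kept


paths = [
    "/html/body/div/p[1]",
    "/html/body/div/p[2]",
    "/html/body/div/p[3]",
    "/html/body/div/p[4]",
    "/html/body/footer/span",
]
print(most_similar_paths(paths))
```

In this example the four paragraph paths agree pairwise at 0.75 while every pair involving the footer path scores 0.5, so the footer path falls below the threshold point and is dropped.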
S3144, according to the most similar effective path set, obtaining the ratio of the maximum same path of the same position node to the total path of the position.
S3145, selecting the nodes whose ratio is larger than the preset threshold value to form an effective public path.
In this embodiment of the present application, because the similarity behind the most similar effective path set is computed over entire paths, the final common path must also take the order of the nodes in the path into account. Two effective paths may be highly similar and differ only in one node in the middle, but in that case only the part of the path before the differing node is valid. Therefore, after the most similar effective path set is obtained, the ratio of the most frequent identical node at each position to the total number of paths at that position is also calculated.
When that ratio is greater than a preset threshold, generally set to 0.9, the node is considered part of an effective public path. For example, suppose the most similar effective path set is ['/html/body/div/div', '/html/body/div/a', '/html/body/a/div']. The ratio of html at position 1 is 1; the ratio of body at position 2 is 1; at position 3 the ratio of div is 2/3 and the ratio of a is 1/3; at position 4 the ratio of div is 2/3 and the ratio of a is 1/3. The tags at the positions whose ratio is larger than 0.9 are selected as the effective public path, so the final longest public path is /html/body.
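The worked example above can be reproduced with a short sketch. It assumes the ratio at each position is taken against the total number of paths in the set and that the scan stops at the first position that fails the threshold; the function name `longest_common_path` is hypothetical.

```python
from collections import Counter


def longest_common_path(paths: list[str], threshold: float = 0.9) -> str:
    """Build the common prefix path from per-position majority ratios."""
    if not paths:
        return ""
    split_paths = [p.strip("/").split("/") for p in paths]
    common: list[str] = []
    for position in range(max(len(s) for s in split_paths)):
        tags = [s[position] for s in split_paths if len(s) > position]
        tag, count = Counter(tags).most_common(1)[0]
        # Ratio of the most frequent tag at this position to all paths (assumption).
        if count / len(split_paths) <= threshold:
            break  # nodes after the first divergent position are not valid
        common.append(tag)
    return "/" + "/".join(common)


similar = ["/html/body/div/div", "/html/body/div/a", "/html/body/a/div"]
print(longest_common_path(similar))  # /html/body
```

Because the scan extends the prefix one position at a time, the returned path is already the longest effective public path, so no separate selection among shorter candidates is needed in this sketch.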
S3146, obtaining the longest public path from all the effective public paths, namely the rule template of the text.
In the embodiment of the present application, after all the valid public paths are obtained, if there is more than one valid public path, for example html/body and html/body/p, the longest public path, that is, html/body/p, is selected as the rule template of the text. The longest common path is the required text rule template because the goal is to obtain only the text information and not other content parts.
In the embodiment of the present application, a title rule template is generated according to the core text information and the webpage structure information, which specifically includes the following steps:
S320, acquiring the text content of the title tag according to the webpage structure information and recording it as an experimental group, and acquiring the text contents of the h tags and of the div tags whose class contains title and recording them as a comparison group.
In the embodiment of the application, analysis of a large number of webpages shows that the title of a webpage generally appears in the title tag, in an h tag, or in a div tag whose class contains title, and that it appears most often in the title tag.
Therefore, according to the webpage structure information, the text content of the title tag is obtained and recorded as the experimental group, and the text contents of the h tags and of the div tags whose class contains title are obtained and recorded as the comparison group.
S321, calculating the longest common substring of the experimental group and the comparison group, comparing the text content lengths of the experimental group and the comparison group that contain the longest common substring, and selecting the group with the longer text content as the title and the rule template of the title.
In the embodiment of the application, webpage titles are generally concise and highly general, and contain keywords, key sentences and similar information. Character string matching is therefore performed on the text contents extracted for the experimental group and the comparison group. If a common substring is obtained, both groups of text content contain the same information, and this common information can be understood as the key information.
In general, the title that appears in the body part of a web page is longer than the other candidate titles. Therefore, the text lengths of all candidate titles containing the common key information are compared, and the group with the longer text content is selected as the title and the rule template of the title.
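Steps S320 and S321 can be illustrated with the following sketch. The function names, the XPath-like rule strings used as dictionary keys, and the `min_common` guard against accidental single-character matches are all assumptions introduced for illustration.

```python
def longest_common_substring(a, b):
    """Classic dynamic-programming longest common substring."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def pick_title(title_tag_text, comparison, min_common=2):
    """comparison maps a hypothetical rule string to the text found there."""
    best_rule, best_text, best_common = None, "", ""
    for rule, text in comparison.items():
        common = longest_common_substring(title_tag_text, text)
        if len(common) >= min_common and len(common) > len(best_common):
            best_rule, best_text, best_common = rule, text, common
    if best_rule is None:
        return None  # No shared key information was found.
    # Keep whichever group has the longer text content.
    if len(title_tag_text) > len(best_text):
        return "//title", title_tag_text
    return best_rule, best_text
```

The function returns both the selected text and the rule that located it, mirroring the idea that the chosen group serves as the title and as the title rule template.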
In the embodiment of the present application, a release time rule template is generated according to the core text information and the webpage structure information, as shown in fig. 4, which specifically includes the following steps:
S330, acquiring all character strings corresponding to dates and times according to the webpage structure information.
Analysis of a large number of webpages of the same type shows that the position of the release time in the webpage structure follows a certain regularity. For example, the release time of a news-type detail page generally lies between the title and the body, and is given in a substantially standardized time form, e.g., 2022-08-23 09. Therefore, the times extracted from the webpage can be standardized and matched against the standard time format, and whether a time is the release time is judged according to the matching degree.
In the embodiment of the present application, all character strings corresponding to dates and times are acquired according to the web page structure information, mainly: the release time contained in meta tags, the content of tags whose class contains date, the time contained in text that mentions the release time/date, and the times appearing in the body text.
S331, sequentially carrying out time standardization processing on all the acquired character strings, and respectively acquiring the corresponding standardized times.
The standardized time format is yyyy-MM-dd hh:mm:ss, and an acquired time character string can be converted into the standardized format directly with a time format conversion method in JavaScript.
In the embodiment of the application, since the web page contains other date and time information besides the release time, all the time character strings are converted into the standard time format to obtain the corresponding standardized times, and the release time is determined according to how well each standardized time matches the standardized time format.
S332, selecting the standardized time with the highest matching degree as the release time through a preset standardized time template.
The preset standardized time template is a priority sequence set according to the standard time format, ordered from the most complete format to the least complete one. For example, yyyy-MM-dd hh:mm:ss has the highest priority, followed by yyyy-MM-dd hh:mm, and so on, and MM-dd has the lowest priority.
In the embodiment of the application, after all time character strings are converted into the standard time format and the corresponding standardized times are obtained, each standardized time is matched against the priority sequence to find its priority, and the standardized time with the highest matching priority is taken as the release time.
S333, reversely calculating the absolute path of the release time by using the label tree according to the text information corresponding to the release time, namely the release time rule template.
In the embodiment of the present application, after the release time is obtained according to the above method, it needs to be converted into a rule template, so that the release time can later be obtained directly from that rule template. According to the text information corresponding to the release time, the corresponding absolute path can be obtained through the label tree, and this path is the release time rule template.
In the embodiment of the present application, it is also considered that a web page may contain other times with second-level precision, for example a rocket launch time or a frame timestamp in a monitoring scene, and that a modification time, which has the same format as the release time, may also appear in the web page.
Therefore, in order to further improve the accuracy of the obtained release time, candidate release times with the same highest priority are further screened according to the node position of their corresponding text information, and a candidate release time whose node lies between the title and the body is given a higher priority.
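The priority matching of S332 and the position-based screening above can be sketched as follows. The pattern list is an assumption: only the first, second and last formats are named in the description, so the intermediate yyyy-MM-dd level is a guess at what "and so on" covers, and the `between_title_and_body` flag is a hypothetical input that the caller would derive from the label tree.

```python
import re

# Priority sequence, highest first. The third entry is an assumed intermediate level.
PRIORITY_PATTERNS = [
    re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$"),  # yyyy-MM-dd hh:mm:ss
    re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}$"),        # yyyy-MM-dd hh:mm
    re.compile(r"^\d{4}-\d{2}-\d{2}$"),                    # yyyy-MM-dd (assumed)
    re.compile(r"^\d{2}-\d{2}$"),                          # MM-dd
]


def priority_level(normalized_time):
    """Return the index of the first matching format, or None if none match."""
    for level, pattern in enumerate(PRIORITY_PATTERNS):
        if pattern.match(normalized_time):
            return level
    return None


def pick_release_time(candidates):
    """candidates: list of (normalized_time, between_title_and_body) tuples."""
    ranked = []
    for order, (text, between) in enumerate(candidates):
        level = priority_level(text)
        if level is not None:
            # Lower tuples win: best format first, then position, then page order.
            ranked.append((level, 0 if between else 1, order, text))
    return min(ranked)[3] if ranked else None
```

When two candidates share the highest-priority format, the one located between the title and the body is preferred; remaining ties fall back to page order, which is a choice made only for this sketch.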
S400, determining a universal rule template according to the text rule template, the title rule template and the release time rule template.
In the embodiment of the application, the obtained text rule template, the title rule template and the release time rule template are obtained according to the current webpage, and the universal rule template is determined according to the current template, so that data extraction can be performed on the webpage data of the same type according to the universal rule template.
In the embodiment of the application, the general rule template is determined according to the text rule template, the title rule template and the release time rule template, and the method specifically comprises the following steps:
S410, generating a special rule template corresponding to the preset webpage address information according to the text rule template, the title rule template and the release time rule template, storing the special rule template into a preset database corresponding to the webpage category, counting the number of the special rule templates in the database, and judging whether the number of the special rule templates reaches a preset value.
S420, if not, using the current special rule template as a temporary general rule template;
S430, if so, screening out the universal rule template according to the special rule templates in the database.
In the embodiment of the application, the text rule template, the title rule template and the release time rule template are combined to form the special rule template of the current webpage. The category of the current webpage can be obtained from the preset webpage address information, and the special rule template is stored in the database corresponding to that category. When the number of special rule templates for one category of webpage reaches a certain number, the universal rule template can be screened out.
Whether the number of special rule templates of the category corresponding to the current webpage in the database reaches a preset value is judged. If it does not reach the preset value, the special rule template of the current webpage is taken as a temporary general rule template.
If the number of special rule templates of the category corresponding to the current webpage reaches the preset value, the mode among the special rule templates of that category is calculated; that is, if several special rule templates in the category are identical, the identical template is taken as the general rule template. The special rule template corresponding to the mode is considered highly universal only if the count of the mode reaches a certain value. A mode threshold can therefore be set: if the count of the mode among the special rule templates of the category reaches the preset mode threshold, the special rule template corresponding to the mode is taken as the general rule template.
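The selection logic of S410 to S430 can be sketched as below. The tuple representation of a special rule template, the parameter names and their default values are assumptions; the fallback to the temporary template when the mode count is below the threshold is also an assumption, since the description only states what happens when the threshold is reached.

```python
from collections import Counter


def select_general_template(stored_templates, current, preset_count=5, mode_threshold=3):
    """Return (template, is_temporary) for one webpage category.

    stored_templates: special rule templates already saved for the category,
    including the current one; each is a hashable tuple such as
    (text_path, title_path, release_time_path).
    """
    if len(stored_templates) < preset_count:
        return current, True  # S420: too few samples, use the current template.
    template, count = Counter(stored_templates).most_common(1)[0]
    if count >= mode_threshold:
        return template, False  # S430: the mode is frequent enough to generalize.
    return current, True  # Assumed fallback while more templates are collected.
```

Representing each special rule template as a tuple of three paths keeps it hashable, so identical templates can be counted directly.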
S500, extracting the core content of the current webpage based on the universal rule template.
In the embodiment of the application, after the special rule template of the current webpage is generated, the core content of the current webpage can be extracted directly according to the special rule template. It may also happen that the corresponding general rule template is generated only once the special rule template of the current webpage has been added, that is, no general rule template existed before the special rule template of the current webpage was generated. In that case, in order to test the newly obtained general rule template, the core content is also extracted with the general rule template.
The extraction result of the special rule template is taken as the reference standard, and the accuracy of the extraction by the general rule template is judged by calculating the text similarity of the core content. If the accuracy reaches a preset standard, the current general rule template is considered effective. If the accuracy does not reach the preset standard, the current general rule template is considered not to meet the standard, and special rule templates continue to be collected in order to regenerate the corresponding general rule template.
Therefore, extracting the core content of the current webpage based on the universal rule template actually proceeds as follows: if the universal rule template is a temporary universal rule template, the core content is extracted according to the temporary universal rule template, namely the special rule template of the current webpage; if the universal rule template is not temporary, the core content is extracted with both the special rule template of the current webpage and the obtained universal rule template.
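The accuracy check described above can be sketched as follows. The description does not name a text similarity measure, so `difflib.SequenceMatcher` from the Python standard library is used here purely as a stand-in; the field names and the 0.9 standard are likewise assumptions.

```python
from difflib import SequenceMatcher


def general_template_is_effective(special_result, general_result, standard=0.9):
    """Compare each extracted field against the special-template reference.

    Both arguments are dicts with the hypothetical keys
    'text', 'title' and 'release_time'.
    """
    for field in ("text", "title", "release_time"):
        reference = special_result.get(field, "")
        candidate = general_result.get(field, "")
        similarity = SequenceMatcher(None, reference, candidate).ratio()
        if similarity < standard:
            return False  # Below the preset standard: keep collecting templates.
    return True
```

Requiring every field to reach the standard is one possible policy; averaging the field similarities would be another.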
In the embodiment of the present application, before obtaining the core text information from the web page structure information, the method further includes the following steps:
S610, judging whether a universal rule template corresponding to the preset webpage information exists.
S620, if yes, extracting the core content of the webpage by using the universal rule template;
S630, if not, acquiring the core text information from the webpage structure information.
In the embodiment of the application, after the text, the title and the release time of the webpage are obtained according to the set algorithm, the corresponding rule templates are generated and stored in the database corresponding to the webpage category, when the rule templates collected by a certain category reach the preset value, the universal rule template is selected, and when the core content of the webpage of the category is extracted again, the universal rule template is directly used for extraction.
Therefore, after the web page structure information is acquired according to the web page address information and before the core text information is acquired from the web page structure information, whether a universal rule template exists in a corresponding database or not is determined according to the category contained in the web page address information, if so, the universal rule template is used for extracting the web page core content, if not, the core text information is acquired from the web page structure information, and the subsequent steps are performed, namely S300-S500.
The embodiment of the present application further provides an intelligent extraction system for core content of a web page, referring to fig. 5, including: a webpage data acquisition module 101, a rule template generation module 102 and a core content extraction module 103.
The webpage data acquisition module 101 is configured to acquire corresponding web page structure information according to preset web page address information, and acquire core text information from the web page structure information based on a preset extraction manner.
The rule template generation module 102 is configured to generate a text rule template, a title rule template, and a release time rule template according to the core text information and the web page structure information, and then determine a general rule template according to the text rule template, the title rule template, and the release time rule template.
The core content extraction module 103 is configured to extract the core content of the current web page based on the universal rule template.
In the embodiment of the application, the webpage data acquisition module is specifically configured to obtain webpage structure data according to the address information of a webpage to be extracted, and obtain core text information from the webpage structure data in a preset extraction manner.
The rule template generation module is specifically configured to acquire the text, the title and the release time of the webpage from the webpage structure data and the core text information through a set algorithm, generate the corresponding rule templates respectively, store the rule templates in the database of the category corresponding to the current webpage, and finally select a universal rule template according to the rule templates in the database.
The core content extraction module is specifically configured to extract the core content of the current webpage according to the obtained universal rule template.
The embodiment of the application also provides a computer readable storage medium, which stores a computer program that can be loaded by a processor to execute the above intelligent extraction method for the core content of the web page.
The above embodiments are preferred embodiments of the present application, and the scope of protection of the present application is not limited by them; therefore, all equivalent changes made according to the principles of the present application should be covered by the protection scope of the present application.

Claims (10)

1. An intelligent extraction method for core content of a webpage is characterized by comprising the following steps:
acquiring corresponding webpage structure information according to preset webpage address information;
acquiring core text information from the webpage structure information based on a preset extraction mode;
respectively generating a text rule template, a title rule template and a release time rule template according to the core text information and the webpage structure information;
determining a universal rule template according to the text rule template, the title rule template and the release time rule template;
and extracting the core content of the current webpage based on the universal rule template.
2. The intelligent extraction method of the core content of the web page as claimed in claim 1, wherein the obtaining of the core text information from the web page structure information based on the preset extraction manner comprises:
constructing a DOM tree according to the webpage structure information;
calculating and obtaining the text density and the symbol density of each node in the DOM tree;
calculating scores of all nodes of the DOM tree through a preset algorithm according to the text density and the symbol density;
and sequencing all the nodes from high to low according to the corresponding scores, and taking the content contained in the node with the top ranking as core text information.
3. The method for intelligently extracting core content of a web page according to claim 1, wherein the generating a text rule template, a title rule template and a release time rule template according to the core text information and the web page structure information respectively comprises:
generating a label tree according to the webpage structure information, wherein the label tree comprises a plurality of leaf nodes, and each leaf node corresponds to an absolute path and text information;
acquiring corresponding text information by traversing leaf nodes of the label tree;
judging whether the text information exists in the core text information;
if it exists, recording the absolute path of the current node as an effective path;
and merging all the effective paths to obtain the longest common path, namely the rule template of the text.
4. The method for intelligently extracting core content of a web page according to claim 3, wherein the merging all valid paths to obtain the longest common path comprises:
carrying out pairwise similarity calculation on the effective paths to obtain corresponding similarity path values and form a similarity set;
calculating and obtaining mode and average number in the similarity set, and obtaining a threshold value point of the effective path;
selecting an effective path corresponding to the similarity path value of the threshold point which is greater than or equal to the effective path as a most similar effective path set;
according to the most similar effective path set, obtaining the ratio of the maximum same path of the same position node to the total path of the position;
selecting the path formed by the nodes whose ratio is larger than a preset threshold value as an effective public path;
and acquiring the longest common path from all the effective common paths, namely the rule template of the text.
5. The method for intelligently extracting the core content of the web page according to claim 1, wherein the step of respectively generating a text rule template, a title rule template and a release time rule template according to the core text information and the web page structure information comprises the following steps:
acquiring text content of the title label according to the webpage structure information, recording the text content as an experimental group,
acquiring text contents of the h label and a class div label containing the title, and recording the text contents as a comparison group;
and calculating the longest common substring for the experimental group and the comparison group, comparing the text content lengths of the experimental group and the comparison group containing the longest common substring, and selecting one group with longer text content as a title and a rule template of the title.
6. The method for intelligently extracting core content of a web page according to claim 3, wherein the generating a text rule template, a title rule template and a release time rule template according to the core text information and the web page structure information respectively comprises:
acquiring all character strings corresponding to date and time according to the webpage structure information;
sequentially carrying out time standardization processing on all the acquired character strings, and respectively acquiring corresponding standardized time;
selecting the standard time with the highest matching degree as the release time through a preset standard time template;
and reversely calculating an absolute path of the release time by using the label tree according to the text information corresponding to the release time, namely a release time rule template.
7. The intelligent extraction method of the core content of the webpage according to claim 1, wherein the determining of the universal rule template according to the text rule template, the title rule template and the release time rule template comprises:
generating a special rule template corresponding to the preset webpage address information according to the text rule template, the title rule template and the release time rule template, storing the special rule template in a preset database corresponding to the webpage category, counting the number of the special rule templates in the database, judging whether the number of the special rule templates reaches a preset value,
if not, using the current special rule template as a temporary general rule template;
if yes, screening out the universal rule template according to the special rule template in the database.
8. The method for intelligently extracting core content of web page as claimed in claim 1, wherein before obtaining core text information from the web page structure information, further comprising:
judging whether a universal rule template corresponding to the preset webpage information exists or not,
if so, extracting the core content of the webpage by using a universal rule template;
if not, obtaining the core text information from the webpage structure information.
9. An intelligent extraction system for core content of a webpage is characterized by comprising:
the webpage data acquisition module (101) is used for acquiring corresponding webpage structure information according to preset webpage address information and acquiring core text information from the webpage structure information based on a preset extraction mode;
the rule template generating module (102) is used for respectively generating a text rule template, a title rule template and a release time rule template according to the core text information and the webpage structure information, and then determining a universal rule template according to the text rule template, the title rule template and the release time rule template;
and the core content extraction module (103) is used for extracting the core content of the current webpage based on the universal rule template.
10. A computer-readable storage medium storing a computer program which can be loaded by a processor and which can execute a method of intelligently extracting core content of a web page according to any one of claims 1 to 8.
CN202211183811.0A 2022-09-27 2022-09-27 Intelligent extraction method and system for core content of webpage Active CN115658993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211183811.0A CN115658993B (en) 2022-09-27 2022-09-27 Intelligent extraction method and system for core content of webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211183811.0A CN115658993B (en) 2022-09-27 2022-09-27 Intelligent extraction method and system for core content of webpage

Publications (2)

Publication Number Publication Date
CN115658993A true CN115658993A (en) 2023-01-31
CN115658993B CN115658993B (en) 2023-06-06

Family

ID=84985984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211183811.0A Active CN115658993B (en) 2022-09-27 2022-09-27 Intelligent extraction method and system for core content of webpage

Country Status (1)

Country Link
CN (1) CN115658993B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056637A (en) * 2023-09-26 2023-11-14 之江实验室 Barrier-free webpage generation method, barrier-free webpage generation device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056637A (en) * 2023-09-26 2023-11-14 之江实验室 Barrier-free webpage generation method, barrier-free webpage generation device, computer equipment and storage medium
CN117056637B (en) * 2023-09-26 2024-03-12 之江实验室 Barrier-free webpage generation method, barrier-free webpage generation device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN115658993B (en) 2023-06-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant