CN110457579B - Webpage denoising method and system based on cooperative work of template and classifier - Google Patents

Webpage denoising method and system based on cooperative work of template and classifier

Info

Publication number
CN110457579B
CN110457579B CN201910694087.XA CN201910694087A CN110457579B CN 110457579 B CN110457579 B CN 110457579B CN 201910694087 A CN201910694087 A CN 201910694087A CN 110457579 B CN110457579 B CN 110457579B
Authority
CN
China
Prior art keywords
node
template
webpage
classifier
node set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910694087.XA
Other languages
Chinese (zh)
Other versions
CN110457579A (en)
Inventor
王运锋
严金承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201910694087.XA
Publication of CN110457579A
Application granted
Publication of CN110457579B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage denoising method and system based on the cooperative work of a template and a classifier. The denoising method comprises the following steps: parsing the acquired original HTML document, deleting irrelevant tag nodes, and generating a simplified DOM tree that meets the requirements; calculating the features of each block-level node in the DOM tree of the target webpage to obtain an original node set; adding the original node set to the cache node set of the corresponding website and, when the number of elements in the cache node set reaches a preset threshold, triggering a template generation algorithm to update the template node set of that website; filtering the original node set of the target webpage with the template node set of the website to which the target webpage belongs to obtain a filtered target webpage node set; and classifying the filtered target webpage node set with a trained classifier, retaining the nodes classified as main content, and extracting the main content text from those nodes. The method requires little manual intervention, is efficient, and is suitable for denoising topic-type webpages of many kinds.

Description

Webpage denoising method and system based on cooperative work of template and classifier
Technical Field
The invention relates to the technical field of webpage denoising, in particular to a webpage denoising method and system based on cooperative work of a template and a classifier.
Background
With the continuous development of internet technology, the amount of information on the internet is growing explosively. The massive body of webpage information is the main embodiment of internet information and a natural data mine for many other research fields, including search engines, public opinion analysis, and natural language processing. Besides its main content, however, a webpage also carries information unrelated to that content, such as commercial advertisements, navigation bars, copyright information, and announcements; this information can be called webpage noise. How to remove the noise from a webpage and extract its main content for analysis and use in these fields therefore has important research significance and practical value.
At present, the main webpage denoising methods include rule-based methods, template-based methods, and visual-content-based methods. A rule-based method presets heuristic rules and screens out the text content that satisfies them. It is only suitable for simple webpages; webpages with complex structures need complex heuristic rules, so the method has limitations. A template-based method denoises quickly, but a template suited to the webpages of a specific website often has to be constructed manually, so the method cannot serve as a general webpage denoiser. In 2010, lilililii et al., in the thesis "Document information extraction method research based on an HTML tree and a template", used webpage similarity calculation to classify webpages and constructed a template for each class. The template relies on the position of the main content: when the main content is spread over multiple Document Object Model (DOM) nodes, the nearest parent node containing the main content is selected as the template, so the extracted main content may contain a large amount of noise and the denoising effect suffers considerably. A visual-content-based method first divides a webpage into blocks, predicts the importance of each block using manual labeling together with a neural network and a support vector machine, and finally selects the block with the highest importance; this method is computationally expensive and inefficient.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a webpage denoising method and system based on the cooperative work of a template and a classifier, which automatically generates a denoising template for preprocessing, uses a cooperating classifier to classify the DOM nodes, and finally extracts the main content. The method requires little manual intervention, is efficient, and is suitable for denoising topic-type webpages of many kinds.
In order to solve the technical problems, the invention adopts the technical scheme that:
A webpage denoising method based on cooperative work of a template and a classifier comprises the following steps:
Step 1: downloading a target webpage and acquiring an original HTML document;
Step 2: parsing the original HTML document, deleting irrelevant tag nodes, correcting the DOM tree, and generating a simplified DOM tree that meets the requirements;
Step 3: calculating the features of each block-level node in the DOM tree of the target webpage to obtain an original node set of the target webpage;
Step 4: generating a template, namely adding the original node set to the cache node set of the corresponding website and, when the number of elements in the cache node set reaches a preset threshold, triggering a template generation algorithm to update the template node set of that website;
Step 5: filtering the original node set of the target webpage with the template node set of the website to which the target webpage belongs, and outputting a filtered target webpage node set;
Step 6: training a classifier, namely labeling some nodes in advance as noise or main content, and training the classifier with the labeled nodes as training samples until the classifier reaches a preset classification performance;
Step 7: classifying the filtered target webpage node set with the trained classifier, retaining the nodes classified as main content, and extracting the main content text from those nodes.
Further, step 1 comprises webpage downloading and webpage discovery: webpage downloading is responsible for downloading the target webpage and storing it in the database, classified by the domain name and address of the target webpage; webpage discovery is responsible for finding new webpage addresses that meet the requirements and adding them to the list to be crawled.
Further, step 2 comprises preprocessing and correction: preprocessing is responsible for deleting tags that contain no text content, including comments, scripts, and styles; correction fixes correctable errors in the DOM tree, including "< >" matching errors and tag-pair matching errors.
Further, in step 3, the node features include: the ratio of the node's text length to the document's text length; the node's text length; the ratio of the length of punctuation marks in the node's text to the node's text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of picture tags in the node to the number of picture tags in the document; the node weight score; the ratio of the length of link text inside the node to the node's text length; and the ratio of the number of link tags plus picture tags inside the node to the node's text length.
Further, in step 6, the classifier model is a Support Vector Machine (SVM) or a Classification And Regression Tree (CART).
A webpage denoising system based on cooperative work of a template and a classifier comprises a webpage crawler module, an HTML preprocessing module, a DOM tree feature vector calculation module, a template generation module, a template preprocessing module, a classifier training module, and a classifier prediction module;
the webpage crawler module is used for downloading a target webpage and acquiring an original HTML document;
the HTML preprocessing module is used for parsing the original HTML document, deleting irrelevant tag nodes, correcting the DOM tree, and generating a simplified DOM tree that meets the requirements;
the DOM tree feature vector calculation module is used for calculating the features of each block-level node in the DOM tree of the target webpage to obtain an original node set of the target webpage;
the template generation module is used for adding the original node set to the cache node set of the corresponding website and, when the number of elements in the cache node set reaches a preset threshold, triggering a template generation algorithm to update the template node set of that website;
the template preprocessing module is used for filtering the original node set of the target webpage with the template node set of the website to which the target webpage belongs and outputting a filtered target webpage node set;
the classifier training module is used for training a classifier, namely labeling some nodes in advance as the noise class or the main content class and training the classifier with the labeled nodes as training samples until the classifier reaches a preset classification performance;
the classifier prediction module is used for classifying the filtered target webpage node set with the trained classifier, retaining the nodes classified as main content, and extracting the main content text from those nodes.
Compared with the prior art, the invention has the following beneficial effects: the template and the classifier work cooperatively, denoising is carried out in two stages, and the denoising effect is good. In the first stage, the common noise information of a target website is identified automatically and used as a template to filter noise from the target webpage; in the second stage, webpage denoising is treated as a classification problem, and the classifier screens out the main content. The first stage is fast and needs no manual intervention, and because it filters out part of the noise, it greatly reduces the processing burden of the second stage. The method is widely applicable and is a general denoising method for topic-type webpages.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the system of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the denoising method of the present invention comprises the following steps:
First, the original HTML document is acquired by web crawling, which comprises webpage downloading and webpage discovery. Webpage downloading is responsible for downloading a target webpage and storing it in a database, classified by the domain name and address of the target webpage; webpage discovery is responsible for finding new webpage addresses that meet the requirements and adding them to the list to be crawled.
Second, the original HTML document is processed, which includes preprocessing and correction. Preprocessing is responsible for deleting tags that contain no text content, such as comments, scripts, and styles; correction fixes correctable errors in the DOM tree, such as "< >" matching errors and tag-pair matching errors. After processing, a simplified DOM tree that meets the requirements is output.
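The preprocessing and correction step can be illustrated with a minimal sketch built on Python's standard `html.parser`. The dict-based tree shape, the `SKIPPED_TAGS` and `VOID_TAGS` lists, and the rule for tolerating mismatched tag pairs (close back to the nearest matching open tag, ignore stray closing tags) are assumptions made for illustration; the source does not specify how the correction is implemented.

```python
from html.parser import HTMLParser

# Tags whose content carries no readable text (assumed list; the source names
# comments, scripts and styles as examples).
SKIPPED_TAGS = {"script", "style"}
# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SimplifyingParser(HTMLParser):
    """Builds a nested dict tree while dropping comments, scripts and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = {"tag": "#root", "children": [], "text": ""}
        self.stack = [self.root]
        self.skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = {"tag": tag, "children": [], "text": ""}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Tolerate mismatched pairs: close back to the nearest matching open
        # tag, and ignore a closing tag that was never opened.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.stack[-1]["text"] += data.strip()

    # handle_comment is inherited as a no-op, so comments are dropped.


def simplify(html):
    """Parse an HTML string and return the simplified tree's root node."""
    parser = SimplifyingParser()
    parser.feed(html)
    parser.close()
    return parser.root
```

In this sketch an unclosed `<b>` inside a `<p>` is closed implicitly when `</p>` arrives, which is one simple way to repair a tag-pair mismatch.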
Third, the features of each block-level node in the DOM tree are calculated and stored in the node structure, and the original node set OriginNodes corresponding to the DOM tree is output. The features are: the ratio of the node's text length to the document's text length; the node's text length; the ratio of the length of punctuation marks in the node's text to the node's text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of picture tags in the node to the number of picture tags in the document; the node weight score; the ratio of the length of link text inside the node to the node's text length; and the ratio of the number of link tags plus picture tags inside the node to the node's text length. When these features are computed, the content of descendant block-level nodes below a block-level node is excluded, and the feature vector of each block-level node is calculated from the top down.
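As a rough illustration of the eight features, the sketch below computes a feature vector from per-node and per-document counts. The `BlockStats` container, the `PUNCTUATION` set, and the zero-denominator convention are hypothetical choices; the source also does not define how the node weight score is obtained, so it is simply passed in by the caller here.

```python
from dataclasses import dataclass

# Assumed punctuation set (ASCII plus common full-width marks).
PUNCTUATION = set(",.;:!?，。；：！？、")


@dataclass
class BlockStats:
    """Counts for one block-level node, excluding descendant block nodes."""
    text: str                   # text owned directly by the block
    link_text: str              # text inside link tags within the block
    link_tags: int              # number of link tags in the block
    image_tags: int             # number of picture tags in the block
    weight_score: float = 0.0   # not defined in the source; supplied by caller


def ratio(numerator, denominator):
    """Safe division: an empty denominator yields 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def feature_vector(node, doc_text_len, doc_link_tags, doc_image_tags):
    """Return the eight features in the order listed in the description."""
    text_len = len(node.text)
    punct_len = sum(1 for ch in node.text if ch in PUNCTUATION)
    return [
        ratio(text_len, doc_text_len),                    # node text / document text
        float(text_len),                                  # node text length
        ratio(punct_len, text_len),                       # punctuation / node text
        ratio(node.link_tags, doc_link_tags),             # node links / document links
        ratio(node.image_tags, doc_image_tags),           # node pictures / document pictures
        node.weight_score,                                # node weight score
        ratio(len(node.link_text), text_len),             # link text / node text
        ratio(node.link_tags + node.image_tags, text_len),  # (links + pictures) / node text
    ]
```

Keeping the features as ratios makes them comparable across pages of very different sizes, which is presumably why most of them are normalized by document or node totals.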
Fourth, a template is generated automatically for the webpages of a given site. A template node set PatternNodes and a cache node set TempNodes are maintained for each website. The original node set OriginNodes of a target webpage is added to the cache node set TempNodes of the corresponding site. Once the number of elements in TempNodes exceeds a set threshold, every node in TempNodes is counted; nodes whose text repeats frequently are generally nodes carrying the website's copyright information, repeated advertisements, and other noise, and they are added to the template node set PatternNodes. This set is the template of the website and records the common noise information of the webpages under that website.
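The cache-and-count idea can be sketched as follows. The threshold values, the choice to identify a node by its exact text, and clearing the cache after each update are assumptions for illustration; the source only states that frequently repeated node texts are promoted to the template once the cache is large enough.

```python
from collections import Counter, defaultdict

CACHE_THRESHOLD = 6   # assumed: cached nodes per site before a template update
MIN_REPEAT = 3        # assumed: repetitions needed to count as shared noise

temp_nodes = defaultdict(list)     # site -> cached node texts (TempNodes)
pattern_nodes = defaultdict(set)   # site -> template node texts (PatternNodes)


def add_page(site, origin_nodes):
    """Cache one page's node texts; refresh the site's template when the cache is full."""
    temp_nodes[site].extend(origin_nodes)
    if len(temp_nodes[site]) >= CACHE_THRESHOLD:
        counts = Counter(temp_nodes[site])
        pattern_nodes[site].update(
            text for text, n in counts.items() if n >= MIN_REPEAT
        )
        # Assumption: the cache is emptied after each template update.
        temp_nodes[site].clear()
```

With these settings, a footer line that appears on three cached pages of the same site ends up in `pattern_nodes[site]`, while article paragraphs that occur once do not.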
Fifth, part of the noise information in the target webpage is filtered out using the template node set PatternNodes, and the filtered target webpage node set PreNodes is output, where PreNodes = OriginNodes - PatternNodes (set difference).
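A minimal sketch of this filtering step, assuming each node is a dict with a `text` key and that a node matches the template when its text is identical to a template entry (the source does not say how node equality is decided):

```python
def filter_nodes(origin_nodes, pattern_nodes):
    """Return PreNodes: page nodes whose text is not in the site's template.

    Document order is preserved so later text extraction stays readable.
    """
    return [node for node in origin_nodes if node["text"] not in pattern_nodes]
```

Storing the template as a set keeps each membership test constant-time on average, which fits the description of the first stage as the fast one.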
Sixth, an SVM or CART classifier is trained. Some nodes are labeled in advance as noise or main content, and the labeled nodes are used as training samples. Training stops when the classifier reaches a preset classification performance, and the trained classifier is output.
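To make the training step concrete without depending on a machine-learning library, the sketch below trains a depth-one classification tree: a single split chosen by Gini impurity, which is the split criterion CART uses. It is a deliberate simplification, not a full CART or SVM implementation, and the function names and the 0/1 label encoding (1 for main content, 0 for noise) are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def train_stump(samples, labels):
    """Pick the (feature, threshold) split with the lowest weighted Gini impurity.

    Returns a function that maps a feature vector to 0 (noise) or 1 (main content).
    """
    best = None
    n = len(samples)
    for f in range(len(samples[0])):
        for threshold in sorted({s[f] for s in samples}):
            left = [y for s, y in zip(samples, labels) if s[f] <= threshold]
            right = [y for s, y in zip(samples, labels) if s[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                # Each side predicts its majority label.
                left_label = int(sum(left) * 2 >= len(left))
                right_label = int(sum(right) * 2 >= len(right))
                best = (score, f, threshold, left_label, right_label)
    if best is None:
        raise ValueError("no valid split: all samples are identical")
    _, f, threshold, left_label, right_label = best
    return lambda x: left_label if x[f] <= threshold else right_label


def accuracy(classify, samples, labels):
    """Fraction of samples the classifier labels correctly."""
    return sum(classify(s) == y for s, y in zip(samples, labels)) / len(labels)
```

The "preset classification performance" in the text could then be checked by comparing `accuracy(...)` on held-out labeled nodes against a chosen target; the target value itself is not given in the source.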
Seventh, the nodes in the filtered target webpage node set PreNodes are classified by the classifier into a noise node set and a main content node set RstNodes, and the text in RstNodes is output.
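The final step reduces to applying the classifier to each remaining node and joining the text of the nodes it accepts. In this sketch the node layout (dicts with `features` and `text`) and the convention that the classifier returns 1 for main content are assumptions:

```python
def extract_main_text(pre_nodes, classify):
    """Split PreNodes into noise and main content (RstNodes) and join the main text."""
    rst_nodes = [node for node in pre_nodes if classify(node["features"]) == 1]
    return "\n".join(node["text"] for node in rst_nodes)
```

Any callable works as `classify`, so a trained model and a hand-written rule can be swapped without changing the extraction code.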
As shown in FIG. 2, the system implementing the method includes: a web crawler module 101, an HTML preprocessing module 102, a DOM tree feature vector calculation module 103, a template generation module 104, a database system 105, a template preprocessing module 106, a classifier training module 107, and a classifier prediction module 108.
Web crawler module 101: responsible for continuously crawling new target webpages that meet the requirements;
HTML preprocessing module 102: connected to module 101; deletes irrelevant tags from the target webpage, corrects erroneous tag pairs, and outputs a simplified DOM tree;
DOM tree feature vector calculation module 103: connected to module 102; calculates the feature vectors of the simplified DOM tree and outputs the original node set OriginNodes of the target webpage;
Template generation module 104: connected to module 103; performs template generation on the original node set OriginNodes to produce the template node set PatternNodes;
Database 105: connected to module 104; persists the generated template node set PatternNodes;
Template preprocessing module 106: connected to module 103 to obtain the original node set OriginNodes generated by module 103, and also connected to database 105 to query the template node set PatternNodes of the website to which the target webpage belongs; outputs the filtered target webpage node set PreNodes;
Classifier training module 107: responsible for training the classifier;
Classifier prediction module 108: connected to module 106 to receive the filtered target webpage node set PreNodes output by module 106, and also connected to module 107 to receive the classifier provided by module 107. The classifier divides the set into a noise set and a main content set, and the main content is output.
The technical effect of the invention is verified by the specific examples below.
Step S201: take a Uniform Resource Locator (URL) from the queue to be crawled, download the webpage, screen the URLs in the webpage that meet the conditions, add them to the queue to be crawled, and return to step S201, so that webpages are acquired without interruption. At the same time, preprocess the webpage by deleting irrelevant tags and correcting erroneous tag pairs. Then parse the webpage into a DOM tree and proceed, in parallel, to step S202;
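The crawl loop of step S201 can be sketched as a queue plus a seen-set. To keep the example free of network code, downloading and link extraction are passed in as the hypothetical callables `fetch` and `extract_links`; the same-host rule stands in for "URLs meeting conditions", and `max_pages` bounds a loop that the description treats as running continuously.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(seed_urls, fetch, extract_links, allowed_host, max_pages=100):
    """Breadth-first crawl restricted to one host.

    fetch(url) returns the page HTML or None on failure;
    extract_links(html) returns the href values found in the page.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for href in extract_links(html):
            link = urljoin(url, href)  # resolve relative links against the page URL
            if urlparse(link).netloc == allowed_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The seen-set is what prevents the loop from re-queuing pages that link back to each other.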
Step S202: for the DOM tree output in step S201, calculate the feature vector of each block-level node from the bottom up. The feature components are: the ratio of the node's text length to the document's text length; the node's text length; the ratio of the length of punctuation marks in the node's text to the node's text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of picture tags in the node to the number of picture tags in the document; the node weight score; the ratio of the length of link text inside the node to the node's text length; and the ratio of the number of link tags plus picture tags inside the node to the node's text length. When these features are computed, the content of descendant block-level nodes is not counted toward a node. Each feature vector is stored in its node, so the whole DOM tree yields an original node set OriginNodes, and the process goes to step S203. If OriginNodes is to be used for classifier training, the process also goes, in parallel, to step S204;
Step S203: add the original node set OriginNodes output in step S202 to the cache maintained for the website to which the target webpage belongs. Once the number of elements in the cache reaches the set threshold, extract the common noise information, add the extraction result to the template node set PatternNodes, and go to step S205; otherwise, go directly to step S205.
Step S204: the original node set OriginNodes is labeled manually and used to train the classifier; training can stop once the classifier performs well enough to meet the system requirements. This step is needed only when the current classifier does not meet the system requirements and a new classifier must be trained. After training is finished, go to step S206 and update the current classifier.
Step S205: filter the original node set OriginNodes with the template node set PatternNodes. This is equivalent to filtering out part of the noise information of the target webpage, namely the common noise information of the target website, including website copyright information, some advertisements, and website page-structure information. The filtered node set is the filtered target webpage node set PreNodes; go to step S206.
Step S206: classify the filtered target webpage node set PreNodes with the current classifier, and output the content of the nodes classified as main content.
In this way, 24,334 topic-type webpages were obtained from websites such as Reference News (Cankao Xiaoxi), People's Daily, Sichuan Daily, West China Metropolis Daily, Tencent News, Sohu News, Sina News, Toutiao, Phoenix (ifeng), Guangming Online, Huanqiu, the Sichuan Provincial People's Government, and municipal people's government sites, and were denoised. A random sample of 2,000 was inspected: the average denoising accuracy was 98.64% and the average recall was 93.46%. The method has been applied in a public opinion analysis system, where it improves the quality of the system's corpus and is of great significance for improving the accuracy of public opinion analysis.
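The source does not state how accuracy and recall were measured. Under the common assumption that they correspond to precision and recall of the extracted main-content items against a hand-labeled reference, a per-page computation could look like the sketch below (the function name and the item-level granularity are hypothetical):

```python
def precision_recall(extracted, reference):
    """Precision and recall of extracted items against a hand-labeled reference."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

Averaging these two values over the inspected sample would give figures of the same kind as those reported above.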

Claims (5)

1. A webpage denoising method based on cooperative work of a template and a classifier is characterized by comprising the following steps:
Step 1: downloading a target webpage and acquiring an original HTML document;
Step 2: parsing the original HTML document, deleting irrelevant tag nodes, correcting the DOM tree, and generating a simplified DOM tree that meets the requirements;
Step 3: calculating the features of each block-level node in the DOM tree of the target webpage to obtain an original node set of the target webpage; the node features include: the ratio of the node's text length to the document's text length; the node's text length; the ratio of the length of punctuation marks in the node's text to the node's text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of picture tags in the node to the number of picture tags in the document; the node weight score; the ratio of the length of link text inside the node to the node's text length; and the ratio of the number of link tags plus picture tags inside the node to the node's text length; when the features are computed, the content of descendant block-level nodes below a block-level node is excluded, and the feature vector of each block-level node is calculated from the top down;
Step 4: generating a template, namely automatically generating a template for the webpages of a given website: maintaining a template node set and a cache node set for each website, adding the original node set to the cache node set of the corresponding website, and, when the number of elements in the cache node set reaches a preset threshold, triggering a template generation algorithm, namely counting each node in the cache node set and adding the nodes whose text repetition frequency exceeds a set value to the template node set of the corresponding website, thereby updating the template node set of that website;
Step 5: filtering the original node set of the target webpage with the template node set of the website to which the target webpage belongs, and outputting a filtered target webpage node set;
Step 6: training a classifier, namely labeling some nodes in advance as noise or main content, adding the labeled nodes to a set, and training the classifier with the nodes in the set as training samples until the classifier reaches a preset classification performance;
Step 7: classifying the nodes in the filtered target webpage node set with the trained classifier, retaining the nodes classified as main content, and extracting the main content text from those nodes.
2. The webpage denoising method based on cooperative work of a template and a classifier as claimed in claim 1, wherein step 1 comprises webpage downloading and webpage discovery: webpage downloading is responsible for downloading the target webpage and storing it in the database, classified by the domain name and address of the target webpage; webpage discovery is responsible for finding new webpage addresses that meet the requirements and adding them to the list to be crawled.
3. The webpage denoising method based on cooperative work of a template and a classifier as claimed in claim 1, wherein step 2 comprises preprocessing and correction: preprocessing is responsible for deleting tags that contain no text content, including comments, scripts, and styles; correction fixes correctable errors in the DOM tree, including "< >" matching errors and tag-pair matching errors.
4. The method for denoising web pages based on the cooperation of the template and the classifier as claimed in claim 1, wherein in step 6, the classifier model adopted by the classifier is a support vector machine or a classification regression tree.
5. A webpage denoising system based on cooperative work of a template and a classifier is characterized by comprising a webpage crawler module, an HTML preprocessing module, a DOM tree feature vector calculation module, a template generation module, a template preprocessing module, a classifier training module and a classifier prediction module;
the webpage crawler module is used for downloading a target webpage and acquiring an original HTML document;
the HTML preprocessing module is used for analyzing an original HTML document, deleting irrelevant tag nodes, correcting the DOM tree and generating a simplified DOM tree meeting the requirement;
the DOM tree feature vector calculation module is used for calculating the feature of each block level node in the DOM tree of the target webpage to obtain an original node set of the target webpage;
the node features include: the node text content length-to-document text content length ratio, the node text content length, the node text content punctuation mark length-to-node text content length ratio, the node link label number-to-document link label number ratio, the node picture label number-to-document picture label number ratio, the node weight fraction, the node internal link character-to-text content length ratio, the node internal link label number plus picture label number-to-node text content length ratio; when the characteristics are calculated in a statistical mode, the content of sub-block level nodes below the block level node is excluded, and the characteristic vector of each block level node is calculated from the top to the bottom;
the template generation module is used for adding the original node set into a cache node set of a corresponding website, when the number of elements in the cache node set reaches a preset threshold value, a template generation algorithm is triggered, namely counting each node in the cache node set, and adding the node with the text repetition frequency exceeding a set value into the template node set of the corresponding website so as to update the template node set of the corresponding website;
the template preprocessing module is used for filtering an original node set of a target webpage by using a template node set of a website to which the target webpage belongs and outputting a filtered target webpage node set;
the classifier training module is used for training a classifier, namely, some nodes are marked as a noise class and a main body class in advance, and the marked nodes are used as training samples to train the classifier until the classifier achieves a preset classification effect;
the classifier prediction module is used for classifying the nodes in the filtered target webpage node set by using the trained classifier, reserving the nodes with classification results as main content, and extracting a main content text from the nodes.
CN201910694087.XA 2019-07-30 2019-07-30 Webpage denoising method and system based on cooperative work of template and classifier Active CN110457579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910694087.XA CN110457579B (en) 2019-07-30 2019-07-30 Webpage denoising method and system based on cooperative work of template and classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910694087.XA CN110457579B (en) 2019-07-30 2019-07-30 Webpage denoising method and system based on cooperative work of template and classifier

Publications (2)

Publication Number Publication Date
CN110457579A (en) 2019-11-15
CN110457579B (en) 2022-03-22

Family

ID=68483966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910694087.XA Active CN110457579B (en) 2019-07-30 2019-07-30 Webpage denoising method and system based on cooperative work of template and classifier

Country Status (1)

Country Link
CN (1) CN110457579B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851606A (en) * 2019-11-18 2020-02-28 杭州安恒信息技术股份有限公司 Website clustering method and system based on webpage structure similarity
CN112199613B (en) * 2020-10-13 2023-03-03 北京理工大学 Product URL automatic positioning method integrating DOM topology and text attributes
CN112347353B (en) * 2020-11-06 2024-05-24 同方知网(北京)技术有限公司 Method for denoising webpage
CN112528205B (en) * 2020-12-22 2021-10-29 中科院计算技术研究所大数据研究院 Webpage main body information extraction method and device and storage medium
CN113254751B (en) * 2021-06-24 2021-09-21 北森云计算有限公司 Method, equipment and storage medium for accurately extracting complex webpage structured information

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7765236B2 (en) * 2007-08-31 2010-07-27 Microsoft Corporation Extracting data content items using template matching
CN101197849B (en) * 2007-12-21 2012-10-03 腾讯科技(深圳)有限公司 Method for commuting internet page into wireless application protocol page
US20100169311A1 (en) * 2008-12-30 2010-07-01 Ashwin Tengli Approaches for the unsupervised creation of structural templates for electronic documents
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure
CN103744981B (en) * 2014-01-14 2017-02-15 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content
CN103838823B (en) * 2014-01-22 2017-02-22 浙江大学 Website content accessible detection method based on web page templates
CN107577783A (en) * 2017-09-15 2018-01-12 电子科技大学 The type of webpage automatic identifying method excavated based on Web architectural features

Also Published As

Publication number Publication date
CN110457579A (en) 2019-11-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant