CN105786972A - Webpage template generation method and device - Google Patents

Webpage template generation method and device

Info

Publication number
CN105786972A
Authority
CN
China
Prior art keywords
webpage
web page
template
eigenvalue
cutting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610074217.6A
Other languages
Chinese (zh)
Inventor
郑清芳
章动
鲍东山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Nufront Mobile Multimedia Technology Co Ltd
Original Assignee
Beijing Nufront Mobile Multimedia Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Nufront Mobile Multimedia Technology Co Ltd filed Critical Beijing Nufront Mobile Multimedia Technology Co Ltd
Priority to CN201610074217.6A priority Critical patent/CN105786972A/en
Publication of CN105786972A publication Critical patent/CN105786972A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986Document structures and storage, e.g. HTML extensions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage template generation method. The method comprises the following steps: acquiring a predetermined number of webpages under the same directory as a webpage address; dividing each webpage into blocks and computing a feature value for each block; counting the computed feature values; and saving each feature value whose number of occurrences exceeds a preset threshold in a feature value library as a feature value of the template part. The invention further provides a corresponding device. With this method, an adaptive webpage template can be generated from known webpages, and the generated template reflects the webpage content well. When the template is used for webpage parsing, only the real content of a webpage is parsed, which reduces interference from junk information, improves the accuracy and precision of webpage parsing, and markedly improves its results.

Description

Web page template generation method and device
This application is a divisional application of Chinese invention patent application No. 201010259001.X, filed on August 20, 2010 and entitled "Method and device for analyzing internet web page contents".
Technical field
The present invention relates to the field of communication and Internet technology, and in particular to a web page template generation method and device.
Background
In recent years, as network access has become widespread, bandwidth has increased, and service models have matured, search engines have become a mainstream Internet application. Technically, an Internet search engine generally consists of two parts: an offline processing part and an online processing part. The offline part mainly comprises functional modules such as web page crawling, web page parsing, and indexing. The online part takes the query terms submitted by a user, looks up the corresponding documents (i.e. web pages) in the index and data produced by the offline modules, ranks the retrieved documents according to certain criteria, and finally returns the ranked results to the user.
Throughout the operation of a search engine, web page parsing plays a fundamental and pivotal role: it determines which data and content are used to build the index and can therefore eventually be found by users. For technical and business reasons, the content of a web page today is very complex. Besides the content the page is actually meant to convey, it is mixed with a great deal of irrelevant information, such as advertisements and recommendations. Because the accuracy of web page parsing largely determines the end-user experience of a search engine service, many methods have been proposed to improve it. These methods fall into two categories:
The first works on the page as a character stream: based on each tag and its position in the page, it gathers statistics on the features of each part and uses those features to identify the title, the body text, and the other parts of the page.
The second is based on the Document Object Model (DOM) tree. A DOM tree is first built from the original web page, and the attributes of the tree's nodes are then compared to determine the content of the page.
Both kinds of method essentially apply a set of rules drawn up in advance to select certain parts of a web page. However, web page layouts vary endlessly and cannot be enumerated. These methods therefore adapt poorly: a given rule set may suit some pages but not others, so that in practice the parsing result either still contains junk information or loses genuinely useful information.
Summary of the invention
In view of this, the present invention provides a web page template generation method and device that can generate the best-fitting template for parsing web pages.
An embodiment of the present invention provides a web page template generation method, comprising the following steps:
Obtaining a predetermined number of web pages under the same directory as a web page address;
Splitting each web page into several blocks and calculating a feature value for each block;
Counting the calculated feature values;
Saving each feature value whose number of occurrences exceeds a preset threshold in a feature value library, as a feature value of the template part.
In some optional embodiments, when a web page is split into blocks, Document Object Model (DOM) nodes are used as the split points.
In some optional embodiments, when a web page is split into blocks, the content of each block is no shorter than 20 bytes.
In some optional embodiments, the feature value of each block is calculated by applying a hash operation to the content of the block.
An embodiment of the present invention also provides a web page template generation device, comprising:
An acquisition module, configured to obtain a predetermined number of web pages under the same directory as a web page address;
A computing module, configured to split each web page into several blocks and calculate a feature value for each block;
A statistics module, configured to count the calculated feature values;
A generation module, configured to save each feature value whose number of occurrences exceeds a preset threshold in a feature value library, as a feature value of the template part.
In some optional embodiments, the computing module is specifically configured to use Document Object Model (DOM) nodes as the split points when splitting a web page into blocks.
In some optional embodiments, the computing module is specifically configured to split a web page into blocks such that the content of each block is no shorter than 20 bytes.
In some optional embodiments, the computing module is specifically configured as follows:
The feature value of each block is calculated by applying a hash operation to the content of the block.
The present invention provides a web page template generation method. A predetermined number of web pages under the same directory as a web page address are obtained; then, based on the statistics of the feature values of each page's blocks, the feature values whose number of occurrences exceeds a preset threshold are taken as the feature values of the template part, so that the best-fitting template is generated for parsing web pages. The present invention overcomes the shortcomings of current methods. The generated template better meets user needs, and when it is used during web page parsing, only the real content of a web page is parsed, which reduces interference from junk information, improves the accuracy and precision of web page parsing, and greatly improves its results.
Brief description of the drawings
Fig. 1 is a flow chart of the method for analyzing internet web page contents provided in an embodiment of the present invention;
Fig. 2 is a flow chart of the web page template generation method provided in an embodiment of the present invention;
Fig. 3 is a detailed flow chart of generating a new template in an embodiment of the present invention;
Fig. 4 is a schematic diagram of an internet web page content parsing device in an embodiment of the present invention.
Detailed description
To address the defects of the prior art, the present invention provides a method for analyzing internet web page contents. It can parse and process web pages in a targeted way for each website, and even for the different channel pages of each website. It can automatically determine whether a web page was generated from a template and automatically generate the template corresponding to the page, so that the best-fitting template is used to parse the page. The present invention overcomes the shortcomings of current methods: only the real content of a web page is parsed, which reduces interference from junk information, improves the accuracy and precision of web page parsing, and greatly improves its results.
With reference to Fig. 1, a method for analyzing internet web page contents provided by an embodiment of the present invention comprises the following steps:
S11: judge whether the web page to be parsed was generated from a template. If it was not, go to step S12; otherwise, go to step S13.
S12: parse the web page in the default manner.
S13: query the web page template library for a template matching the web page to be parsed.
If a matching template already exists in the web page template library, perform step S15 and use that template to parse the content of the web page; otherwise, perform step S14.
S14: generate a web page template corresponding to the web page to be parsed, and add the generated template to the web page template library.
S15: use the template corresponding to the web page to be parsed to parse the content of the web page.
When a new template has just been generated, the web page is parsed with that newly generated template.
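The control flow of steps S11 to S15 can be sketched as follows. This is only an illustrative outline, not the patent's implementation: the function name `parse_page`, the injected callables, and the choice to model a template as a set of feature values are all assumptions made for the example. Falling back to default parsing when template generation fails is likewise an assumption.

```python
from typing import Callable, Dict, Optional, Set

Template = Set[str]  # assumption: a template is modelled as a set of feature values


def parse_page(
    url: str,
    html: str,
    library: Dict[str, Template],
    is_template_generated: Callable[[str], bool],
    directory_key: Callable[[str], str],
    generate_template: Callable[[str], Optional[Template]],
    parse_default: Callable[[str], str],
    parse_with_template: Callable[[str, Template], str],
) -> str:
    """Dispatch one page through steps S11 to S15 (hypothetical helper)."""
    # S11: decide from the URL whether the page was generated from a template.
    if not is_template_generated(url):
        return parse_default(html)              # S12: default parsing
    key = directory_key(url)
    template = library.get(key)                 # S13: look up a matching template
    if template is None:
        template = generate_template(key)       # S14: generate a new template
        if template is None:
            # Assumption: if generation fails, fall back to default parsing.
            return parse_default(html)
        library[key] = template                 # S14: add it to the library
    return parse_with_template(html, template)  # S15: parse using the template
```

Passing the individual steps in as callables keeps the sketch independent of any particular splitting or hashing scheme.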
In step S11, the web page template library is built in advance and is initialized before the first query.
Whether the web page to be parsed was generated from a template is judged by examining its Uniform Resource Locator (URL), specifically:
By judging according to the rules by which the URL was generated; or
By checking whether the URL contains a directory marker.
In step S13, querying the template library for a template matching the web page specifically comprises:
Obtaining the character string that indicates the directory in the URL of the web page;
Querying the template library with that character string.
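The directory lookup in step S13 can be illustrated with a short sketch. It assumes, as the embodiment later in this description does, that the directory string is the part of the URL before the last "/". The names `directory_key` and `find_template` and the dictionary-based library are hypothetical.

```python
from typing import Dict, Optional, Set


def directory_key(url: str) -> str:
    """Return the part of the URL before its last '/'.

    Only meaningful for URLs already judged to contain a directory marker.
    """
    return url.rsplit("/", 1)[0]


def find_template(url: str, library: Dict[str, Set[str]]) -> Optional[Set[str]]:
    """Look up the template stored under the URL's directory string, if any."""
    return library.get(directory_key(url))
```

For example, `directory_key("http://news.sina.com.cn/h/2010-07-15/141820685517.shtml")` yields `"http://news.sina.com.cn/h/2010-07-15"`.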
In step S15, the template corresponding to the web page to be parsed is used to parse the content of the web page, as follows:
The web page is split into blocks, and a feature value is calculated for each block;
Each feature value is looked up in the template corresponding to the web page;
If the feature value already exists in the template, the block corresponding to it does not need to be parsed;
If the feature value does not exist in the template, the block corresponding to it is parsed in the default manner.
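A minimal sketch of this filtering step is shown below. It assumes the page has already been split into blocks, that a template is a set of feature values, and that MD5 (the hash used in the embodiment described later) produces the feature value. The function names are illustrative only.

```python
import hashlib
from typing import Iterable, List, Set


def feature_value(block: str) -> str:
    """Feature value of a block: hex MD5 digest of its UTF-8 bytes (assumed encoding)."""
    return hashlib.md5(block.encode("utf-8")).hexdigest()


def content_blocks(blocks: Iterable[str], template: Set[str]) -> List[str]:
    """Keep only the blocks whose feature value is absent from the template.

    Blocks found in the template are treated as template parts and skipped;
    the returned blocks are the ones to hand to the default parser.
    """
    return [b for b in blocks if feature_value(b) not in template]
```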
The page splitting method used when generating a web page template is the same as the one used when parsing web page content with a template.
In step S14, generating the web page template corresponding to the web page to be parsed specifically comprises:
(a) obtaining other web pages under the same directory as the address of the web page to be parsed, until the number of pages selected reaches the required predetermined threshold;
(b) splitting each selected page under the directory into blocks and generating a feature value for each block, so that each page corresponds to multiple feature values;
(c) counting all feature values of all pages under the directory, obtaining those whose frequency of occurrence is higher than a threshold, and saving them in the template library.
In step S14, adding the generated web page template to the web page template library comprises:
Obtaining the character string that indicates the directory in the URL of the web page;
Adding that character string, together with all feature values under the directory whose frequency of occurrence is higher than the preset threshold, to the template library as a key-value pair.
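The key-value storage described above might look like the following sketch, where the directory string is the key and the set of frequent feature values is the value. An in-memory dictionary stands in for the template library; the patent does not specify the storage backend, so this is an assumption, as is the name `add_template`.

```python
from typing import Dict, Iterable, Set

TemplateLibrary = Dict[str, Set[str]]  # directory string -> frequent feature values


def add_template(
    library: TemplateLibrary, directory: str, frequent_values: Iterable[str]
) -> None:
    """Store the frequent feature values under the directory string (key-value)."""
    library.setdefault(directory, set()).update(frequent_values)
```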
With reference to Fig. 2, an embodiment of the present invention also provides a web page template generation method, comprising the following steps:
S21: obtain a predetermined number of web pages under the same directory as a web page address.
S22: split each web page into several blocks and calculate a feature value for each block.
When a web page is split into blocks, Document Object Model (DOM) nodes are used as the split points.
When a web page is split into blocks, the content of each block is no shorter than 20 bytes.
The feature value of each block is calculated by applying a hash operation to the content of the block.
S23: count the calculated feature values.
S24: save each feature value whose number of occurrences exceeds a preset threshold in a feature value library, as a feature value of the template part.
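Steps S21 to S24 can be tied together in a short sketch. It assumes the pages have already been fetched, takes the splitting function as a parameter, uses MD5 as the hash, and counts each feature value at most once per page. The once-per-page counting and the name `generate_template` are assumptions of this example rather than details stated in the description.

```python
import hashlib
from collections import Counter
from typing import Callable, Iterable, List, Set


def generate_template(
    pages: Iterable[str],
    split: Callable[[str], List[str]],
    threshold: int,
) -> Set[str]:
    """Return the feature values that occur in more than `threshold` pages."""
    counts: Counter = Counter()
    for html in pages:  # S21: pages under the same directory, already fetched
        # S22: split the page and hash each block
        values = {hashlib.md5(b.encode("utf-8")).hexdigest() for b in split(html)}
        # S23: tally, counting each value once per page (assumption)
        counts.update(values)
    # S24: keep the values whose count exceeds the preset threshold
    return {value for value, n in counts.items() if n > threshold}
```

With three pages that share one navigation block and a threshold of 2, only the feature value of the shared block ends up in the template.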
To make the principles, features, and advantages of the present invention clearer, a specific embodiment is described below.
In this embodiment, suppose the web page to be parsed is http://news.sina.com.cn. The URL and the corresponding original page are fed into the system for processing. Assume that the template library initially contains no templates (i.e. at the start, no template has been generated). First, the system judges from the Uniform Resource Locator (URL) whether the page was generated from a template. A URL, also called a web page address, is the standard address of a resource on the Internet. According to the rules by which the URL was generated, it can be determined that this URL is the news channel page of sina.com.cn and therefore was not generated from a template. In this case, processing falls back to the method that uses no template. Alternatively, another criterion leads to the same conclusion: this URL contains no "/", i.e. no directory marker, so it is considered not to belong to any directory and hence not to be generated from a template. Again, the system returns directly and parses the page in the general manner.
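The second criterion, checking for a directory marker, can be sketched with the standard library's URL parser. Treating an empty path or a bare "/" as "no directory" is an assumption of this example, and `has_directory_marker` is a hypothetical name. A real system would likely combine this check with site-specific URL rules.

```python
from urllib.parse import urlparse


def has_directory_marker(url: str) -> bool:
    """Return True if the URL has a path beyond the host, i.e. a '/' directory marker.

    Assumption: an empty path or a bare trailing '/' counts as no directory.
    """
    return urlparse(url).path not in ("", "/")
```

Under this check, `http://news.sina.com.cn` is treated as not template-generated, while `http://news.sina.com.cn/h/2010-07-15/141820685517.shtml` is passed on to the template lookup.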
For the web page http://news.sina.com.cn/h/2010-07-15/141820685517.shtml, on the other hand, the URL generation rules make it easy to determine that its directory is "http://news.sina.com.cn/h/2010-07-15", i.e. the part before the last "/". This character string is used to query the template library. Because no template has yet been generated in the library, the string has no corresponding template. In this case the template generation module is called to generate a new template:
As shown in Fig. 3, the specific flow for generating a new template in this embodiment is as follows:
S31: obtain other web pages under the same directory, such as http://news.sina.com.cn/h/2010-07-15/075320682851.shtml. The number of pages must exceed the minimum number of pages needed to generate a template; if it does not, generation fails and the flow returns.
S32: split all the pages obtained under this directory into blocks and generate a feature value (MD5 value) for each block, so that each page corresponds to multiple feature values (MD5 values).
S33: count all feature values of all pages under the directory, and obtain those whose frequency of occurrence is higher than the threshold.
S34: add the directory string, together with the feature values found in S33 whose frequency of occurrence is higher than the threshold, to the existing template library. The parsing template corresponding to the web page to be parsed has now been generated.
In step S31, from a known URL such as http://news.sina.com.cn/h/2010-07-15/075320682851.shtml, it can be determined that the page's directory is http://news.sina.com.cn/h/2010-07-15. Traversing this directory yields the other web pages under it.
In step S32, pages are split into blocks and block feature values are generated as follows. Web page code generally complies with the HTML standard and corresponds to a DOM model, which consists of a number of content nodes.
Natural nodes serve as split points: a page should generally be split naturally at tags such as tr, td, and div. The length of a block's content is generally kept no shorter than 20 bytes.
In the actual splitting, the page is scanned from its first character for the configured nodes (for example td, tr, div). When such a node is encountered, its position is set as the starting position of a block. The next position is then found in the same way. If the distance between the two adjacent positions is greater than the configured minimum length (20 here), the part between the two positions is treated as one block, and a fingerprint is generated for that block. The end position of this block is also the starting position of the next block. If the distance between adjacent positions is less than the minimum length, the scan continues to the next node (the intermediate node is treated as invalid) until it finds a node whose distance from the node at the start of the block is greater than the minimum length (or reaches the end of the page).
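The scanning procedure just described can be sketched as below. Several details are assumptions made for the example: only opening tr, td, and div tags are treated as boundaries, length is measured in characters rather than bytes, a block is accepted when it is at least the minimum length (matching "no shorter than 20"), and a trailing fragment shorter than the minimum is dropped.

```python
import re
from typing import List

# Assumption: opening <tr>, <td> and <div> tags mark candidate block boundaries.
BOUNDARY = re.compile(r"<(?:tr|td|div)\b", re.IGNORECASE)
MIN_BLOCK_LENGTH = 20  # minimum block length used in the embodiment


def split_into_blocks(html: str, min_length: int = MIN_BLOCK_LENGTH) -> List[str]:
    """Split a page at boundary tags, skipping boundaries that are too close."""
    positions = [m.start() for m in BOUNDARY.finditer(html)]
    if not positions:
        return []
    positions.append(len(html))  # the end of the page closes the last block
    blocks: List[str] = []
    start = positions[0]
    for pos in positions[1:]:
        if pos - start >= min_length:
            blocks.append(html[start:pos])
            start = pos  # the end of this block is the start of the next
        # Otherwise the boundary is ignored and the current block keeps growing.
    return blocks
```

In the sketch, a short `<td>x</td>` cell that follows a long block does not become a block of its own; it is merged with the content up to the next boundary that is far enough away.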
As for generating the feature value itself: to ensure that different blocks have different feature values, a relatively reliable hash algorithm is generally chosen, for example MD5.
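Computing an MD5 feature value for a block takes a few lines with Python's standard `hashlib` module. The function name and the choice of UTF-8 encoding are assumptions of this sketch.

```python
import hashlib


def block_feature_value(block: str) -> str:
    """Return the MD5 digest of a block's UTF-8 bytes as 32 hexadecimal characters."""
    return hashlib.md5(block.encode("utf-8")).hexdigest()
```

Identical blocks always map to the same value, which is what lets repeated template parts be recognised across pages. Any single-character difference, including whitespace, produces a different value.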
In step S33, the number of web pages under the directory is counted first, and then the feature values of all blocks of all pages under the directory are tallied. If the number of occurrences of a feature value exceeds the preset threshold, the block corresponding to that feature value appears in many pages, so its content has little value: it is likely to be advertising, navigation information, or the like. All feature values whose frequency of occurrence exceeds the threshold are stored in the template library.
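Because step S33 counts the pages before tallying feature values, one plausible reading is that the threshold is relative to the number of pages. The sketch below takes that reading; the parameter `min_ratio`, its default of 0.5, and the function name are all assumptions, since the description does not give a concrete threshold.

```python
from collections import Counter
from typing import List, Set


def frequent_feature_values(
    page_values: List[Set[str]], min_ratio: float = 0.5
) -> Set[str]:
    """Return feature values present in more than `min_ratio` of the pages.

    `page_values` holds one set of feature values per page under the directory.
    """
    total_pages = len(page_values)  # first count the pages under the directory
    counts = Counter(v for values in page_values for v in values)
    return {v for v, n in counts.items() if n / total_pages > min_ratio}
```

With four pages of which three share a navigation block, that block's feature value has a ratio of 0.75 and is kept, while values seen on a single page are not.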
If another web page under the same directory is encountered later, such as http://news.sina.com.cn/h/2010-07-15/075320682851.shtml, the directory of this URL, http://news.sina.com.cn/h/2010-07-15, is obtained in the same way and is used to query the template library. Because a template corresponding to this string now exists, it is found in the template library. The content of the page http://news.sina.com.cn/h/2010-07-15/075320682851.shtml is then split into blocks, and an MD5 value is generated for each block. Each MD5 value is looked up in the template corresponding to the directory string, i.e. in its sequence of feature values. If the MD5 value exists in the template, the block is a valueless block and is not parsed. If the MD5 value cannot be found, the block is a meaningful part of the page and is parsed in the default manner.
With reference to Fig. 4, an embodiment of the present invention also provides an internet web page content parsing device 40, comprising the following modules:
A judging module 41, configured to judge whether the web page to be parsed was generated from a template;
A storage module 42, configured to store the web page template library;
A first query module 43, configured to query whether a template corresponding to the web page to be parsed exists in the web page template library;
A second query module 44, configured to query whether a given feature value exists in the template corresponding to the web page to be parsed;
A generation module 45, configured to generate the template corresponding to the web page to be parsed;
A first parsing module 46, configured to parse the web page to be parsed in the default manner;
A second parsing module 47, configured to parse a given block of the web page to be parsed in the default manner;
A preset module 48, configured to set the specific parsing manner of the first parsing module 46 and the second parsing module 47.
The workflow of this device is essentially the same as that of the method described above and is not repeated here.
An embodiment of the present invention also provides an internet web page content parsing device, comprising the following modules:
A judging module, configured to judge whether the web page to be parsed was generated from a template;
A query module, configured to query, if the web page was generated from a template, whether a template matching the web page to be parsed exists in the web page template library;
A generation module, configured to generate, if no template matching the web page to be parsed exists in the web page template library, a web page template corresponding to the web page to be parsed, and to add the generated template to the web page template library;
A parsing module, configured to parse the content of the web page with the matching template if one already exists in the web page template library, and otherwise to parse the web page with the template generated by the generation module.
An embodiment of the present invention also provides a web page template generation device, comprising:
An acquisition module, configured to obtain a predetermined number of web pages under the same directory as a web page address.
A computing module, configured to split each web page into several blocks and calculate a feature value for each block.
A statistics module, configured to count the calculated feature values.
A generation module, configured to save each feature value whose number of occurrences exceeds a preset threshold in a feature value library, as a feature value of the template part.
The computing module is specifically configured to use Document Object Model (DOM) nodes as the split points when splitting a web page into blocks.
The computing module is specifically configured to split a web page into blocks such that the content of each block is no shorter than 20 bytes.
The computing module is specifically configured to calculate the feature value of each block by applying a hash operation to the content of the block.
In summary, the present invention provides a method for analyzing internet web page contents. When the web page to be parsed was generated from a template, and a matching template already exists in the web page template library, that template is used to parse the content of the page. Otherwise, a web page template corresponding to the page is generated, added to the web page template library, and used to parse the page. According to the present invention, web pages can be parsed and processed in a targeted way for each website, and even for the different channel pages of each website. Whether a web page was generated from a template can be determined automatically, and the template corresponding to the page can be generated automatically, so that the best-fitting template is used to parse the page. The present invention overcomes the shortcomings of current methods: only the real content of a web page is parsed, which reduces interference from junk information, improves the accuracy and precision of web page parsing, and greatly improves its results.
The disclosed embodiments enable those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined here may be applied to other embodiments without departing from the scope and spirit of the present invention. The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (8)

1. A web page template generation method, characterized by comprising the following steps:
Obtaining a predetermined number of web pages under the same directory as a web page address;
Splitting each web page into several blocks and calculating a feature value for each block;
Counting the calculated feature values;
Saving each feature value whose number of occurrences exceeds a preset threshold in a feature value library, as a feature value of the template part.
2. The method of claim 1, characterized in that, when a web page is split into blocks, Document Object Model (DOM) nodes are used as the split points.
3. The method of claim 1, characterized in that
When a web page is split into blocks, the content of each block is no shorter than 20 bytes.
4. The method of claim 1, characterized in that
The feature value of each block is calculated by applying a hash operation to the content of the block.
5. A web page template generation device, characterized by comprising:
An acquisition module, configured to obtain a predetermined number of web pages under the same directory as a web page address;
A computing module, configured to split each web page into several blocks and calculate a feature value for each block;
A statistics module, configured to count the calculated feature values;
A generation module, configured to save each feature value whose number of occurrences exceeds a preset threshold in a feature value library, as a feature value of the template part.
6. The device of claim 5, characterized in that the computing module is specifically configured to use Document Object Model (DOM) nodes as the split points when splitting a web page into blocks.
7. The device of claim 5, characterized in that the computing module is specifically configured to split a web page into blocks such that the content of each block is no shorter than 20 bytes.
8. The device of claim 5, characterized in that the computing module is specifically configured as follows:
The feature value of each block is calculated by applying a hash operation to the content of the block.
CN201610074217.6A 2010-08-20 2010-08-20 Webpage template generation method and device Pending CN105786972A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610074217.6A CN105786972A (en) 2010-08-20 2010-08-20 Webpage template generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610074217.6A CN105786972A (en) 2010-08-20 2010-08-20 Webpage template generation method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201010259001.XA Division CN101916285B (en) 2010-08-20 2010-08-20 A kind of method for analyzing internet web page contents and device

Publications (1)

Publication Number Publication Date
CN105786972A true CN105786972A (en) 2016-07-20

Family

ID=56402619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610074217.6A Pending CN105786972A (en) 2010-08-20 2010-08-20 Webpage template generation method and device

Country Status (1)

Country Link
CN (1) CN105786972A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108121743A (en) * 2016-11-30 2018-06-05 中移(苏州)软件技术有限公司 A kind of generation of generic web pages masterplate and application method, system
CN113535175A (en) * 2021-07-23 2021-10-22 工银科技有限公司 Application program front-end code generation method and device, electronic equipment and medium
CN115658993A (en) * 2022-09-27 2023-01-31 观澜网络(杭州)有限公司 Intelligent extraction method and system for core content of webpage

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101192234A (en) * 2007-06-07 2008-06-04 腾讯科技(深圳)有限公司 Searching system and method based on web page extraction
CN101464905A (en) * 2009-01-08 2009-06-24 中国科学院计算技术研究所 Web page information extraction system and method
US7765236B2 (en) * 2007-08-31 2010-07-27 Microsoft Corporation Extracting data content items using template matching

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101192234A (en) * 2007-06-07 2008-06-04 腾讯科技(深圳)有限公司 Searching system and method based on web page extraction
US7765236B2 (en) * 2007-08-31 2010-07-27 Microsoft Corporation Extracting data content items using template matching
CN101464905A (en) * 2009-01-08 2009-06-24 中国科学院计算技术研究所 Web page information extraction system and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
冯少卿等: "网页结构模板生成新方法研究", 《北京机械工业学院学报》 *
徐铁等: "网页信息抽取方法的研究", 《信息技术》 *
苏文健: "基于DOM和网页模板的信息抽取", 《万方数据库》 *

Similar Documents

Publication Publication Date Title
CN101916285B (en) A kind of method for analyzing internet web page contents and device
CN101706807B (en) Method for automatically acquiring new words from Chinese webpages
US7502995B2 (en) Processing structured/hierarchical content
US9218482B2 (en) Method and device for detecting phishing web page
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
CN101950312B (en) Method for analyzing webpage content of internet
US7483903B2 (en) Unsupervised learning tool for feature correction
US20090193044A1 (en) Web graph compression through scalable pattern mining
CN104283723B (en) Network access log processing method and processing device
JP5930496B2 (en) Method and apparatus for acquiring structured information in layout file
CN1912872A (en) Method and system for abstracting new word
CN102411617B (en) Method for storing and inquiring a large quantity of URLs
CN103491089B (en) Code-transferring method and system in a kind of data convert based on HTTP
CN108228710B (en) Word segmentation method and device for URL
CN104750704A (en) Webpage uniform resource locator (URL) classification and identification method and device
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN101339560B (en) Method and device for searching series data, and search engine system
CN112445997A (en) Method and device for extracting CMS multi-version identification feature rule
CN105528357A (en) Webpage content extraction method based on similarity of URLs and similarity of webpage document structures
CN105550359A (en) Webpage sorting method and device based on vertical search and server
JP4231298B2 (en) Information extraction rule creation system, information extraction rule creation program, information extraction system, and information extraction program
US20120054598A1 (en) Method and system for viewing web page and computer Program product thereof
CN104079623A (en) Method and system for controlling multilevel cloud storage synchrony
CN101727471A (en) Website content retrieval system and method
CN106407288B (en) Method and system for synchronously updating information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160720
