CN101950312B - Method for analyzing webpage content of internet - Google Patents

Method for analyzing webpage content of internet

Info

Publication number
CN101950312B
Authority
CN
China
Prior art keywords
webpage
template
web page
resolved
piecemeal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2010105127301A
Other languages
Chinese (zh)
Other versions
CN101950312A (en
Inventor
赵清政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN2010105127301A priority Critical patent/CN101950312B/en
Publication of CN101950312A publication Critical patent/CN101950312A/en
Application granted granted Critical
Publication of CN101950312B publication Critical patent/CN101950312B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a method for parsing Internet web page content, which belongs to the field of network technology. The method comprises the following steps. First, a web page template library is initialized and the web page to be parsed is read. Second, the URL of the page is used to judge whether the page was generated from a template. If it was not, the page is parsed in the conventional way. If it was, a hash value is generated for the directory of the page and looked up in the template hash table of the web page template library. If the value exists, the page is parsed with the template corresponding to that value; otherwise, pages of the same type as the page to be parsed are found and used to generate the corresponding template, and the page is then parsed with that template. The method greatly improves the accuracy and the quality of web page parsing.

Description

A method for parsing Internet web page content
Technical field
The invention belongs to the field of network technology and specifically relates to a method for parsing Internet web page content.
Background
In recent years, with the spread of the network, the growth of bandwidth, and the maturing of service models, search engines have gradually become a mainstream Internet application. Technically, an Internet search engine generally consists of two parts: an offline processing part and an online processing part. The offline part mainly comprises functional modules such as web page crawling, web page parsing, and index building. The online part takes the query terms submitted by the user, looks up the corresponding documents (that is, web pages) in the index and data generated by the offline modules, ranks the retrieved documents by certain criteria, and finally returns the ranked results to the user.
Throughout the operation of a search engine, web page parsing plays a fundamental and critical role: it effectively determines which data and content are used to build the index and can therefore ultimately be found by user queries. For technical and commercial reasons, the content of today's web pages is complex. Besides the content a page is really meant to convey, it is mixed with a great deal of irrelevant information, such as advertisements and junk links. Because the accuracy of page parsing largely determines the end-user experience of a search service, many methods have been invented to improve the parsing of page content. They fall into two categories:
The first treats the page as a character stream, collects statistics on the features of each part according to its tags and its position in the page, and derives the title, body text, and other parts of the page from those features.
The second uses a Document Object Model (DOM) tree. A DOM tree is first built from the original page, and the content of the page is then determined by comparing the attributes of each node in the tree.
In essence, both kinds of method use a set of predefined rules to select certain content from the page. Unfortunately, the layouts of web pages are now so varied that they cannot be enumerated. In practice these methods may suit some pages but not others: the final parsing result either contains junk information or loses genuinely useful information.
Summary of the invention
To address the problem that existing methods for parsing Internet web page content do not suit all web pages and produce inaccurate results, the invention provides a method for parsing Internet web page content.
The method provided by the invention comprises the following steps:
Step 1: initialize the web page template library and read the web page to be parsed. A template hash table is established in the web page template library; each identifier dirID recorded in this template hash table corresponds to one template.
Step 2: judge from the URL of the web page to be parsed whether the page was generated from a template. If not, go to step 3; otherwise, go to step 4.
Step 3: parse the page in the conventional way to obtain the parsing result.
Step 4: generate an identifier dirID for the directory of the web page to be parsed using a hash method, and look up whether that dirID exists in the template hash table of the web page template library. If it does, go to step 6; otherwise, go to step 5.
Step 5: find other pages of the same type as the web page to be parsed, to be used for generating a template. The number of pages used to generate the template must not be less than the minimum page threshold. Generate the template corresponding to the web page to be parsed from all the pages obtained. If a single fingerprint hash table records the feature values of the garbage blocks of all templates in the web page template library, update that fingerprint hash table; if a separate fingerprint hash table is established for each generated template, save the fingerprint hash table of the newly generated template. Add the generated template to the web page template library by adding the dirID of the directory of the web page to be parsed to the template hash table.
Step 6: parse the content of the web page to be parsed with the template corresponding to the dirID of its directory, and obtain the parsing result. Specifically: split the content of the page into blocks and generate a hash value for each block from its content. For each hash value, look up whether it exists in the fingerprint hash table of the web page template library or in the fingerprint hash table of the template corresponding to the page. If it exists, the corresponding block is ignored; if it does not, the content of the corresponding block is extracted. All extracted block contents together form the parsing result.
The pages of the same type mentioned in step 5 are, for static pages, pages that share the same directory, and, for dynamic pages, pages that belong to the same basic class under the same entry point.
Template generation in step 5 comprises the following steps. Step A: split the content of every page obtained into blocks. Step B: generate a feature value for each block from its content, using a hash method. Step C: count the frequency of occurrence of each kind of block according to its feature value. Step D: mark blocks whose frequency of occurrence is greater than a preset threshold as garbage blocks, and save the feature value of each garbage block in the fingerprint hash table. Step E: if a separate fingerprint hash table is established for each generated template, associate the directory of the web page to be parsed with the corresponding fingerprint hash table.
When page content is split into blocks in step A and step 6, the splitting must be consistent and stable. Pages are cut at natural boundaries given by the tags tr, td, and div, and each block is at least 20 bytes long. Structurally simple parts of a page are cut into large blocks of unrestricted length.
The preset threshold in step D has a minimum value of 3. When n is greater than 30, the threshold is n^0.3 rounded up, with a maximum value of 10, where n is the number of pages obtained for generating the template.
The method provided by the invention automatically determines whether a web page was generated from a template and automatically finds the template corresponding to the page, so that the page is parsed with the best-fitting template. This greatly improves the accuracy of web page parsing and markedly improves the quality of web page analysis.
Description of drawings
Fig. 1 is a flowchart of the steps of the Internet web page parsing method of the invention;
Fig. 2 is a flowchart of template generation in step 5 of the Internet web page parsing method of the invention.
Embodiment
The invention is described in further detail below with reference to the drawings and an embodiment.
The prior art cannot parse all web pages accurately and gives unsatisfactory results. The object of the invention is to overcome this defect by providing a method for parsing Internet web page content that analyzes and processes pages in a targeted way for each website, and even for the different channel pages of each website.
The method for parsing Internet web page content of the invention, shown in Fig. 1, comprises the following steps:
Step 1: initialize the web page template library and read the web page to be parsed.
For example, if the Uniform Resource Locator (URL) of the web page to be parsed is news.sina.com.cn, the URL of this page and the corresponding original page must be read.
In the initial state, the web page template library contains 0 templates. Each template in the web page template library is identified by an identifier dirID, and all dirIDs are kept in the template hash table, which stores its data as a hash table.
Step 2: judge whether the web page to be parsed was generated from a template. If not, go to step 3; otherwise, go to step 4.
The judgment is made by checking whether the URL of the page contains a directory separator "/" other than the "//" that follows "http". If it does, the page is considered to be generated from a template; if it does not, the page is considered not to be generated from a template.
For example, take the web page to be parsed news.sina.com.cn. Its URL shows that it is the news channel page of sina.com.cn and contains no directory separator, so it is not generated from a template, and parsing proceeds to step 3.
As another example, take the web page to be parsed http://news.sina.com.cn/h/2010-07-15/141820685517.shtml. From its URL, its directory is easily determined to be "http://news.sina.com.cn/h/2010-07-15", that is, the part before the last "/". Because this page is generated from a template, parsing proceeds to step 4.
For the web page http://item.taobao.com/item.htm?id=6660646078&cm_cat=110207, the directory is http://item.taobao.com, and the page is generated from a template.
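The URL check in step 2 can be sketched in a few lines of Python. This is only an illustration of the rule as described; the function names directory_of and is_template_generated are hypothetical, and dropping the query string before looking for "/" is an assumption made so that the taobao example yields the directory given in the text.

```python
def directory_of(url):
    """Return the directory part of url, or None if it has no directory separator."""
    # Assumption: ignore the query string and fragment, which may contain "/".
    base = url.split("?", 1)[0].split("#", 1)[0]
    scheme_end = base.find("://")
    host_start = scheme_end + 3 if scheme_end != -1 else 0
    last_slash = base.rfind("/")
    if last_slash < host_start:
        # The only slashes, if any, belong to the "//" after the scheme.
        return None
    # The directory is everything before the last "/".
    return base[:last_slash]


def is_template_generated(url):
    """Step 2 rule: a page is template-generated if its URL has a directory separator."""
    return directory_of(url) is not None
```

A page for which is_template_generated returns False would go to step 3, and any other page to step 4.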
Step 3: parse the page in the conventional way, obtain the parsing result, and end the parsing of this page. The conventional way means using the character-stream approach or the DOM-tree approach, in which a set of predefined rules selects certain features of the page.
Step 4: judge whether a template matching the web page to be parsed already exists in the web page template library. If so, go to step 6; otherwise, go to step 5.
An identifier dirID is generated for the directory of the web page to be parsed; identical directories have identical dirIDs. This embodiment generates the dirID with a hash method. For the directory "http://news.sina.com.cn/h/2010-07-15", suppose the dirID generated is 14130464512028122877, which is 0xc4197e9b76b31efd in hexadecimal. This dirID is looked up in the template hash table of the web page template library. If the value is not in the template hash table, no corresponding template exists in the library, and processing goes to step 5. If the value is in the template hash table, a corresponding template exists in the library, and processing goes to step 6.
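The lookup in step 4 might look like the sketch below. The text does not name the hash function; the 64-bit BLAKE2b digest used here is an assumption chosen only because the example dirID fits in 64 bits, so the values it produces will not match the dirID quoted above. The names dir_id and next_step are hypothetical, and a plain dict stands in for the template hash table.

```python
import hashlib


def dir_id(directory):
    """Hash a directory string to a 64-bit identifier (assumed hash: BLAKE2b)."""
    digest = hashlib.blake2b(directory.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def next_step(directory, template_table):
    """Return 6 if a template for the directory exists, otherwise 5."""
    return 6 if dir_id(directory) in template_table else 5


# The template hash table maps each dirID to its template; it starts empty.
template_table = {}
directory = "http://news.sina.com.cn/h/2010-07-15"
step_before = next_step(directory, template_table)   # no template yet
template_table[dir_id(directory)] = set()             # placeholder template
step_after = next_step(directory, template_table)    # template now present
```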
Step 5: find pages of the same type as the web page to be parsed, generate the template corresponding to the web page to be parsed, and add the dirID of the directory of the web page to be parsed to the template hash table of the web page template library.
For static pages, pages of the same type generally means the pages under the same directory. For dynamically generated pages, it means all pages in the same basic class under the same entry point.
For dynamically generated pages, all pages on a website that fall under the same subclass of the same entry point are defined as one basic class, where a subclass is a type that can be used for classification. For example, the URLs http://item.taobao.com/item.htm?id=4283563695&cm_cat=110207 and http://item.taobao.com/item.htm?id=6660646078&cm_cat=110207 both belong to cm_cat=110207, so cm_cat=110207 is one basic class.
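For dynamic pages, grouping by basic class could be sketched as follows. The function name basic_class is hypothetical, and treating the entry point as host plus path, with cm_cat as the classifying query parameter, is an assumption drawn from the single taobao example; other sites would need their own parameter.

```python
from urllib.parse import parse_qs, urlsplit


def basic_class(url, class_param="cm_cat"):
    """Return (host, path, class value) identifying a dynamic page's basic class."""
    parts = urlsplit(url)
    values = parse_qs(parts.query).get(class_param)
    if not values:
        return None  # the URL carries no classifying parameter
    # Assumption: the entry point is the host plus the script path.
    return (parts.netloc, parts.path, values[0])


a = basic_class("http://item.taobao.com/item.htm?id=4283563695&cm_cat=110207")
b = basic_class("http://item.taobao.com/item.htm?id=6660646078&cm_cat=110207")
same_type = a == b  # both pages fall in the basic class cm_cat=110207
```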
When generating a new template, other pages of the same type as the web page to be parsed must first be found. The number of pages used to generate the new template must be greater than or equal to the minimum page threshold for template generation. The template corresponding to the web page to be parsed is then generated from all the pages obtained, including the web page to be parsed itself. The minimum page threshold is an integer of at least 3; from a probabilistic point of view, more pages are better, and 10 or more pages are generally used. If too few pages are taken from the same directory, template content that should be filtered out will be retained. With the minimum of 3 pages, the default assumption is that the directory has only one template with no other nested templates, or that these 3 pages all belong to the same template.
Starting from a known page, for example http://news.sina.com.cn/h/2010-07-15/075320682851.shtml, the content of the page is analyzed to find as many pages of the same type as possible. If the page content contains no other pages of the same type, or fewer than the minimum page threshold for template generation, a path for discovering other pages must be sought. First, check whether the website already has other templates. If it does, reuse the way other page paths were discovered when those templates were generated, and look for the URLs of other pages by relative path. For example, if an existing template found a page under its directory at http://news.sina.com.cn/h/2010-08-26/105620979437.shtml, then the page at the corresponding relative path is http://news.sina.com.cn/h/2010-07-15/105620979437.shtml. Check whether the page at this relative path exists; if it does, one page of the same type has been found. If no other template has been established for this website, check whether other websites offer a ready-made pattern for reference, where a ready-made pattern means a method of finding multiple URLs of pages of the same type from one URL. If such patterns exist, try them one by one and look up pages by relative path until enough other pages under this directory are found. If none exist, look for the parent level in the URL of this page and continue searching from the parent URL for other pages under this directory, until enough pages have been found to generate the template. The search path is recorded to help establish other templates for this website and to enrich the means of establishing templates for other websites.
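The relative-path lookup in the example can be illustrated with a short sketch. The function name candidate_url is hypothetical; the sketch only builds the candidate address and leaves out the network request that would check whether the page actually exists.

```python
def candidate_url(target_directory, known_page_url):
    """Build a candidate same-type URL by reusing a known page's file name."""
    # Take the file name of a page found earlier under another directory.
    filename = known_page_url.rsplit("/", 1)[1]
    # Place it under the directory of the web page to be parsed.
    return target_directory + "/" + filename


candidate = candidate_url(
    "http://news.sina.com.cn/h/2010-07-15",
    "http://news.sina.com.cn/h/2010-08-26/105620979437.shtml",
)
```

Each candidate that turns out to exist counts as one more page of the same type toward the minimum page threshold.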
In this embodiment, 9 pages of the same type as the web page to be parsed are found, and the 10 pages, including the web page to be parsed, are analyzed to generate the corresponding template. Finally, the dirID 0xc4197e9b76b31efd of the directory of the page http://news.sina.com.cn/h/2010-07-15/075320682851.shtml is added to the template hash table of the web page template library.
As shown in Fig. 2, the template corresponding to the web page to be parsed is generated as follows:
Step A: split all the pages obtained into blocks.
Step B: generate a feature value, that is, a block fingerprint, for each block from its content. The block fingerprint is generated with a hash method; each block is represented by one hash value, and each page corresponds to multiple hash values.
Step C: count the frequency of occurrence of each kind of block according to its feature value.
Step D: mark blocks whose frequency of occurrence is greater than a preset threshold as garbage blocks. All garbage blocks form the garbage block set, and the feature value of each garbage block in the set is saved in the template library.
Step E: associate the directory of the web page to be parsed with the corresponding garbage block set, that is, associate the directory of the web page to be parsed with the corresponding fingerprint hash table.
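Steps B to E can be sketched as below, assuming the pages have already been split into blocks (step A). The names fingerprint and build_garbage_set are hypothetical, the BLAKE2b hash is an assumed stand-in for the unspecified hash method, and occurrences are counted across all blocks of all pages, which is one reading of the frequency count in step C.

```python
import hashlib
from collections import Counter


def fingerprint(block):
    """Step B: hash a block's bytes to a 64-bit fingerprint (assumed hash)."""
    return int.from_bytes(hashlib.blake2b(block, digest_size=8).digest(), "big")


def build_garbage_set(pages_blocks, threshold):
    """Steps C and D: return fingerprints that occur more than threshold times."""
    counts = Counter(fingerprint(b) for blocks in pages_blocks for b in blocks)
    return {fp for fp, count in counts.items() if count > threshold}


# Five same-type pages sharing a header and footer around a unique story.
pages = [
    [b"<div>site header</div>", b"<td>story %d</td>" % i, b"<div>site footer</div>"]
    for i in range(5)
]
garbage = build_garbage_set(pages, threshold=3)

# Step E: associate the directory with its garbage block set.
# The key here is the directory string; a dirID would be used in practice.
template_library = {"http://news.sina.com.cn/h/2010-07-15": garbage}
```

In this toy input the header and footer appear five times each and are marked as garbage, while every story block appears once and is kept.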
To split a page into blocks in step A, the page must be cut according to definite rules that ensure the cutting is consistent and that accidental collisions are improbable. Every web page is made up of many structured elements, such as p nodes, a nodes, the row tag tr, the column tag td, and the layer tag div. Rule-based approaches to page parsing also rely on this structure of the page itself.
In general, the larger the blocks, the faster the processing, because there are fewer blocks and therefore less data. Precision, however, is lower, because a large block may include part of the template, so that template content is treated as page-specific content; recall is correspondingly higher. For example, if each page is treated as a single block, recall is necessarily 100%. Conversely, the smaller the blocks, the slower the processing, the higher the precision, and the lower the recall.
To balance the two, cuts should be made at natural nodes, generally at tags such as tr, td, and div. The aim is consistent cutting: identical content should be cut into identical blocks wherever it appears. This matters especially toward the end of a page, because errors from inconsistent cutting accumulate, and the inconsistency becomes more pronounced further down the page. Block length is generally kept at no less than 20 bytes. Blocks that are too short increase the probability that different content yields the same fingerprint, can cause repeats within a single page, and increase the amount of computation without practical benefit. Cut points and block lengths are chosen to make collisions improbable and cutting consistent, so that, in a probabilistic sense, different content does not share a fingerprint. The larger the blocks, however, the more likely the consistency of cutting is to be broken. Block size is also closely tied to page structure: structurally simple parts of a page can be handled as large blocks, provided the structure there is not complex, because a large, structurally simple block helps correct the cutting. A structurally simple part is one in which the byte ratio of text to the tag text of tags such as tr, td, and div is greater than 10:1.
In summary: cut at natural boundaries given by tag nodes such as tr, td, and div; keep block length at no less than 20 bytes; and cut structurally simple parts of a page into blocks as large as possible, with no limit on length.
In practice, cutting starts from the first character of the page and scans for the configured nodes, such as td, tr, and div. When such a node is encountered, its position becomes the start of a block. The next position is found in the same way. If the distance between the two adjacent positions is greater than the configured minimum length, here 20 bytes, the part between the two positions is taken as one block, and a fingerprint is generated from its content with a hash method. The end of this block is also the start of the next block. If the distance between adjacent positions is less than the minimum length, the search continues with the next node and the intermediate node is treated as invalid, until a node is found whose distance from the start of the block is greater than 20 bytes or the end of the page is reached; a fingerprint is then generated for that block.
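The scanning procedure can be sketched as follows. The name split_blocks is hypothetical, and two points are assumptions: only opening tr, td, and div tags are treated as cut points, and a distance of exactly 20 bytes is accepted, following the "no less than 20 bytes" wording. The special handling of structurally simple parts is not covered.

```python
import re

# Assumption: only opening <tr>, <td>, and <div> tags act as cut points.
TAG_RE = re.compile(rb"<(?:tr|td|div)\b", re.IGNORECASE)


def split_blocks(html, min_len=20):
    """Cut page bytes into blocks at tag positions at least min_len bytes apart."""
    blocks = []
    start = None
    for match in TAG_RE.finditer(html):
        pos = match.start()
        if start is None:
            start = pos                      # first node opens the first block
        elif pos - start >= min_len:
            blocks.append(html[start:pos])   # close the block, open the next
            start = pos
        # Otherwise the node is too close to the block start and is skipped.
    if start is not None:
        blocks.append(html[start:])          # last block runs to the page end
    return blocks


page = (b"<div><td>a</td><td>this cell is long enough to stand alone</td></div>"
        b"<div>footer text here, long too</div>")
page_blocks = split_blocks(page)
```

Here the two td tags fall within 20 bytes of the opening div and are skipped, so the page is cut only at the second div.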
When generating fingerprint values, different blocks must have different fingerprint values, that is, fingerprint values must be collision-free, so a reliable hash method is chosen. This embodiment uses a hash method, which experiments have shown to be reliable and to keep fingerprint values collision-free.
In step C, the number of pages of the same type obtained is counted first. The fingerprints of all blocks under this directory are then put into a fingerprint hash table, and the number of occurrences of each kind of block, that is, the number of times each block fingerprint repeats, is counted. The fingerprint hash table stores its data as a hash table. Its size depends on the number of pages of the same type: generally 20 times the number of pages obtained, with a minimum capacity of 10,000 nodes.
In step D, the fingerprint hash table is traversed. Block fingerprints whose counts exceed the preset threshold are taken to be the template fingerprints of this directory and are saved in the fingerprint hash table established in the web page template library. This fingerprint hash table can be one table per template or one large table for all templates.
This step derives the template of the directory according to definite rules; the template may automatically include multiple sub-templates. The preset threshold is an integer with a minimum of 3 and a maximum of 10, preferably 5. Generally, when n is greater than 30, the threshold is n^0.3 rounded up, where n is the number of pages of the same type obtained for generating the template, including the web page to be parsed. The threshold is a value balanced between mathematical collision probability and practical application. Because block splitting ensures that the probability of different content having the same fingerprint is extremely low, and because the way pages are generated ensures that non-template parts have different fingerprints, the accuracy of template identification is assured both qualitatively and quantitatively.
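The threshold rule can be written out as a small function. The name garbage_threshold is hypothetical, and returning the minimum value 3 when n is 30 or less is an assumption, since the text gives the formula only for n greater than 30 and otherwise states a range with a preferred value of 5.

```python
import math


def garbage_threshold(n, low=3, high=10):
    """Return the garbage-block threshold for n same-type pages."""
    if n <= 30:
        return low  # assumption: use the minimum when the formula does not apply
    # n ** 0.3 rounded up, clamped to the range [low, high].
    return max(low, min(high, math.ceil(n ** 0.3)))


thresholds = {n: garbage_threshold(n) for n in (10, 100, 1000, 10 ** 6)}
```

With these inputs the threshold grows slowly, from 3 for 10 pages to 4 for 100 pages and 8 for 1000 pages, and is capped at 10 for very large n.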
Step E associates the dirID of the directory of the web page to be parsed with the feature value of each garbage block in the garbage block set, so that the dirID of one directory corresponds to the fingerprint hash table established in step C. In practice, step E may also omit the association between the dirID of the directory and the garbage block feature values and simply save the garbage block feature values of all directories in one overall fingerprint hash table.
Step 6: parse the content of the page with the template corresponding to the web page to be parsed.
For example, for the page http://news.sina.com.cn/h/2010-07-15/075320682851.shtml, the directory http://news.sina.com.cn/h/2010-07-15 is obtained from the URL as before, and the hash method yields the dirID 0xc4197e9b76b31efd for this directory. The template hash table of the web page template library is searched for this dirID. Because this template has already been generated, the dirID corresponding to the template is found in the template hash table.
The content of the web page to be parsed is first split into blocks, and a hash value is generated for each block from its content. Each hash value is looked up in the fingerprint hash table of the corresponding template, or in the overall fingerprint hash table. If the hash value is found, the block is a machine-generated template part; if it is not found, the block is a page-specific part. All the page-specific parts extracted from the page constitute its main content. The block-splitting method used here is the same as in step A of step 5.
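The filtering in step 6 reduces to a membership test per block, as in this sketch. The names fingerprint and extract_content are hypothetical, the BLAKE2b hash is an assumed stand-in for the unspecified hash method, and the blocks are assumed to come from the same splitting routine used during template generation.

```python
import hashlib


def fingerprint(block):
    """Hash a block's bytes to a 64-bit fingerprint (assumed hash)."""
    return int.from_bytes(hashlib.blake2b(block, digest_size=8).digest(), "big")


def extract_content(blocks, garbage_fingerprints):
    """Keep only blocks whose fingerprint is absent from the template's table."""
    return [b for b in blocks if fingerprint(b) not in garbage_fingerprints]


header = b"<div>site header</div>"
body = b"<td>the text of one particular story</td>"
footer = b"<div>site footer</div>"

# A template whose fingerprint hash table holds the header and footer.
garbage_fingerprints = {fingerprint(header), fingerprint(footer)}
main_content = extract_content([header, body, footer], garbage_fingerprints)
```

Only the story block survives the filter, which corresponds to the page-specific part described above.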

Claims (8)

1. a method for analyzing internet web page contents is characterized in that, this method specifically may further comprise the steps:
Webpage to be resolved is read in step 1, initialization web page template storehouse; Establish a template Hash table in the described web page template storehouse, all corresponding template of each the ident value dirID that writes down in this template Hash table;
Step 2, judge according to the URL url of webpage to be resolved whether webpage to be resolved is generated by template, if not, execution in step three, otherwise execution in step four;
Concrete judge that whether webpage to be resolved is generated by template is whether basis checks among the URL url of webpage to be resolved except " // " after " http "; The sign "/" that also has catalogue; This webpage to be resolved is generated by template if exist then, and this webpage to be resolved is not generated by template if do not exist then;
Step 3, resolve this webpage, obtain analysis result, finish the resolving of this webpage according to common mode;
Step 4, to the catalogue of webpage to be resolved, generate an ident value dirID through hash method, and in the template Hash table in web page template storehouse, search whether there is this ident value dirID, if having execution in step six, otherwise execution in step five;
Step 5, obtain the webpage of the same type of webpage to be resolved; Generate the template corresponding with webpage to be resolved; If adopt a fingerprint Hash table to write down the eigenwert of the rubbish piecemeal of all templates in the web page template storehouse; Then upgrade the fingerprint Hash table,, then preserve the corresponding fingerprint Hash table of template of the generation of setting up if set up a fingerprint Hash table to the template of each generation; Catalogue corresponding identification value dirID through with webpage to be resolved joins in the template Hash table in web page template storehouse, and the template that generates is joined in the web page template storehouse; The needed webpage number of described generation template otherwise less than the threshold value of minimum webpage;
Described webpage of the same type is meant the webpage that has same directory in the static Web page, or in the dynamic web page, with the webpage under the same down basic class of inlet;
Step 6, the content of webpage to be resolved is carried out piecemeal; And generate a cryptographic hash for each piecemeal according to the content of each piecemeal; To each cryptographic hash, search whether there is this cryptographic hash in the fingerprint Hash table of the template that fingerprint Hash table in the web page template storehouse or webpage to be resolved are corresponding, if exist; Then the corresponding piecemeal of this cryptographic hash is not dealt with; If do not exist, then extract the content of the corresponding piecemeal of this cryptographic hash, all piecemeal contents of extracting have constituted the content of analysis result.
2. The method for analyzing internet webpage content according to claim 1, characterized in that the minimum-webpage threshold required to generate a template in step 5 is greater than or equal to 3.
3. The method for analyzing internet webpage content according to claim 1 or 2, characterized in that the minimum-webpage threshold required to generate a template in step 5 is 10.
4. The method for analyzing internet webpage content according to claim 1, characterized in that generating the template in step 5 comprises the following steps:
Step A: split the content of every obtained webpage into blocks;
Step B: generate a feature value for each block, using a hash method;
Step C: using the feature values of the blocks, count how often each kind of block occurs;
Step D: mark every block whose occurrence frequency is greater than a preset threshold as a junk block, and save the feature value of each junk block in the fingerprint hash table;
Step E: if a separate fingerprint hash table is set up for each generated template, associate the identifier value dirID of the directory of the webpage to be analyzed with the fingerprint table of the corresponding template.
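Steps A to E amount to a frequency count over block hashes. The sketch below assumes the pages have already been split into blocks, that a block is counted at most once per page (the claim does not say how repeats within one page are counted), and that SHA-1 serves as the hash method; feature_value, build_fingerprints and register_template are hypothetical names.

```python
import hashlib
from collections import Counter


def feature_value(block: str) -> str:
    """Step B: feature value of a block (SHA-1 is an assumed choice)."""
    return hashlib.sha1(block.encode("utf-8")).hexdigest()


def build_fingerprints(pages, threshold):
    """Steps C and D: return feature values of blocks seen on more than
    `threshold` pages. `pages` is a list of block lists, one per page."""
    counts = Counter()
    for blocks in pages:
        # Assumption: count each distinct block once per page.
        counts.update({feature_value(b) for b in blocks})
    return {fv for fv, n in counts.items() if n > threshold}


def register_template(template_table, dir_id, pages, threshold):
    """Step E: associate a directory identifier with its fingerprint set."""
    template_table[dir_id] = build_fingerprints(pages, threshold)


# Usage: "nav" appears on 4 pages, above the threshold of 3, so it is junk.
sample_pages = [["nav", "story one"], ["nav", "story two"],
                ["nav", "story three"], ["nav", "story four"]]
templates = {}
register_template(templates, "example-dir-id", sample_pages, threshold=3)
```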
5. The method for analyzing internet webpage content according to claim 1 or 4, characterized in that the splitting of webpage content into blocks, whether of the webpage to be analyzed in step 6 or of all obtained webpages in step A, is performed as follows: the content is cut naturally at the tags tr, td and div; each block is no shorter than 20 bytes; the cut length is unrestricted for structurally simple parts of the webpage; here tr, td and div denote the row tag, the column (cell) tag and the layer tag of a webpage, respectively.
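One way to read the splitting rule of claim 5 is sketched below. It is only an approximation under stated assumptions: boundaries are found with a regular expression rather than a full HTML parser, remaining markup is stripped from each piece, and a fragment shorter than 20 bytes is merged into the following fragment (the claim gives the minimum length but not what happens to short fragments). split_blocks is a hypothetical name.

```python
import re

# Opening or closing tr, td and div tags act as block boundaries.
BOUNDARY = re.compile(r"</?(?:tr|td|div)\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def split_blocks(html: str, min_bytes: int = 20) -> list[str]:
    """Cut page content at tr/td/div tags into blocks of >= min_bytes bytes."""
    blocks = []
    pending = ""
    for piece in BOUNDARY.split(html):
        # Drop any other markup and collapse whitespace.
        text = " ".join(ANY_TAG.sub(" ", piece).split())
        if not text:
            continue
        # Assumption: short fragments accumulate until they reach min_bytes.
        pending = f"{pending} {text}".strip()
        if len(pending.encode("utf-8")) >= min_bytes:
            blocks.append(pending)
            pending = ""
    if pending:
        blocks.append(pending)  # trailing short fragment is kept as-is
    return blocks


page = ("<div>Home | About | Contact us</div>"
        "<table><tr><td>short</td>"
        "<td>This cell holds the main article text.</td></tr></table>")
page_blocks = split_blocks(page)
```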
6. The method for analyzing internet webpage content according to claim 4, characterized in that the preset threshold in step D has a minimum of 3; when n is greater than 30, it is n^0.3 rounded up, with a maximum of 10, where n is the number of webpages obtained for generating the template.
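The threshold rule of claim 6 can be written as a small function. This sketch assumes the garbled exponent in the source is n raised to the power 0.3 and that "rounded up" means the ceiling; junk_threshold is a hypothetical name.

```python
import math


def junk_threshold(n: int) -> int:
    """Preset threshold for step D, given n pages used to build the template.

    Assumed reading: at least 3; for n > 30, ceil(n ** 0.3); never above 10.
    """
    if n <= 30:
        return 3
    return min(10, max(3, math.ceil(n ** 0.3)))
```

Under this reading the threshold grows slowly with the sample size: roughly 4 for a hundred pages and 8 for a thousand, before it is capped at 10.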
7. The method for analyzing internet webpage content according to claim 6, characterized in that the preset threshold is set to 5.
8. The method for analyzing internet webpage content according to claim 1, characterized in that the template hash table and the fingerprint hash table both store their data as hash tables.
CN2010105127301A 2010-08-18 2010-10-20 Method for analyzing webpage content of internet Expired - Fee Related CN101950312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105127301A CN101950312B (en) 2010-08-18 2010-10-20 Method for analyzing webpage content of internet

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201010256575.1 2010-08-18
CN201010256575 2010-08-18
CN2010105127301A CN101950312B (en) 2010-08-18 2010-10-20 Method for analyzing webpage content of internet

Publications (2)

Publication Number Publication Date
CN101950312A CN101950312A (en) 2011-01-19
CN101950312B CN101950312B (en) 2012-07-04

Family

ID=43453814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105127301A Expired - Fee Related CN101950312B (en) 2010-08-18 2010-10-20 Method for analyzing webpage content of internet

Country Status (1)

Country Link
CN (1) CN101950312B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102790967B (en) * 2011-05-19 2015-02-04 华晶科技股份有限公司 Wireless network access method
CN103365865B (en) * 2012-03-29 2017-07-11 腾讯科技(深圳)有限公司 Date storage method, data download method and its device
CN103853656B (en) * 2012-11-30 2016-08-10 阿里巴巴集团控股有限公司 Webpage method of testing and device
EP3941015A1 (en) * 2012-12-28 2022-01-19 Huawei Technologies Co., Ltd. Method, apparatus, and network system for identifying website
CN104111928A (en) * 2013-04-17 2014-10-22 北京百度网讯科技有限公司 Web page building method, web page rendering method, web page building device and web page rendering device
CN103345532A (en) * 2013-07-26 2013-10-09 人民搜索网络股份公司 Method and device for extracting webpage information
CN103593467B (en) * 2013-11-26 2017-05-24 优视科技有限公司 Method and device for generating webpage template and achieving incremental transmission
WO2015078231A1 (en) 2013-11-26 2015-06-04 优视科技有限公司 Method for generating webpage template and server
CN105574004B (en) * 2014-10-10 2019-06-21 阿里巴巴集团控股有限公司 A kind of removing duplicate webpages method and apparatus
CN108021598B (en) * 2016-11-04 2022-05-03 阿里巴巴(中国)有限公司 Page extraction template matching method and device and server
CN111401021A (en) * 2018-12-17 2020-07-10 北大方正集团有限公司 Publication template construction method, device, equipment and computer-readable storage medium
CN111125565A (en) * 2019-11-01 2020-05-08 上海掌门科技有限公司 Method and equipment for inputting information in application
CN113434612B (en) * 2021-07-09 2024-01-26 青岛海尔科技有限公司 Data statistics method and device, storage medium and electronic device

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101192234A (en) * 2007-06-07 2008-06-04 腾讯科技(深圳)有限公司 Searching system and method based on web page extraction
CN101276362A (en) * 2007-03-26 2008-10-01 国际商业机器公司 Apparatus and method for optimizing and differencing web page browsing
CN101464905A (en) * 2009-01-08 2009-06-24 中国科学院计算技术研究所 Web page information extraction system and method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20040060008A1 (en) * 2002-01-18 2004-03-25 John Marshall Displaying statistical data for a web page by dynamically modifying the document object model in the HTML rendering engine

Also Published As

Publication number Publication date
CN101950312A (en) 2011-01-19

Similar Documents

Publication Publication Date Title
CN101950312B (en) Method for analyzing webpage content of internet
EP2447864A1 (en) Update notification method and system
US8977606B2 (en) Method and apparatus for generating extended page snippet of search result
JP2006004417A (en) Method and device for recognizing specific type of information file
US20130185429A1 (en) Processing Store Visiting Data
CN101727447A (en) Generation method and device of regular expression based on URL
CN103324622A (en) Method and device for automatic generating of front page abstract
CN103617174A (en) Distributed searching method based on cloud computing
CN101916285A (en) Method and device for analyzing internet web page contents
CN104268148A (en) Forum page information auto-extraction method and system based on time strings
US11249993B2 (en) Answer facts from structured content
CN105550359A (en) Webpage sorting method and device based on vertical search and server
CN107862039A (en) Web data acquisition methods, system and Data Matching method for pushing
CN104765882A (en) Internet website statistics method based on web page characteristic strings
CN103118028B (en) Based on the security sweep method and system of web analysis
CN106339381B (en) Information processing method and device
Lin et al. Combining a segmentation-like approach and a density-based approach in content extraction
CN109948015B (en) Meta search list result extraction method and system
CN103324640B (en) A kind of method, device and equipment determining search result document
CN115796146A (en) File comparison method and device
CN103870590A (en) Webpage identification method and device with error-reported characteristic
Thamviset et al. Bottom-up region extractor for semi-structured web pages
CN111782958A (en) Recommendation word determining method and device, electronic device and storage medium
JP2013254366A (en) Information processing device and related word determination method
JP5903372B2 (en) Keyword relevance score calculation device, keyword relevance score calculation method, and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120704

Termination date: 20141020

EXPY Termination of patent right or utility model