CN103345532A - Method and device for extracting webpage information - Google Patents

Info

Publication number
CN103345532A
Authority
CN
China
Prior art keywords
dom
sample
webpage
unit
identity label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103202797A
Other languages
Chinese (zh)
Inventor
李杨瑞
崔世起
杨青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PEOPLE SEARCH NETWORK AG
Original Assignee
PEOPLE SEARCH NETWORK AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PEOPLE SEARCH NETWORK AG filed Critical PEOPLE SEARCH NETWORK AG
Priority to CN2013103202797A priority Critical patent/CN103345532A/en
Publication of CN103345532A publication Critical patent/CN103345532A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for extracting webpage information. The method comprises: determining an identity label of a webpage to be extracted according to the page information of that webpage; looking up, in a sample database, the sample set corresponding to the identity label, wherein the sample set comprises at least one DOM (Document Object Model) sample; selecting one of the DOM samples as the current DOM sample and matching it against the DOM structure parsed from the webpage to be extracted; if the match succeeds, locating the node that holds the information to be extracted in the DOM structure according to the position of that information in the current DOM sample, and obtaining the information from that node; if the match fails, repeating the step of selecting a current DOM sample, and returning an extraction failure once matching has failed for every DOM sample. The method and device minimize the influence of changes in webpage structure on the extraction process, so that webpage information is extracted reliably and accurately.

Description

Method and device for extracting webpage information
Technical field
The present invention relates to the field of network technology, and in particular to a method and a device for extracting webpage information.
Background art
With the continuous development of Internet technology, the Internet has become an important platform for publishing information, and how to obtain the information a user needs from the Internet quickly and accurately has become an urgent problem. Webpage information extraction treats the Internet as an information source: webpages of interest to the user are obtained from different sources, information is extracted from them, and the extracted information is stored in a database so that the user can query, search, mine, or analyze it. The purpose of webpage information extraction is to pull out the semi-structured information that a webpage presents as text and express it as structured data, thereby converting text that is hard to process into structured data that is easy to process and analyze.
At present, commonly used webpage information extraction methods are mainly based on general-purpose algorithms. The concrete steps are as follows:
First, the webpage information of different websites is analyzed, and rules are set for each entry a website contains (such as title, body text, and time) according to the features of that entry. For example, the features of the titles of Sina News pages are analyzed, and at least one extraction rule for titles is abstracted from them.
Second, the rules that different websites set for the same entry are collected and the rules they have in common are extracted; at the same time, the special rules each website has for that entry are also recorded, to guarantee the recall rate.
Then, when webpage information needs to be extracted, the rules corresponding to the information to be extracted are loaded and matched against the webpage to be extracted, to judge whether the webpage contains information that conforms to the loaded rules. If it does, that information is extracted; if it does not, the extraction is judged to have failed.
This scheme has the following shortcomings:
Because the webpages of different websites must be analyzed one by one before the corresponding extraction rules can be set according to the analysis results, the cycle for setting extraction rules is long.
In addition, although rules common to different websites can be extracted for the same entry to reduce the amount of data the database stores, each website may still have many special rules, which greatly limits that reduction. Moreover, the special rules that different websites have for the same entry may be mutually exclusive, so the information of websites with mutually exclusive rules cannot be extracted at the same time. For example, the location rule for extracting the news source on ordinary websites is "near the title, before the body text", whereas on some government websites the news source appears after the body text; a single location rule obviously cannot extract the news source from both types of website.
Summary of the invention
The embodiments of the invention provide a method and a device for extracting webpage information, with the aim of extracting webpage information accurately and reliably.
To this end, the embodiments of the invention provide the following technical solutions:
A method for extracting webpage information, the method comprising:
determining the identity label of a webpage to be extracted according to the page information of the webpage to be extracted;
looking up, in a sample database, the sample set corresponding to the identity label of the webpage to be extracted, the sample set comprising at least one Document Object Model (DOM) sample;
selecting one of the at least one DOM sample as the current DOM sample, and matching the current DOM sample against the DOM structure parsed from the webpage to be extracted:
if the match succeeds, locating the node of the information to be extracted in the DOM structure according to the position of that information in the current DOM sample, and using the node to obtain the information to be extracted;
if the match fails, continuing to perform the step of selecting a current DOM sample, and returning an extraction failure once matching has failed for every DOM sample.
Preferably, the sample database is created as follows:
analyzing the DOM structure parsed from a webpage, and extracting the DOM sample corresponding to the DOM structure;
organizing the extracted DOM samples into a sample set, with the identity label of the webpage as the key;
establishing the correspondence between the sample set and the identity label, and saving the sample set and the correspondence in the sample database.
Preferably, extracting the DOM sample corresponding to the DOM structure comprises:
judging whether at least two DOM structures are compatible with each other; if they are, using a wildcard to merge the DOM samples corresponding to the at least two DOM structures into one; if they are not, extracting the DOM sample corresponding to each DOM structure.
Preferably, the identity label is a site name, a sub-site name, or the signature of a site-building tool generator.
Preferably, looking up, in the sample database, the sample set corresponding to the identity label of the webpage to be extracted comprises:
calculating the hash value of the identity label of the webpage to be extracted;
searching a preset hash table to determine the identity number of the identity label that corresponds to the hash value;
determining the corresponding sample set according to the identity number.
A device for extracting webpage information, the device comprising:
an identity label determining unit, configured to determine the identity label of a webpage to be extracted according to the page information of the webpage to be extracted;
a lookup unit, configured to look up, in a sample database, the sample set corresponding to the identity label of the webpage to be extracted, the sample set comprising at least one Document Object Model (DOM) sample;
a selection unit, configured to select one of the at least one DOM sample as the current DOM sample;
a matching unit, configured to match the current DOM sample against the DOM structure parsed from the webpage to be extracted;
a positioning unit, configured to, when the matching unit matches successfully, locate the node of the information to be extracted in the DOM structure according to the position of that information in the current DOM sample, and use the node to obtain the information to be extracted;
a notification unit, configured to, when the matching unit fails to match, notify the selection unit to continue selecting a current DOM sample, and to return an extraction failure once matching has failed for every DOM sample.
Preferably, the device further comprises:
a DOM sample extraction unit, configured to analyze the DOM structure parsed from a webpage and extract the DOM sample corresponding to the DOM structure;
an organizing unit, configured to organize the extracted DOM samples into a sample set, with the identity label of the webpage as the key;
a saving unit, configured to establish the correspondence between the sample set and the identity label, and to save the sample set and the correspondence in the sample database.
Preferably, the DOM sample extraction unit comprises:
a judging unit, configured to judge whether at least two DOM structures are compatible with each other;
a merging unit, configured to, when the judging unit judges that the at least two DOM structures are compatible, use a wildcard to merge the DOM samples corresponding to the at least two DOM structures into one;
an extraction subunit, configured to, when the judging unit judges that the at least two DOM structures are not compatible, extract the DOM sample corresponding to each DOM structure.
Preferably, the lookup unit comprises:
a computing unit, configured to calculate the hash value of the identity label of the webpage to be extracted, as determined by the identity label determining unit;
an identity number determining unit, configured to search a preset hash table and determine the identity number of the identity label that corresponds to the hash value;
a sample set determining unit, configured to determine the corresponding sample set according to the identity number.
The method and device for extracting webpage information of the embodiments of the invention automatically analyze the page information of the webpage to be extracted and determine its identity label accordingly, look up the sample set corresponding to that identity label in the sample database, and match the DOM structure parsed from the webpage against the DOM samples in the sample set one by one. Once a matching DOM sample is found, the node holding the information to be extracted is located in the DOM structure according to the position of that information in the matching DOM sample, and the information is obtained from the located node. The DOM samples of the invention are stable and generally applicable, which minimizes the influence of webpage structure changes on the extraction process and allows webpage information to be extracted reliably and accurately.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the present application; those of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flowchart of creating the sample database in the present invention;
Fig. 2 is a flowchart of the webpage information extraction method of the present invention;
Fig. 3 is a flowchart of looking up the sample set in step 202 of the present invention;
Fig. 4 is a schematic diagram of Embodiment 1 of the webpage information extraction device of the present invention;
Fig. 5 is a schematic diagram of Embodiment 2 of the webpage information extraction device of the present invention;
Fig. 6 is a schematic diagram of the DOM sample extraction unit in the present invention;
Fig. 7 is a schematic diagram of the lookup unit in the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the scheme of the invention, the embodiments of the invention are described in further detail below with reference to the drawings and implementations.
The application scenario of the invention and the preparation work done before information extraction are introduced first.
Webpage information extraction is an important part of page analysis in a search engine: the webpage content of interest to the user is extracted and organized into structured data, which helps the search engine index and retrieve webpages more effectively. The present invention provides an automatic and reliable scheme for webpage information extraction.
For a website, the structure of its HTML (HyperText Markup Language, a markup language for describing web documents) pages is not entirely hand-edited; it is mainly built with site-building tools and template code. The pages of a website generated by such code should therefore share a uniform pattern, for example forums built on Discuz (a freely downloadable PHP forum program) or PHPWind (a forum program based on PHP and MySQL), or personal blogs built on WordPress (a blog platform developed in PHP).
In other words, the webpage structure of a website is highly regular: the main page content should all share a uniform DOM (Document Object Model) tree path, and once the rule of this path is found, the content items to be extracted can be obtained directly and accurately.
With this in mind, the scheme of the invention first performs DOM path analysis on a website and extracts DOM samples for its different entries; for example, the DOM paths of all posts on the Tianya forum are analyzed and a DOM sample for posts is extracted.
After the DOM samples of the different entries have been extracted, a sample database should also be created from them so that webpage information can be extracted automatically and accurately; when needed, the DOM samples in the sample database are matched against the webpage to be extracted.
Referring to Fig. 1, which shows a flowchart of creating the sample database in the present invention, the process may comprise:
Step 101: analyze the DOM structure parsed from a webpage, and extract the DOM sample corresponding to the DOM structure.
Before a DOM sample is extracted, a tree-building tool first converts the website's pages into the parsed form of a DOM tree; the parsed DOM structure is then analyzed and the DOM samples are extracted accordingly.
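The tree-building step can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` module as the tree-building tool; the patent does not name one, and the `Node`, `TreeBuilder`, `parse_dom`, and `dom_paths` names are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element of the parsed DOM tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dom_paths(node, prefix=()):
    """Yield (path, node) pairs in document order, e.g. 'html/body/div/h1'."""
    for child in node.children:
        path = prefix + (child.tag,)
        yield "/".join(path), child
        yield from dom_paths(child, path)
```

Listing the paths of a page this way gives the raw material from which a DOM sample for a field such as the title could be chosen.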
In general, DOM samples can be extracted according to the fields the user is interested in. Taking the Tianya forum posts again as an example, the fields may be the post content, the post title, and the posting time, and at least three DOM samples can be extracted for these three fields (which can also be understood as three entries). This is mainly because each field generally corresponds to one path that describes its regularity; in special cases, however, several DOM paths for the same field may be mutually incompatible, and one field may then correspond to at least two DOM paths.
In one implementation of this step, the DOM sample corresponding to a DOM structure can be extracted as follows: judge whether at least two DOM structures are compatible with each other; if they are, use a wildcard to merge the DOM samples corresponding to those DOM structures into one; if they are not, extract the DOM sample corresponding to each DOM structure.
Take extracting the DOM sample for the title of a Sina News page as an example. The Sina news pages are first parsed into DOM structures. If two DOM structures currently exist for the news title, it is judged whether the two are compatible. If they are not, two DOM samples are extracted from the two DOM structures; that is, both samples are needed to cover all Sina news pages when extracting news titles. If they are compatible, for example because the second-level nodes of the two DOM structures differ but are compatible, the second-level node can be replaced with a wildcard (such as an asterisk *, a question mark, or another user-defined symbol; the invention places no restriction on this), so that one DOM sample is extracted from the two DOM structures for subsequent fuzzy information extraction.
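A minimal sketch of the wildcard merge follows. The patent does not define when two DOM structures are "compatible"; the sketch assumes, purely for illustration, that two slash-separated paths are compatible when they have the same depth and differ at no more than `max_diff` levels. The function names are hypothetical.

```python
def merge_dom_paths(path_a, path_b, wildcard="*", max_diff=1):
    """Merge two DOM paths into one wildcard path, or return None.

    Assumed compatibility rule (not from the patent): equal depth and
    at most `max_diff` differing levels.
    """
    a, b = path_a.split("/"), path_b.split("/")
    if len(a) != len(b):
        return None
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    if diffs > max_diff:
        return None
    return "/".join(x if x == y else wildcard for x, y in zip(a, b))


def extract_samples(path_a, path_b):
    """Return one merged sample if compatible, otherwise both samples."""
    merged = merge_dom_paths(path_a, path_b)
    return [merged] if merged is not None else [path_a, path_b]
```

With this rule, `html/body/div/h1` and `html/body/section/h1` collapse into the single sample `html/body/*/h1`, while paths of different depth stay as two separate samples.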
Step 102: organize the extracted DOM samples into a sample set, with the identity label of the webpage as the key.
Because the sample database of the invention is not limited to one particular website or to particular sub-sites of one website, after step 101 has extracted the DOM samples for the different fields, the identity label of the webpage should also be used to organize the extracted DOM samples into sample sets.
The identity label may be a site name, such as Sina, Sohu, or Tencent; the sample set organized under that key then contains all DOM samples extracted from the website. Alternatively, the identity label may be a sub-site name, such as the news, sports, or entertainment sub-site of a website; the sample set organized under that key then contains all DOM samples extracted from the sub-site. Alternatively, the identity label may be the signature of a site-building tool generator such as Discuz, PHPWind, or WordPress; the sample set organized under that key then contains the DOM samples shared by the websites or sub-sites built with that generator.
Step 103: establish the correspondence between the sample set and the identity label, and save the sample set and the correspondence in the sample database.
After step 102, the correspondence between an identity label and the sample set organized under it can be established, and both the correspondence and the sample set are saved in the sample database, so that when information needs to be extracted, the required samples can be looked up in this database and used for DOM path matching.
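Steps 102 and 103 can be sketched as an in-memory mapping from identity label to sample set. This is only an illustration: the patent does not prescribe a storage format, and the `SampleDatabase` class and the shape of a sample record (`field` plus `path`) are assumptions.

```python
from collections import defaultdict


class SampleDatabase:
    """Hypothetical in-memory sample database keyed by identity label."""

    def __init__(self):
        # identity label -> list of DOM samples, in insertion order
        self._sample_sets = defaultdict(list)

    def add_sample(self, identity_label, field, dom_path):
        """Organize a DOM sample under its identity label (step 102)."""
        self._sample_sets[identity_label].append({"field": field, "path": dom_path})

    def lookup(self, identity_label):
        """Return the sample set for a label, or an empty list if none exists."""
        return list(self._sample_sets.get(identity_label, []))
```

A caller might register `db.add_sample("news.sina.com.cn", "title", "html/body/*/h1")` while building the database and later call `db.lookup("news.sina.com.cn")` before matching.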
It should be noted that the sample sets saved in the database may be ordered by the time they were stored, or ordered according to other usage requirements, for example by calculating a credibility score for each sample set and sorting by that score; the invention places no restriction on this.
The information extraction process of the invention is explained below.
Referring to Fig. 2, which shows a flowchart of the webpage information extraction method of the present invention, the method may comprise:
Step 201: determine the identity label of the webpage to be extracted according to the page information of the webpage to be extracted.
Step 202: look up, in the sample database, the sample set corresponding to the identity label of the webpage to be extracted, the sample set comprising at least one Document Object Model (DOM) sample.
The page information of the webpage to be extracted, such as its URL (Uniform Resource Locator) and its Meta information (meta-information about the page, from which, for example, the Discuz version can be obtained), is analyzed automatically to determine the identity label of the webpage. The sample set corresponding to this identity label can then be looked up in the sample database for use in the subsequent sample matching.
For example, the identity label determined from "sina.com" is the Sina website. The sample set found in the sample database for this identity label contains the DOM samples of all pages of the Sina website; these DOM samples cover the pages of all Sina sub-sites, all the entries those sub-sites contain, and the different rules (that is, DOM paths) of each entry.
Likewise, the identity label determined from "news.sina.com.cn" is the news sub-site of Sina. The sample set found in the sample database for this identity label contains the DOM samples of the news sub-site pages; these DOM samples cover all the entries the news sub-site contains and the different rules (that is, DOM paths) of each entry.
Depending on actual needs, a site name, a sub-site name, or an even finer division name can be set as the identity label, with the corresponding sample set organized under it as the key; this improves the flexibility of the scheme. When more comprehensive information needs to be extracted, a coarser division such as the site name can be used as the identity label; when more accurate information needs to be extracted in a shorter time, a finer division name can be used. The invention places no restriction on this.
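One way step 201 might be realized is sketched below. The rules are assumptions made for illustration, not the patent's method: the host name from the URL is tried first, from the finest sub-site down to coarser suffixes, and the generator signature in a `<meta name="generator">` tag is used as a fallback. The `determine_identity_label` function and its `known_labels` parameter (the set of labels present in the sample database) are hypothetical.

```python
import re
from urllib.parse import urlparse

# Matches e.g. <meta name="generator" content="Discuz! X3" />
GENERATOR_META = re.compile(
    r'<meta\s+name=["\']generator["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def determine_identity_label(url, html, known_labels):
    """Return the identity label for a page, or None if none is known."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Try the full host name first, then progressively shorter suffixes,
    # so a sub-site label wins over the label of its parent site.
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in known_labels:
            return candidate
    # Fall back to the generator signature advertised in the page's meta tag.
    found = GENERATOR_META.search(html)
    if found:
        content = found.group(1).lower()
        for label in sorted(known_labels):
            if label.lower() in content:
                return label
    return None
```

Trying the finest label first reflects the trade-off described above: a sub-site label gives a smaller, more precise sample set, while the site-level label is the broader fallback.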
Step 203: select one of the at least one DOM sample as the current DOM sample, and match the current DOM sample against the DOM structure parsed from the webpage to be extracted.
After the sample set corresponding to the identity label has been found, the DOM samples in the set serve as signposts: the match search starts from the root node of the DOM structure of the webpage to be extracted and judges whether that DOM structure matches a DOM sample in the sample set (matching mainly concerns node level and node position). Webpage information is then extracted according to the matching result.
It should be noted that, because the sample set contains at least one DOM sample, the DOM structure of the webpage to be extracted can be matched against the DOM samples in the sample set one by one in order to obtain the information to be extracted, until a DOM sample that matches it is found or until matching has failed for every DOM sample.
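Matching a single DOM sample against a page's DOM structure, level by level from the root, can be sketched as follows. The sketch assumes the page is well-formed enough to be parsed by the standard `xml.etree.ElementTree` module and that a DOM sample is a slash-separated tag path in which `*` matches any tag; both are illustrative assumptions, and `match_sample` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def match_sample(root, sample_path, wildcard="*"):
    """Follow sample_path from the root; return the matched nodes or []."""
    steps = sample_path.split("/")
    # The first step must match the root node itself.
    if steps[0] not in (wildcard, root.tag):
        return []
    frontier = [root]
    for step in steps[1:]:
        # Descend one level: keep every child whose tag fits this step.
        frontier = [
            child
            for node in frontier
            for child in node
            if step == wildcard or child.tag == step
        ]
        if not frontier:
            return []  # the path breaks at this level, so the match fails
    return frontier
```

An empty result corresponds to a failed match; a non-empty result is the node or node group at the position the sample describes.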
These two cases are explained below with reference to steps 204 and 205.
Step 204: if the match succeeds, locate the node of the information to be extracted in the DOM structure according to the position of that information in the current DOM sample, and use the node to obtain the information to be extracted.
If the current DOM sample selected from the sample set matches the DOM structure parsed from the webpage to be extracted, the sample matching process ends and the information extraction process begins:
First, the position of the information to be extracted in the current DOM sample is obtained, that is, the node of the current DOM sample on which the information is located (how many nodes each DOM sample contains, which level each node is on, and which kind of information each node represents are all known from the preparation work). Second, the node holding the information to be extracted is located in the DOM structure of the webpage to be extracted according to this position, and the located node or node group is recorded. Finally, the information to be extracted is obtained from the position of the located node, which completes the webpage information extraction process of the invention. It should be noted that, after the information has been obtained from the located node, post-processing logic configured by the operator can also be loaded to customize the obtained information, for example by filtering out impurities such as advertisements contained in the text, to obtain the text the user needs.
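The locate-then-post-process sequence can be sketched as below. It assumes the page parses with `xml.etree.ElementTree` and that the position in the DOM sample is expressed as a path relative to the root element, which `Element.findall` can resolve (including `*`). The `[AD: ...]` marker and the `strip_ads` filter are invented for the example; real operator-configured post-processing would differ.

```python
import re
import xml.etree.ElementTree as ET


def strip_ads(text):
    # Hypothetical rule: drop bracketed advertisement markers such as "[AD: ...]".
    return re.sub(r"\s*\[AD:[^\]]*\]", "", text).strip()


def extract(root, relative_path, postprocess=()):
    """Locate nodes by path, read their text, then apply post-processing steps."""
    values = []
    for node in root.findall(relative_path):
        text = "".join(node.itertext()).strip()
        for step in postprocess:
            text = step(text)  # each step is an operator-configured callable
        values.append(text)
    return values
```

Keeping post-processing as a list of callables lets an operator add or remove filters without touching the locating logic.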
Step 205: if the match fails, continue to perform the step of selecting a current DOM sample, and return an extraction failure once matching has failed for every DOM sample.
If the DOM structure of the webpage to be extracted fails to match the current DOM sample, for example because the two, starting from the root node, cannot be matched at the third-level node (that is, the node indicated by the DOM sample path cannot be found in the DOM structure built for this webpage), this matching attempt is judged to have failed. It can then be checked whether the sample set still contains DOM samples that have not yet been matched:
If not, information extraction is judged to have failed, and an extraction failure can be returned to the operator;
If so, one of the unmatched DOM samples is selected as the current DOM sample and the above matching process is repeated. One of two outcomes then follows: either the match succeeds, in which case the located node is output and the matching process stops; or the match fails, in which case a new current DOM sample is selected again, until no unmatched DOM sample remains and an extraction failure is output.
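The select, match, and retry loop of steps 203 to 205 can be summarized in a short sketch. It again assumes an `xml.etree.ElementTree` page and DOM samples written as root-relative paths, and it signals extraction failure by returning `None`; the patent only says that a failure is returned, so that convention and the `extract_information` name are assumptions.

```python
import xml.etree.ElementTree as ET


def extract_information(root, sample_set):
    """Try each DOM sample in turn; return extracted texts, or None on failure."""
    for current_sample in sample_set:           # step 203: select a current sample
        nodes = root.findall(current_sample)    # match against the page's DOM
        if nodes:                               # step 204: match succeeded
            return ["".join(n.itertext()).strip() for n in nodes]
        # step 205: match failed, fall through to the next unmatched sample
    return None                                 # every sample failed to match
```

Because the loop stops at the first matching sample, the order of samples in the set decides which one wins when several could match.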
The scheme of the invention can thus extract webpage information automatically and reliably. Moreover, once a website has been built, its architecture is generally not modified lightly, so the DOM samples in the sample database have stability and general applicability. In addition, by using wildcards or omitting certain tags, a DOM sample can support as general a syntax as possible, which minimizes the influence of webpage structure changes on the DOM sample and allows webpage information to be extracted reliably and accurately. Furthermore, compared with the prior art, the DOM sample extraction cycle of the invention is shorter and the storage redundancy is smaller; the samples are generally applicable to websites built on the same architecture, so the information of such websites can be extracted at the same time.
It should be noted that the sample database of the invention can be continuously updated and expanded according to user needs, which may affect the process in step 202 of looking up the sample set by the identity label of the webpage to be extracted. The invention therefore also provides a preferred scheme for looking up the sample set. In this scheme, an index table also needs to be built:
First, the hash value of each identity label saved in the database is calculated, for example by hashing the name of the Sina website to obtain the corresponding hash value. Then the correspondence between this hash value and the identity number of the identity label in the sample database is established and saved in the index table.
Accordingly, the preferred way of looking up the sample set in step 202 is shown in the flowchart of Fig. 3 and may comprise:
Step 301: calculate the hash value of the identity label of the webpage to be extracted.
Step 302: search the preset hash table to determine the identity number of the identity label that corresponds to the hash value.
Step 303: determine the corresponding sample set according to the identity number.
After the identity label of the webpage to be extracted has been determined, it is no longer compared one by one with the identity labels saved in the database. Instead, the hash value of the identity label is calculated first, the identity number of the identity label in the database is determined by looking up the hash table (that is, the index table above), and the sample set corresponding to that identity number is then found and used for the subsequent matching. Compared with finding the sample set by one-by-one comparison, this preferred approach greatly improves lookup speed and thus the efficiency of webpage information extraction.
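Steps 301 to 303 can be sketched with an explicit index table. The patent does not name a hash function; SHA-1 from Python's `hashlib` is used here only as a stand-in, and the `IndexedSampleDatabase` class is hypothetical. (A Python `dict` is itself a hash table, so the explicit hashing is shown purely to mirror the three steps.)

```python
import hashlib


def label_hash(identity_label):
    # Assumed hash function; the patent leaves the choice open.
    return hashlib.sha1(identity_label.encode("utf-8")).hexdigest()


class IndexedSampleDatabase:
    """Hypothetical sample database with a hash-based index table."""

    def __init__(self):
        self._index = {}        # hash value -> identity number
        self._sample_sets = []  # identity number -> sample set

    def add(self, identity_label, samples):
        h = label_hash(identity_label)
        if h not in self._index:
            self._index[h] = len(self._sample_sets)  # assign the next number
            self._sample_sets.append([])
        self._sample_sets[self._index[h]].extend(samples)

    def lookup(self, identity_label):
        h = label_hash(identity_label)           # step 301: hash the label
        identity_number = self._index.get(h)     # step 302: consult the index table
        if identity_number is None:
            return None
        return self._sample_sets[identity_number]  # step 303: fetch the sample set
```

Lookup cost then no longer grows with the number of identity labels stored, which is the efficiency gain the passage above describes.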
Correspondingly, the present invention also provides a webpage information extraction device. Referring to Fig. 4, which shows a schematic diagram of embodiment 1 of the webpage information extraction device, the device may comprise:
An identity tag determining unit 401, configured to determine the identity tag of the webpage to be extracted according to the page information of the webpage to be extracted;
A lookup unit 402, configured to look up, in the sample database, the sample set corresponding to the identity tag of the webpage to be extracted, the sample set comprising at least one Document Object Model (DOM) sample;
A selecting unit 403, configured to select one of the at least one DOM sample as the current DOM sample;
A matching unit 404, configured to match the current DOM sample against the DOM structure parsed from the webpage to be extracted;
A positioning unit 405, configured to, when the matching unit matches successfully, locate the node of the information to be extracted in the DOM structure according to the position of the information to be extracted in the current DOM sample, and obtain the information to be extracted from that node;
A notification unit 406, configured to, when the matching unit fails to match, notify the selecting unit to continue selecting a current DOM sample, and to return an extraction failure once every DOM sample has failed to match.
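The cooperation of the selecting, matching, positioning, and notification units can be sketched as a single loop. The representation below is an assumption made for illustration: a DOM node is a dictionary with `tag`, `text`, and `children` keys, a DOM sample stores a `structure` tree plus a `target_path` of child indexes, and matching is an exact comparison of tag trees. The text itself does not prescribe these formats.

```python
from typing import List, Optional


def matches(sample: dict, dom: dict) -> bool:
    """Return True when the sample's tag tree equals the page's tag tree."""
    if sample["tag"] != dom["tag"]:
        return False
    s_kids = sample.get("children", [])
    d_kids = dom.get("children", [])
    if len(s_kids) != len(d_kids):
        return False
    return all(matches(s, d) for s, d in zip(s_kids, d_kids))


def locate(dom: dict, path: List[int]) -> dict:
    """Follow child indexes recorded in the sample to reach the target node."""
    node = dom
    for i in path:
        node = node["children"][i]
    return node


def extract(dom: dict, sample_set: List[dict]) -> Optional[str]:
    """Try each DOM sample in turn; return None when every sample fails."""
    for sample in sample_set:  # select the current DOM sample
        if matches(sample["structure"], dom):  # match against the page
            return locate(dom, sample["target_path"]).get("text", "")
    return None  # extraction failed: no sample matched


# Hypothetical page and samples.
page = {"tag": "html", "children": [
    {"tag": "body", "children": [
        {"tag": "h1", "text": "Title"},
        {"tag": "div", "text": "Body"},
    ]},
]}
old_layout = {
    "structure": {"tag": "html", "children": [
        {"tag": "body", "children": [{"tag": "p"}]},
    ]},
    "target_path": [0, 0],
}
new_layout = {
    "structure": {"tag": "html", "children": [
        {"tag": "body", "children": [{"tag": "h1"}, {"tag": "div"}]},
    ]},
    "target_path": [0, 1],
}
```

Because the loop falls through to the next sample on a mismatch, a page whose layout has changed can still be handled as long as one sample in the set describes the new layout.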
In order to extract webpage information reliably and accurately with the solution of the present invention, the webpage information extraction device should also be able to create the sample database. Referring to Fig. 5, which shows a schematic diagram of embodiment 2 of the webpage information extraction device, the device further comprises, on the basis of Fig. 4:
A DOM sample extraction unit 501, configured to analyze the DOM structure parsed from a webpage and extract the DOM sample corresponding to the DOM structure;
An organization unit 502, configured to organize the extracted DOM samples into a sample set, using the identity tag of the webpage as the key;
A saving unit 503, configured to establish the correspondence between the sample set and the identity tag, and to save the sample set and the correspondence in the sample database.
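The organizing and saving steps can be sketched as grouping DOM samples under their identity tag. The in-memory dictionary standing in for the sample database, the function name `build_sample_database`, and the string placeholders used for DOM samples are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_sample_database(
    extracted: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Group (identity_tag, dom_sample) pairs into one sample set per tag."""
    database: Dict[str, List[str]] = defaultdict(list)
    for identity_tag, dom_sample in extracted:
        # The identity tag is the key; its value is the sample set.
        database[identity_tag].append(dom_sample)
    return dict(database)


# Hypothetical input: two samples for one site, one for another.
database = build_sample_database([
    ("Sina", "news-article-sample"),
    ("Sina", "blog-post-sample"),
    ("Example", "forum-thread-sample"),
])
```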
One implementation of the DOM sample extraction unit is shown in the schematic diagram of Fig. 6 and comprises:
A judging unit 601, configured to judge whether at least two DOM structures are compatible with one another;
A synthesis unit 602, configured to, when the judging unit judges that the at least two DOM structures are compatible, use wildcards to merge the DOM samples corresponding to the at least two DOM structures into one;
An extraction subunit 603, configured to, when the judging unit judges that the at least two DOM structures are not compatible, extract the DOM sample corresponding to each DOM structure.
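The judge, merge, and extract behaviour can be sketched under a deliberately simple assumption: each DOM structure is reduced to a list of tag names, and two structures count as compatible when they have the same length and differ in at most one position. This compatibility rule, the `*` wildcard symbol, and all function names are hypothetical choices for the example and are not taken from the text.

```python
from typing import List

WILDCARD = "*"


def compatible(a: List[str], b: List[str], max_diff: int = 1) -> bool:
    """Assumed rule: same length and at most max_diff differing positions."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_diff


def merge(a: List[str], b: List[str]) -> List[str]:
    """Replace each differing position with a wildcard."""
    return [x if x == y else WILDCARD for x, y in zip(a, b)]


def extract_samples(structures: List[List[str]]) -> List[List[str]]:
    """Merge compatible structures into one sample; keep the rest separate."""
    samples: List[List[str]] = []
    for structure in structures:
        for i, existing in enumerate(samples):
            if compatible(existing, structure):
                samples[i] = merge(existing, structure)
                break
        else:
            # No compatible sample exists, so this structure gets its own.
            samples.append(list(structure))
    return samples


# Hypothetical structures: the first two differ only in one tag.
a = ["html", "body", "div", "p"]
b = ["html", "body", "section", "p"]
c = ["html", "body", "ul"]
samples = extract_samples([a, b, c])
```

Merging keeps the sample set small, since one wildcard sample stands in for several near-identical page layouts.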
As can be seen from the description of the method embodiments above, if the database stores a large amount of data, the speed at which the lookup unit finds the sample set may suffer. To address this problem, an index table may be established to store the correspondence between the hash value of each identity tag and its identity number. Accordingly, the lookup unit may be implemented as shown in the schematic diagram of Fig. 7, comprising:
A calculation unit 701, configured to calculate the hash value of the identity tag of the webpage to be extracted, as determined by the identity tag determining unit;
An identity number determining unit 702, configured to look up a preset hash table and determine the identity number, corresponding to the hash value, of the identity tag of the webpage to be extracted;
A sample set determining unit 703, configured to determine the corresponding sample set according to the identity number.
The above are merely preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person of ordinary skill in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims (9)

1. A webpage information extraction method, characterized in that the method comprises:
Determining the identity tag of a webpage to be extracted according to the page information of the webpage to be extracted;
Looking up, in a sample database, the sample set corresponding to the identity tag of the webpage to be extracted, the sample set comprising at least one Document Object Model (DOM) sample;
Selecting one of the at least one DOM sample as the current DOM sample, and matching the current DOM sample against the DOM structure parsed from the webpage to be extracted;
If the matching succeeds, locating the node of the information to be extracted in the DOM structure according to the position of the information to be extracted in the current DOM sample, and obtaining the information to be extracted from that node;
If the matching fails, continuing to perform the step of selecting a current DOM sample, and returning an extraction failure once every DOM sample has failed to match.
2. The method according to claim 1, characterized in that the sample database is created by:
Analyzing the DOM structure parsed from a webpage, and extracting the DOM sample corresponding to the DOM structure;
Organizing the extracted DOM samples into a sample set, using the identity tag of the webpage as the key;
Establishing the correspondence between the sample set and the identity tag, and saving the sample set and the correspondence in the sample database.
3. The method according to claim 2, characterized in that extracting the DOM sample corresponding to the DOM structure comprises:
Judging whether at least two DOM structures are compatible with one another; if so, using wildcards to merge the DOM samples corresponding to the at least two DOM structures into one; if not, extracting the DOM sample corresponding to each DOM structure.
4. The method according to claim 2, characterized in that
The identity tag is a site name, a sub-site name, or the signature of the maker of a website-building tool.
5. The method according to any one of claims 1 to 4, characterized in that
Looking up, in the sample database, the sample set corresponding to the identity tag of the webpage to be extracted comprises:
Calculating the hash value of the identity tag of the webpage to be extracted;
Looking up a preset hash table and determining the identity number, corresponding to the hash value, of the identity tag of the webpage to be extracted;
Determining the corresponding sample set according to the identity number.
6. A webpage information extraction device, characterized in that the device comprises:
An identity tag determining unit, configured to determine the identity tag of a webpage to be extracted according to the page information of the webpage to be extracted;
A lookup unit, configured to look up, in a sample database, the sample set corresponding to the identity tag of the webpage to be extracted, the sample set comprising at least one Document Object Model (DOM) sample;
A selecting unit, configured to select one of the at least one DOM sample as the current DOM sample;
A matching unit, configured to match the current DOM sample against the DOM structure parsed from the webpage to be extracted;
A positioning unit, configured to, when the matching unit matches successfully, locate the node of the information to be extracted in the DOM structure according to the position of the information to be extracted in the current DOM sample, and obtain the information to be extracted from that node;
A notification unit, configured to, when the matching unit fails to match, notify the selecting unit to continue selecting a current DOM sample, and to return an extraction failure once every DOM sample has failed to match.
7. The device according to claim 6, characterized in that the device further comprises:
A DOM sample extraction unit, configured to analyze the DOM structure parsed from a webpage and extract the DOM sample corresponding to the DOM structure;
An organization unit, configured to organize the extracted DOM samples into a sample set, using the identity tag of the webpage as the key;
A saving unit, configured to establish the correspondence between the sample set and the identity tag, and to save the sample set and the correspondence in the sample database.
8. The device according to claim 7, characterized in that the DOM sample extraction unit comprises:
A judging unit, configured to judge whether at least two DOM structures are compatible with one another;
A synthesis unit, configured to, when the judging unit judges that the at least two DOM structures are compatible, use wildcards to merge the DOM samples corresponding to the at least two DOM structures into one;
An extraction subunit, configured to, when the judging unit judges that the at least two DOM structures are not compatible, extract the DOM sample corresponding to each DOM structure.
9. The device according to any one of claims 6 to 8, characterized in that the lookup unit comprises:
A calculation unit, configured to calculate the hash value of the identity tag of the webpage to be extracted, as determined by the identity tag determining unit;
An identity number determining unit, configured to look up a preset hash table and determine the identity number, corresponding to the hash value, of the identity tag of the webpage to be extracted;
A sample set determining unit, configured to determine the corresponding sample set according to the identity number.
CN2013103202797A 2013-07-26 2013-07-26 Method and device for extracting webpage information Pending CN103345532A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103202797A CN103345532A (en) 2013-07-26 2013-07-26 Method and device for extracting webpage information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103202797A CN103345532A (en) 2013-07-26 2013-07-26 Method and device for extracting webpage information

Publications (1)

Publication Number Publication Date
CN103345532A true CN103345532A (en) 2013-10-09

Family

ID=49280327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103202797A Pending CN103345532A (en) 2013-07-26 2013-07-26 Method and device for extracting webpage information

Country Status (1)

Country Link
CN (1) CN103345532A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20131009