CN102760150A - Webpage extraction method based on attribute reproduction and labeled path - Google Patents
- Publication number
- CN102760150A (application CN2012100971675A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- name
- property value
- webpage
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a webpage extraction method based on attribute reproduction and labeled path (tag path). The method comprises the following steps: constructing an attribute value seed set, which contains some of the values of each target attribute, by extracting attribute value list pages from the target website; acquiring some sample pages and determining, for each attribute, the relative tag path between the attribute name and the attribute value; downloading part of the pages to build a training sample base and storing the acquired source code in a local database; searching for and labeling all reproductions of each seed attribute value in the training pages and recording the tag path of each reproduction; for each attribute, taking the tag path with the highest page support as the extraction rule for pages outside the training samples; using the selected tag path to traverse the HTML (Hypertext Markup Language) trees of other pages in the target website, locating the tag that holds the attribute value, and extracting its text string; and deleting attribute values that have no attribute name or an incorrect attribute name and storing the correct attribute values in the local database, thereby completing attribute value extraction for the page.
Description
Technical field
The present invention relates to a web page extraction method based on attribute reproduction and tag paths, and in particular to a template-based web page extraction method for websites, such as open-source communities, in which entities rarely recur but attributes recur frequently. It differs from traditional methods that rely on detecting recurring entities.
Background technology
One of the key functions of the Internet is presenting data. It contains information made up of entities from every field. Here, an entity is an object instance in a website's data model and usually corresponds to one web page, for example an electronic product or an open-source project. Extracting this kind of information is valuable for building web applications such as comparison shopping sites and vertical search engines.
Different websites in the same field often hold the same data. For example, a user can find information about an iPod on apple.com, and the same information also appears on amazon.com. Data reproduction across web pages can usually be divided by granularity into two types: entity-level and attribute-level. Here we treat an entity as a set of attributes, each consisting of a name-value pair. Entity-level reproduction means that data on different websites refers to the same conceptual entity; the iPod example above is of this kind. Attribute-level reproduction describes a more common situation, in which some attributes appear on two or more web pages. For example, the SAMSUNG S5830 phone on amazon.com and the HTC h710e on htc.com share the attribute ('operating system', 'Android'), although the two products are different entities. It follows that entity reproduction is a special case of attribute reproduction.
Data reproduction brings new opportunities and challenges to information extraction. Repeated data effectively serves as a shared set of extraction samples across heterogeneous websites: knowing a small part of the repeated data in advance is enough to label a small part of the pages of any such website, mine extraction rules by supervised learning, and then extract information from the remaining pages of the whole site. However, how to obtain the repeated data in advance, how to use it to label web pages automatically, and how to mine rules from the labeled pages all deserve further study.
Earlier experiments verified the effectiveness of entity-level reproduction by extracting restaurant websites and bibliography websites. However, entity-level reproduction is rare in some fields, such as project entities in open-source communities and user profiles in social networks. A branded electronic product is often listed in many online shops, whereas social networking sites seldom hold duplicate user profiles. Likewise, an open-source project normally exists in only one open-source community; a project appears in several communities only in two cases: (1) project migration and (2) project mirroring. When a project migrates, the information about it in the two communities becomes inconsistent over time, and mirroring occurs only for a few mature projects; most open-source projects in incubation have no mirror. In summary, recurring entities are rare in open-source communities, but fortunately we find that attribute-level reproduction is ubiquitous. For example, different open-source projects in different communities may all have the license "GPL" or the programming language "C++". Our method performs extraction by exploiting this kind of attribute reproduction.
In addition, how to abstract the web page template is also an important problem in web page extraction. Some extraction methods give no concrete mathematical model of the web page template and are therefore hard to implement. Others define the template as the strings that remain after the back-end data is removed from the page, but they ignore the tree structure of HTML pages and so cannot locate web page content effectively.
Summary of the invention
The problem to be solved by the present invention is that existing web page extraction techniques suffer from too few recurring entities and from insufficiently effective template abstraction. The invention proposes a more effective and more general web page information extraction method, namely extracting web pages based on attribute reproduction and tag paths. The technical solution of the present invention comprises the following steps:
Step 1, build the seed set. Construct the attribute value seed set, which contains some of the values of each target attribute, by extracting attribute value list pages from the target website or from other websites in the same domain.
Step 2, extract the relative tag path. Obtain some sample pages from the target website and, using an HTML parsing tool with each attribute name and its attribute value as input, look up their corresponding tag nodes and extract, for each attribute in the target website, the relative tag path between the attribute name and the attribute value.
Step 3, build the training sample base. Use a web crawler to download part of the pages of the target website, with the number of samples greater than a preset value, and store the obtained HTML source code in a local database.
Step 4, attribute labeling. Using the seed attribute values in the seed set, apply approximate string matching to the training sample base, find and label every reproduction of each seed attribute value in the training pages, and record the tag path of each reproduction.
Step 5, tag path selection. For each attribute, choose the tag path that occurs most often as the extraction rule for pages outside the training samples.
Step 6, attribute location and extraction. Using the selected tag path, traverse the HTML tree of each of the other pages in the target website from its root node, locate the tag that holds the attribute value, and extract the text string it contains.
Step 7, attribute name verification. Following the attribute name-attribute value relative tag path, obtain the attribute name corresponding to each attribute value from step 6 and compare it with the true attribute name by string matching. Delete attribute values that have no attribute name or a wrong attribute name, and store the correct attribute values in the local database, completing attribute value extraction for the page.
Further, step 4 specifically comprises the following steps:
Step 401, approximate string matching. Convert the two strings being matched to lowercase and generate their q-gram sets, where q is a positive integer. Compute the Jaccard coefficient of the two q-gram sets as the matching score of the two strings; if the score is higher than a predefined threshold, the strings are considered a match.
Step 402, attribute labeling and tag path recording. Based on the result of approximate string matching, label all reproductions of a seed attribute value in the training pages and record the tag path of each reproduction.
With the method of the invention, the recurring attributes in web pages and their corresponding tag paths can be determined effectively, thereby completing the extraction of information from open-source project home pages.
Description of drawings
Fig. 1 is an example of attribute names and attribute values in the open-source community sourceforge.net;
Fig. 2 is the flow chart of web page extraction based on attribute reproduction and tag paths according to the present invention;
Fig. 3 shows an embodiment in which the present invention extracts the values of four attributes (topic, programming language, license, and platform) from open-source project pages in the open-source community sourceforge.net.
Embodiment
Fig. 2 shows the flow chart of the method as implemented in an open-source software acquisition and search system. The specific implementation steps are as follows:
Step 1, build the seed set. More and more websites support heuristic search, which combines querying and browsing in the process of searching the web. Compared with typical keyword search, heuristic search offers the user hierarchical, multidimensional browsing choices; in particular, for users without a clear search target it provides a way to clarify their needs while searching. On a heuristic search page, the attributes of the searched entities are usually displayed as lists of hyperlinks. As shown in Fig. 1, the attribute list on the sourceforge.net website mainly comprises five attributes: Categories, Platform, Dev Status, Programming Language, and License. Below each attribute name are the enumerable values of that attribute; for example, possible values of the topic attribute include "Software Development" and "Internet". A list mining method can extract these attributes. The steps for building the seed attribute set are therefore: first, specify the heuristic search page; second, mine the lists in the page; third, select the attribute lists. If there are only a few attribute values, manual inspection can be used instead. For example, for the attribute set A = {'programming language'}, list mining can build the seed attribute value set SA = {('programming language', {'Ruby', 'JavaScript', 'Java', 'Java Script', ...})}.
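The seed set construction described above can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation: the function name `build_seed_set` and the shape of `mined_lists` (list headings mapped to the items mined from a search page) are assumptions, and the list mining itself is taken as already done.

```python
def build_seed_set(mined_lists, target_attributes):
    """Keep only the mined lists whose heading is a target attribute.

    mined_lists: dict mapping a list heading found on the search page
                 to the list items under it (hypothetical input shape).
    target_attributes: attribute names to extract, i.e. the set A.
    Returns the seed set SA as {attribute name: set of seed values}.
    """
    wanted = {name.lower() for name in target_attributes}
    return {
        heading: set(values)
        for heading, values in mined_lists.items()
        if heading.lower() in wanted
    }


# Example lists as they might be mined from a search page (made-up subset).
mined = {
    "Programming Language": ["Ruby", "JavaScript", "Java", "Java Script"],
    "License": ["GPL"],
}
seed_set = build_seed_set(mined, ["programming language"])
```

Storing the values as a set keeps lookups cheap during labeling and silently drops duplicate list items.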
Step 2, extract the relative tag path. Obtain some sample pages from the target website and use an HTML parsing tool to extract, for each attribute in the target website, the relative tag path between the attribute name and the attribute value.
For example, using the HTML parsing tools HTMLparser and DOM4J and their tag lookup functions, take an attribute name and its attribute value that occur in the page code as input, look up their corresponding tag nodes, and obtain the tag path between the two nodes with the tool's tag path function. Because the pages are all generated from one template, a small number of sample pages (<10) is generally enough to determine a fixed relative tag path.
For example, consider extracting the relative tag path between attribute name and attribute value for each attribute of an open-source project home page on sourceforge.net. Take one such page, A.html, which contains a text fragment with the attribute name "Programming Language" and the attribute value "Delphi/Kylix".
The parsing tool can search by the string contents "Programming Language" and "Delphi/Kylix" and obtain the text tag node that holds each string. With the two text tag nodes as input, the tool's tag path function returns the relative tag path "text () A ()" from "Delphi/Kylix" to "Programming Language". In this notation, the left slash and the right slash represent the two different directions of an edge in the HTML tree.
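One way to compute such a relative path is to walk up from the value node to the lowest common ancestor and then down to the name node. The sketch below uses Python's standard `xml.etree.ElementTree` rather than the HTMLparser and DOM4J tools named in the text. The page fragment is hypothetical (it assumes well-formed markup), and the returned triple of upward tags, common ancestor, and downward tags is an illustrative notation, not the patent's slash notation.

```python
import xml.etree.ElementTree as ET


def find_by_text(root, text):
    """Return the first element whose own text equals `text`."""
    for elem in root.iter():
        if (elem.text or "").strip() == text:
            return elem
    raise ValueError(f"no element with text {text!r}")


def chain_to_root(elem, parent_of):
    """Return [elem, parent, grandparent, ..., root]."""
    chain = [elem]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def relative_tag_path(root, value_text, name_text):
    """Path from the value node up to the common ancestor, then down to the name node."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    value_chain = chain_to_root(find_by_text(root, value_text), parent_of)
    name_chain = chain_to_root(find_by_text(root, name_text), parent_of)
    for depth, node in enumerate(value_chain):
        if node in name_chain:  # lowest common ancestor
            up = [e.tag for e in value_chain[:depth]]
            down = [e.tag for e in reversed(name_chain[:name_chain.index(node)])]
            return up, node.tag, down
    raise ValueError("nodes are not in the same tree")


# Hypothetical fragment; the real sourceforge.net markup is not given in the text.
FRAGMENT = (
    "<section><h3>Programming Language</h3>"
    "<section><a>Delphi/Kylix</a></section></section>"
)
root = ET.fromstring(FRAGMENT)
path = relative_tag_path(root, "Delphi/Kylix", "Programming Language")
```

Because every page comes from the same template, the same triple should hold for any page of the site, which is why a handful of sample pages is enough.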
Step 3, build the training sample base. Use a web crawler to download part of the pages of the target website. To ensure effective training, try to build a training set of more than 1000 samples, and store the HTML source code in a local database.
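A local store for the downloaded pages could look like the following sketch. It uses Python's standard `sqlite3` module; the table name `training_pages`, its columns, and the example URL are assumptions, and the crawler itself is left out.

```python
import sqlite3


def open_sample_db(path=":memory:"):
    """Open (or create) the local database that holds training pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS training_pages ("
        "url TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    return conn


def store_page(conn, url, html):
    """Insert a downloaded page, replacing any earlier copy of the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO training_pages (url, html) VALUES (?, ?)",
        (url, html),
    )
    conn.commit()


def sample_count(conn):
    """Number of stored pages, to compare against the preset minimum."""
    return conn.execute("SELECT COUNT(*) FROM training_pages").fetchone()[0]


conn = open_sample_db()
store_page(conn, "https://example.org/projects/a", "<html>A</html>")
store_page(conn, "https://example.org/projects/b", "<html>B</html>")
store_page(conn, "https://example.org/projects/a", "<html>A v2</html>")
```

Keying on the URL means a page that is crawled twice counts once toward the sample size.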
Step 4, attribute labeling. Using the seed attribute values in the seed set, apply approximate string matching to the training sample base, find and label every reproduction of each seed attribute value in the training pages, and record the tag path of each reproduction.
More specifically, this can be carried out in the following two steps.
Step 401, approximate string matching. Convert the two strings being matched to lowercase and generate their q-gram sets, where q is a positive integer. For example, the 3-gram set of "Windows XP" is {'##w', '#wi', 'win', 'ind', 'ndo', 'dow', 'ows', 'ws#', 's##', '##x', '#xp', 'xp#', 'p##'}. Compute the Jaccard coefficient of the two q-gram sets as the matching score of the two strings. If the score is higher than a predefined threshold, the strings are considered a match.
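The matching in step 401 can be sketched in a few lines of Python. The padding scheme (q-1 '#' characters on each side of every whitespace-separated word) is inferred from the "Windows XP" example; the function names are illustrative, and the threshold is left to the caller because the text does not give a value.

```python
def qgrams(text, q=3, pad="#"):
    """Return the set of q-grams of `text`, padding each word with q-1 pad characters."""
    grams = set()
    for word in text.lower().split():
        padded = pad * (q - 1) + word + pad * (q - 1)
        grams.update(padded[i:i + q] for i in range(len(padded) - q + 1))
    return grams


def jaccard(a, b, q=3):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of the two q-gram sets."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    union = grams_a | grams_b
    if not union:
        return 0.0  # two empty strings: treated as no match (an assumption)
    return len(grams_a & grams_b) / len(union)


def is_match(a, b, threshold, q=3):
    """Strings match when the score is higher than the predefined threshold."""
    return jaccard(a, b, q) > threshold


windows_xp_grams = qgrams("Windows XP")
```

Here `windows_xp_grams` contains the same 13 entries listed in the text, and strings that differ only in letter case score 1.0.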
Step 402, attribute labeling and tag path recording. Apply the approximate string matching algorithm of step 401 to label all reproductions of a seed attribute value in the training pages, and record the tag path of each reproduction.
For example, given the seed attribute value "Kylix/Delphi", the q-gram-based Jaccard string similarity algorithm hits the text tag node "Delphi/Kylix", which is regarded as one reproduction of the seed attribute value "Kylix/Delphi". The tool's tag path function shows that the tag path of "Delphi/Kylix" is "/HTML/BODY/DIV/SECTION/ASIDE/SECTION/SECTION/SECTION/A/text()", and this path is recorded.
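The labeling step can be sketched with Python's standard `html.parser`, which keeps a stack of open tags while text nodes are visited. Several parts are assumptions: the class and function names, the 0.8 threshold, the short list of void tags, and the choice to split words on non-alphanumeric characters so that "Kylix/Delphi" and "Delphi/Kylix" yield the same q-gram set. The sample page is hypothetical and shorter than the real path quoted above.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


def qgrams(text, q=3):
    """q-grams of each word, with words split on non-alphanumerics (an assumption)."""
    grams = set()
    for word in re.split(r"[^0-9a-z]+", text.lower()):
        if word:
            padded = "#" * (q - 1) + word + "#" * (q - 1)
            grams.update(padded[i:i + q] for i in range(len(padded) - q + 1))
    return grams


def is_match(a, b, threshold=0.8):
    """Jaccard score of the q-gram sets, compared with a hypothetical threshold."""
    grams_a, grams_b = qgrams(a), qgrams(b)
    union = grams_a | grams_b
    return bool(union) and len(grams_a & grams_b) / len(union) > threshold


class TextPathCollector(HTMLParser):
    """Collect (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag.upper())

    def handle_endtag(self, tag):
        tag = tag.upper()
        if tag in self.stack:
            while self.stack.pop() != tag:  # also closes unclosed inner tags
                pass

    def handle_data(self, data):
        if data.strip():
            path = "/" + "/".join(self.stack) + "/text()"
            self.text_nodes.append((path, data.strip()))


def label_page(page_source, seed_set):
    """Return (attribute, tag path, text) for each reproduction of a seed value."""
    collector = TextPathCollector()
    collector.feed(page_source)
    collector.close()
    return [
        (attribute, path, text)
        for path, text in collector.text_nodes
        for attribute, seeds in seed_set.items()
        if any(is_match(seed, text) for seed in seeds)
    ]


PAGE = ("<html><body><div><section><a>Delphi/Kylix</a></section>"
        "<p>About this project</p></div></body></html>")
labels = label_page(PAGE, {"Programming Language": {"Kylix/Delphi"}})
```

Running `label_page` over every page in the training base gives the list of labeled paths that the next step counts.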
Step 5, tag path selection. For each attribute, choose the tag path that occurs most often as the extraction rule for pages outside the training samples.
In this embodiment, for each attribute the tag path with the highest support (the most occurrences) across all sourceforge.net pages is chosen as the extraction rule for that attribute on other sourceforge.net open-source project pages. For example, the tag path "/HTML/BODY/DIV/SECTION/ASIDE/SECTION/SECTION/SECTION/A/text()" is matched many times by programming language attribute values on sourceforge.net (such as "C++" and "Perl") and has the highest support, so it is selected as the extraction path for the programming language attribute.
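Choosing the path with the highest support amounts to a frequency count per attribute. A minimal sketch, assuming the labeling step produced a list of (attribute, tag path) pairs; the function name and the sample counts are made up for illustration.

```python
from collections import Counter, defaultdict


def select_extraction_rules(labeled_paths):
    """Map each attribute to the tag path that was recorded most often for it."""
    support = defaultdict(Counter)
    for attribute, tag_path in labeled_paths:
        support[attribute][tag_path] += 1
    return {
        attribute: counts.most_common(1)[0][0]
        for attribute, counts in support.items()
    }


MAIN_PATH = "/HTML/BODY/DIV/SECTION/ASIDE/SECTION/SECTION/SECTION/A/text()"
OTHER_PATH = "/HTML/BODY/DIV/P/text()"  # hypothetical stray match

# Made-up labeling result: three hits on the main path, one elsewhere.
labeled = [
    ("Programming Language", MAIN_PATH),
    ("Programming Language", OTHER_PATH),
    ("Programming Language", MAIN_PATH),
    ("Programming Language", MAIN_PATH),
]
rules = select_extraction_rules(labeled)
```

Stray matches, such as a language name mentioned in a project description, are outvoted as long as most reproductions sit in the template slot.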
Step 6, attribute location and extraction. Using the tag path obtained in step 5, traverse the HTML tree of each of the other pages in the target website from its root node, locate the tag that holds the attribute value, and extract the text string it contains.
For example, the selected tag path is used to traverse the HTML trees of other open-source project pages on sourceforge.net, locate the text tag that holds the attribute value, and extract the text. In this embodiment, for the programming language attribute "Programming Language", suppose step 5 has yielded the most frequent tag path "/HTML/BODY/DIV/SECTION/ASIDE/SECTION/SECTION/SECTION/A/text()". For the project home page B.html of an open-source project B awaiting extraction, this path is given as input to the parsing tool's tag path processing function, which directly returns the text tag nodes "web-based" and "Perl" in B.html that match the path. Because a tag path cannot locate a value uniquely, the erroneous extracted value "web-based" is also returned; it will be deleted in the next step.
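Applying a selected tag path to a new page can be sketched with the limited path syntax of Python's `xml.etree.ElementTree`. This assumes the page is well-formed markup, which real HTML often is not. The page below is a hypothetical stand-in for B.html with a shorter path than the one quoted above, and `extract_by_tag_path` is an illustrative name.

```python
import xml.etree.ElementTree as ET


def extract_by_tag_path(page_source, tag_path):
    """Return the text of every node reached by an absolute tag path."""
    root = ET.fromstring(page_source)
    steps = [step.lower() for step in tag_path.strip("/").split("/")]
    if steps and steps[-1] == "text()":
        steps.pop()  # the final step only says "take the text"
    if not steps or steps[0] != root.tag.lower():
        return []
    nodes = [root] if len(steps) == 1 else root.findall("/".join(steps[1:]))
    return [n.text.strip() for n in nodes if n.text and n.text.strip()]


# Hypothetical stand-in for B.html: two template slots share one tag path.
B_HTML = (
    "<html><body><div>"
    "<section><h3>Categories</h3><section><a>web-based</a></section></section>"
    "<section><h3>Programming Language</h3><section><a>Perl</a></section></section>"
    "</div></body></html>"
)
values = extract_by_tag_path(B_HTML, "/HTML/BODY/DIV/SECTION/SECTION/A/text()")
```

Both "web-based" and "Perl" come back, which illustrates why the path alone cannot tell attributes apart and a verification step is needed.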
Step 7, attribute name verification and removal of false values. Following the attribute name-attribute value relative tag path, obtain the attribute name corresponding to each attribute value from step 6 and compare it with the true attribute name by string matching. Delete attribute values that have no attribute name or a wrong attribute name. Store the correct attribute values that remain in the local database, completing attribute value extraction for the page.
In this embodiment, the attribute name-attribute value relative tag path of the license attribute on sourceforge.net is "text () A ()". Following this path, the attribute names corresponding to the attribute values obtained in step 6 for open-source project B, such as "Perl" and "web-based", are checked by string matching. "Perl" is found to have the correct attribute name "Programming Language" and is kept as an extraction result, whereas "web-based" does not correspond to the correct attribute name and is deleted. Finally, the correct triple of open-source project page file, attribute name, and attribute value (B.html, "Programming Language", "Perl") is stored in the database, completing extraction of the programming language attribute of B.html.
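The verification step can be sketched by walking from each candidate value node to the place where its attribute name should be. In this Python sketch the relative path is encoded as "go up `levels_up` parents, then take the child `name_tag`", which is an assumed simplification of the patent's relative tag path. The page is a hypothetical, well-formed stand-in for B.html, and the comparison is a case-insensitive exact match.

```python
import xml.etree.ElementTree as ET


def attribute_name_for(value_elem, parent_of, levels_up, name_tag):
    """Follow the relative path from a value node to its attribute name, if any."""
    node = value_elem
    for _ in range(levels_up):
        node = parent_of.get(node)
        if node is None:
            return None
    name_elem = node.find(name_tag)
    return (name_elem.text or "").strip() if name_elem is not None else None


def verify_values(page_file, page_source, value_path, levels_up, name_tag, true_name):
    """Keep only values whose attribute name matches; return storable triples."""
    root = ET.fromstring(page_source)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    triples = []
    for elem in root.findall(value_path):
        name = attribute_name_for(elem, parent_of, levels_up, name_tag)
        if name is not None and name.lower() == true_name.lower():
            triples.append((page_file, true_name, (elem.text or "").strip()))
    return triples


# Hypothetical stand-in for B.html.
B_HTML = (
    "<html><body><div>"
    "<section><h3>Categories</h3><section><a>web-based</a></section></section>"
    "<section><h3>Programming Language</h3><section><a>Perl</a></section></section>"
    "</div></body></html>"
)
triples = verify_values(
    "B.html", B_HTML, "body/div/section/section/a", 2, "h3", "Programming Language"
)
```

Only the triple for "Perl" survives; the "web-based" candidate is dropped because the name found along the relative path is not "Programming Language".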
The above embodiment shows that the present invention can generate page templates more efficiently from recurring attributes and tag paths, and that it extracts web page information by extracting the attributes on open-source project home pages.
Finally, it should be noted that the above embodiment only illustrates the technical solution of the present invention and does not limit it. Although the present invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art should understand that the technical solution may be modified or equivalently replaced without departing from its spirit and scope.
Claims (3)
1. A web page extraction method based on attribute reproduction and tag paths, comprising the following steps:
Step 1, build the seed set: construct the attribute value seed set, which contains some of the values of each target attribute, by extracting attribute value list pages from the target website or from other websites in the same domain;
Step 2, extract the relative tag path: obtain some sample pages from the target website and, using an HTML parsing tool with each attribute name and its attribute value as input, look up their corresponding tag nodes and extract, for each attribute in the target website, the relative tag path between the attribute name and the attribute value;
Step 3, build the training sample base: use a web crawler to download part of the pages of the target website, with the number of samples greater than a preset value, and store the obtained HTML source code in a local database;
Step 4, attribute labeling: using the seed attribute values in the seed set, apply approximate string matching to the training sample base, find and label every reproduction of each seed attribute value in the training pages, and record the tag path of each reproduction;
Step 5, tag path selection: for each attribute, take the tag path with the highest page support as the extraction rule for pages outside the training samples;
Step 6, attribute location and extraction: using the selected tag path, traverse the HTML tree of each of the other pages in the target website from its root node, locate the tag that holds the attribute value, and extract the text string it contains;
Step 7, attribute name verification: following the attribute name-attribute value relative tag path, obtain the attribute name corresponding to each attribute value from step 6 and compare it with the true attribute name by string matching; delete attribute values that have no attribute name or a wrong attribute name; store the correct attribute values in the local database, completing attribute value extraction for the page.
2. The method of claim 1, wherein step 4 further comprises:
Step 401, approximate string matching: convert the two strings being matched to lowercase and generate their q-gram sets, where q is a positive integer; compute the Jaccard coefficient of the two q-gram sets as the matching score of the two strings; if the score is higher than a predefined threshold, the strings are considered a match;
Step 402, attribute labeling and tag path recording: based on the result of approximate string matching, label all reproductions of a seed attribute value in the training pages and record the tag path of each reproduction.
3. The method of claim 1, wherein the highest page support means that the attribute occurs at that position the greatest number of times.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100971675A CN102760150A (en) | 2012-04-05 | 2012-04-05 | Webpage extraction method based on attribute reproduction and labeled path |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102760150A true CN102760150A (en) | 2012-10-31 |
Family
ID=47054608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012100971675A Pending CN102760150A (en) | 2012-04-05 | 2012-04-05 | Webpage extraction method based on attribute reproduction and labeled path |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102760150A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101582075A (en) * | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system |
Non-Patent Citations (3)
Title |
---|
HONGZHI WANG 等: "《XCpags:Compression of XML Document with XPath Query Support》", 《PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY:CODING AND COMPUTING(ITCC"04)》, 31 December 2004 (2004-12-31), pages 1 - 5 * |
YANXU ZHU 等: "《Exploiting Attribute Redundancy for Web Entity Data Extraction》", 《ICADL 2011》, 31 December 2011 (2011-12-31), pages 98 - 107 * |
刘云峰: "《一种基于标签路径聚类的文本信息抽取算法》", 《计算机应用与软件》, vol. 27, no. 11, 30 November 2010 (2010-11-30), pages 199 - 202 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103246732B (en) * | 2013-05-10 | 2016-02-24 | 合肥工业大学 | A kind of abstracting method of online Web news content and system |
CN103246732A (en) * | 2013-05-10 | 2013-08-14 | 合肥工业大学 | Online Web news content extracting method and system |
CN104866509A (en) * | 2014-02-26 | 2015-08-26 | 阿里巴巴集团控股有限公司 | Page element positioning method and device |
CN106547761A (en) * | 2015-09-18 | 2017-03-29 | 北京国双科技有限公司 | Data processing method and device |
CN106547761B (en) * | 2015-09-18 | 2020-01-07 | 北京国双科技有限公司 | Data processing method and device |
CN105630941B (en) * | 2015-12-23 | 2018-11-06 | 成都云数未来信息科学有限公司 | Web body matter abstracting methods based on statistics and structure of web page |
CN105630941A (en) * | 2015-12-23 | 2016-06-01 | 成都电科心通捷信科技有限公司 | Statistics and webpage structure based Wen body text content extraction method |
CN106227770A (en) * | 2016-07-14 | 2016-12-14 | 杭州安恒信息技术有限公司 | A kind of intelligentized news web page information extraction method |
CN106227770B (en) * | 2016-07-14 | 2019-06-21 | 杭州安恒信息技术股份有限公司 | A kind of intelligentized news web page information extraction method |
CN107679103A (en) * | 2017-09-08 | 2018-02-09 | 口碑(上海)信息技术有限公司 | For entity attributes analysis method and system |
CN108334560A (en) * | 2018-01-03 | 2018-07-27 | 腾讯科技(深圳)有限公司 | A kind of information acquisition method and relevant device |
US20190303501A1 (en) * | 2018-03-27 | 2019-10-03 | International Business Machines Corporation | Self-adaptive web crawling and text extraction |
US10922366B2 (en) * | 2018-03-27 | 2021-02-16 | International Business Machines Corporation | Self-adaptive web crawling and text extraction |
CN111339457A (en) * | 2018-12-18 | 2020-06-26 | 富士通株式会社 | Method and apparatus for extracting information from web page and storage medium |
CN111339457B (en) * | 2018-12-18 | 2023-09-08 | 富士通株式会社 | Method and apparatus for extracting information from web page and storage medium |
CN109783728A (en) * | 2018-12-29 | 2019-05-21 | 安徽听见科技有限公司 | Page crawler rule update method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102760150A (en) | Webpage extraction method based on attribute reproduction and labeled path | |
US11294968B2 (en) | Combining website characteristics in an automatically generated website | |
US10698960B2 (en) | Content validation and coding for search engine optimization | |
Zheng et al. | A Study of Web Information Extraction Technology Based on Beautiful Soup. | |
CN109033358B (en) | Method for associating news aggregation with intelligent entity | |
CN1934569B (en) | Search systems and methods with integration of user annotations | |
US9734149B2 (en) | Clustering repetitive structure of asynchronous web application content | |
CN108090104B (en) | Method and device for acquiring webpage information | |
CN102982117B (en) | Information search method and device | |
US20170109442A1 (en) | Customizing a website string content specific to an industry | |
CN103294732A (en) | Web page crawling method and spider | |
CN102982118A (en) | Searching method and device based on favorites | |
US20220292160A1 (en) | Automated system and method for creating structured data objects for a media-based electronic document | |
CN106874502A (en) | A kind of method of video search, device and terminal | |
CN101894109A (en) | Database building method and device | |
Ghobadi et al. | An ontology based semantic extraction approach for B2C eCommerce | |
JPWO2003060764A1 (en) | Information retrieval system | |
CN105912573A (en) | Data updating method and data updating device | |
Viljanen et al. | Publishing and using ontologies as mashup services | |
JP5380874B2 (en) | Information retrieval method, program and apparatus | |
Mfenyana et al. | Development of a Facebook crawler for opinion trend monitoring and analysis purposes: case study of government service delivery in Dwesa | |
US9530094B2 (en) | Jabba-type contextual tagger | |
CN104504069A (en) | Building method and device for file index | |
CN104504070A (en) | Search method and device | |
Pan et al. | Automatically maintaining navigation sequences for querying semi-structured web sources |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20121031 |