CN102760150A

CN102760150A - Webpage extraction method based on attribute reproduction and labeled path

Info

Publication number: CN102760150A
Application number: CN2012100971675A
Authority: CN
Inventors: 尹刚; 王怀民; 李翔; 朱沿旭; 史殿习; 王涛; 袁霖; 余跃
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2012-04-05
Filing date: 2012-04-05
Publication date: 2012-10-31

Abstract

The invention discloses a web page extraction method based on attribute reproduction and labeled paths. The method comprises the following steps: constructing an attribute value seed set, which contains some of the values of each target attribute, by extracting attribute value list pages from the target website or from other websites in the same domain; acquiring a small number of sample pages and determining, for each attribute, the relative labeled path between the attribute name and the attribute value; downloading part of the website's pages to build a training sample base and storing the acquired HTML source code in a local database; searching for and labeling every reproduction of each seed attribute value in the training pages and recording the labeled path of each reproduction; for each attribute, taking the labeled path with the highest web page support as the extraction rule for pages outside the training samples; using the selected labeled path to traverse the HTML (Hypertext Markup Language) trees of the other pages in the target website, locating the label that holds the attribute value, and extracting its text string; and deleting attribute values that have no attribute name or an incorrect attribute name and storing the correct attribute values in the local database, thereby completing the extraction of the page's attribute values.

Description

Web page extraction method based on attribute reproduction and tag path

Technical field

The present invention relates to a web page extraction method based on attribute reproduction and tag paths. It is aimed in particular at websites, such as open source communities, in which entities are rarely reproduced but attributes are reproduced frequently, and it differs from traditional template-based extraction methods that rely on reproduced entities.

Background technology

One of the key functions of the Internet is to present data. It contains information made up of entities from every field. Here, an entity is an object instance in a website's data model and usually corresponds to one web page, such as an electronic product or an open source project. Extracting this kind of information is valuable for building web applications such as comparison shopping sites and vertical search engines.

Different websites in the same domain often carry the same data. For instance, a user can find information about an iPod on apple.com, and the same information also appears on amazon.com. Data reproduction across web pages can generally be divided by granularity into two types: reproduction at the entity level and reproduction at the attribute level. Here we regard an entity as a set of attributes, each consisting of a name-value pair. Entity-level reproduction means that data on different websites refer to the same conceptual entity; the iPod example above is of this kind. Attribute-level reproduction describes a more common situation, in which only some attributes appear on two or more web pages. For example, the SAMSUNG S5830 mobile phone on amazon.com and the HTC h710e on htc.com share the attribute ('operating system', 'Android'), although the two products are different entities. It follows that entity reproduction is a special case of attribute reproduction.

Data reproduction brings new opportunities and challenges to information extraction. Repeated data effectively become shared extraction samples across heterogeneous websites: knowing a small fraction of the repeated data in advance is enough to label a small fraction of the pages of any such website, mine extraction rules from them by supervised learning, and then extract information from the remaining pages of the whole website. However, how to obtain the repeated data in advance, how to use them to label web pages automatically, and how to mine rules from the labeled pages are all problems that deserve further study.

Earlier experiments on restaurant websites and bibliography websites verified the effectiveness of methods based on entity-level reproduction. However, entity-level reproduction is quite rare in some domains, such as project entities in open source communities and user profiles in social networks. For electronic products, a product of a given brand is usually listed in many online shops, whereas social networking sites seldom hold duplicate user profiles. Likewise, an open source project normally exists in only one open source community; a project appears in several communities only in two cases: (1) project migration and (2) project mirroring. When a project migrates, the information about it in the two communities becomes inconsistent over time, and mirroring occurs only for a minority of mature projects; most open source projects still in incubation have no mirror. In summary, reproduced entities are rare in open source communities, but fortunately attribute-level reproduction turns out to be ubiquitous. For instance, different open source projects in different communities may all have the license "GPL" or the programming language "C++". The method of the invention exploits exactly this kind of attribute reproduction for extraction.

In addition, how to abstract the web page template is an important issue in web page extraction. Some extraction methods give no concrete mathematical model of the template and are therefore hard to implement; others define the template as the character strings that remain after the back-end data are removed from the page, which ignores the tree structure of HTML pages and so cannot locate page content effectively.

Summary of the invention

The problem to be solved by the present invention is that existing web page extraction techniques suffer from too few reproduced entities and from insufficiently effective template abstraction. The invention proposes a more effective and general web page information extraction method, namely extraction based on attribute reproduction and tag paths. The technical scheme of the present invention comprises the following steps:

Step 1, building the seed set. An attribute value seed set is built by extracting the attribute value list pages in the target website or in other websites of the same domain; the set contains some of the values of each target attribute.

Step 2, extracting the relative tag path. A small number of sample pages are obtained from the target website. Using an HTML parsing tool, with an attribute name and its attribute value as input, the corresponding tag node of each is located, and the relative tag path between attribute name and value is extracted for each attribute in the target website.

Step 3, building the training sample base. A web crawler downloads part of the pages in the target website, with the number of samples greater than a preset value, and the HTML source code obtained is stored in a local database.

Step 4, attribute labeling. String similarity matching is applied to the training sample base using the seed attribute values in the seed set; every reproduction of each seed attribute value in the training pages is found and labeled, and the tag path of each reproduction is recorded.

Step 5, tag path selection. For each attribute, the tag path that occurs most often is chosen as the extraction rule for web pages outside the training samples.

Step 6, attribute location and extraction. Using the tag path obtained, the HTML tree of each of the other pages in the target website is traversed from its root node, the tag that holds the attribute value is located, and the text string it contains is extracted.

Step 7, attribute name verification. Following the relative tag path between attribute name and attribute value, the attribute name corresponding to each attribute value obtained in Step 6 is retrieved and compared with the true attribute name by string matching. Attribute values that have no attribute name or a wrong attribute name are deleted, and the correct attribute values are stored in the local database, completing the extraction of the page's attribute values.

Further, Step 4 specifically comprises the following steps:

Step 401, string similarity matching. The two strings to be matched are converted to a uniform lowercase form, and the q-gram set of each is generated, where q is a positive integer. The Jaccard coefficient of the two q-gram sets is computed as the degree of match between the two strings; if this value is higher than a predefined threshold, the strings are considered to match.

Step 402, attribute labeling and tag path recording. According to the result of string similarity matching, every reproduction of a seed attribute value in the training pages is labeled, and the tag path of each reproduction is recorded.

The method of the invention can effectively determine the attributes reproduced in web pages and their corresponding tag paths, and thereby complete the extraction of open source project home page information.

Description of drawings

Fig. 1 is an example of attribute names and attribute values in the open source community sourceforge.net;

Fig. 2 is a flow chart of web page extraction based on attribute reproduction and tag paths according to the present invention;

Fig. 3 is an embodiment in which the present invention is used to extract the values of four open source attributes (topic, programming language, license, and platform) from open source project pages in the open source community sourceforge.net.

Embodiment

As shown in Fig. 1, the flow chart of the implementation of the open source software acquisition and search system and method, the specific implementation comprises the following steps:

Step 1, building the seed set. More and more websites support heuristic search, which combines querying and browsing in the process of searching online. Compared with typical keyword search, heuristic search offers the user hierarchical, multidimensional browsing choices; for users without a clear search target in particular, it provides a way to clarify their needs while searching. On a heuristic search page, the attributes of the searched entities are usually displayed as lists of hyperlinks. As shown in Fig. 1, the attribute lists on the sourceforge.net website mainly comprise five attributes: Categories, Platform, Dev Status, Programming Language, and License. Below each attribute name are the enumerable values of that attribute; for example, the possible values of the topic attribute include "Software Development" and "Internet". A list mining method can extract these attributes. The steps for building the seed attribute set are therefore: first, specify the heuristic search page; second, mine the lists in the page; third, select the attribute lists. If there are only a few attribute values, manual observation can be used instead. For example, for the attribute set A = {'programming language'}, list mining can construct the seed attribute value set SA = {('programming language', {'Ruby', 'JavaScript', 'Java', 'Java Script', ...})}.

Step 2, extracting the relative tag path. A small number of sample pages are obtained from the target website, and an HTML parsing tool is used to extract the relative tag path between attribute name and value for each attribute in the target website.

For example, using the HTML parsing tools HTMLParser and DOM4J, the tag lookup function takes the attribute name and attribute value that appear in the page code as input and finds the corresponding tag node of each; the tool's tag path function then gives the tag path between the two nodes. Because the pages are all generated from a template, a small number of sample pages (< 10) is generally enough to determine a fixed relative tag path.

For example, to extract the relative tag path between attribute name and value for each attribute of the open source project home pages in sourceforge.net, consider one such page, A.html, which contains the following text fragment,

The parsing tool can search by the string contents "Programming Language" and "Delphi/Kylix" and obtain the string tag node (text tag node) that contains each of them. The tool's tag path function, given these two string tag nodes as input, returns the relative tag path from "Delphi/Kylix" to "Programming Language", "text () A ()". The left slash and right slash in the path denote the two different edge directions in the HTML tree.
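The idea of a relative tag path between a value node and a name node can be sketched as follows. This is a minimal illustration only, not the HTMLParser/DOM4J tag path function named in the text: it assumes a well-formed fragment parsed with Python's `xml.etree.ElementTree`, and it encodes the two edge directions as explicit `"up"` and `"down"` steps instead of slashes. The helper names `find_by_text`, `ancestors`, and `relative_tag_path`, and the sample markup in the usage note, are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_by_text(root, text):
    """Return the first element whose own text equals `text` (hypothetical helper)."""
    return next(e for e in root.iter() if (e.text or "").strip() == text)


def ancestors(node, parents):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def relative_tag_path(root, src, dst):
    """Steps from src up to the lowest common ancestor, then down to dst."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    up, down = ancestors(src, parents), ancestors(dst, parents)
    common = next(n for n in up if n in down)      # lowest common ancestor
    climb = up[:up.index(common)]                  # nodes left while climbing
    descend = down[:down.index(common)][::-1]      # nodes entered while descending
    return [("up", n.tag) for n in climb] + [("down", n.tag) for n in descend]
```

For a made-up fragment such as `<section><h3>Programming Language</h3><a>Delphi/Kylix</a></section>`, calling `relative_tag_path` from the `a` element to the `h3` element yields one upward step out of `a` followed by one downward step into `h3`.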

Step 3, building the training sample base. A web crawler downloads part of the pages in the target website. To ensure effective training, the training set should contain more than 1000 samples where possible, and the HTML source code is stored in the local database.
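A minimal sketch of storing the training pages, assuming SQLite as the local database and a caller-supplied `fetch` function in place of a real crawler. The table name `training_pages`, the function `build_training_set`, and its parameters are hypothetical; only the sample count of 1000 comes from the text.

```python
import sqlite3
from typing import Callable, Iterable, Tuple


def build_training_set(conn: sqlite3.Connection,
                       urls: Iterable[str],
                       fetch: Callable[[str], str],
                       min_samples: int = 1000) -> Tuple[int, bool]:
    """Store the HTML source of each URL and report whether enough samples exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS training_pages ("
        "url TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    for url in urls:
        # fetch() stands in for the crawler; it must return the page's HTML source.
        conn.execute(
            "INSERT OR REPLACE INTO training_pages (url, html) VALUES (?, ?)",
            (url, fetch(url)),
        )
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM training_pages").fetchone()[0]
    # The text asks for more than the preset number of samples.
    return count, count > min_samples
```

Passing the downloader in as a function keeps the storage logic independent of any particular crawling library.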

More specifically, the attribute labeling of Step 4 can be carried out in the following two steps.

Step 401, string similarity matching. The two strings to be matched are converted to a uniform lowercase form, and the q-gram set of each is generated, where q is a positive integer. For example, the 3-gram set of "Windows XP" is {'##w', '#wi', 'win', 'ind', 'ndo', 'dow', 'ows', 'ws#', 's##', '##x', '#xp', 'xp#', 'p##'}. The Jaccard coefficient of the two q-gram sets is computed as the degree of match between the two strings. If this value is higher than a predefined threshold, the strings are considered to match.
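The matching in Step 401 can be sketched as below. The padding with `#` and the per-word q-grams follow the "Windows XP" example; splitting words on whitespace and "/" and the default threshold of 0.8 are assumptions of this sketch, since the text does not specify the tokenization or the threshold value. The function names are hypothetical.

```python
import re


def qgrams(text: str, q: int = 3) -> set:
    """Padded q-grams of every word in the lowercased string."""
    pad = "#" * (q - 1)
    grams = set()
    # Assumption: words are separated by whitespace or "/".
    for token in re.split(r"[\s/]+", text.lower()):
        if token:
            padded = pad + token + pad
            grams.update(padded[i:i + q] for i in range(len(padded) - q + 1))
    return grams


def jaccard(a: str, b: str, q: int = 3) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of the two q-gram sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two empty strings are treated as identical
    return len(ga & gb) / len(ga | gb)


def is_match(a: str, b: str, q: int = 3, threshold: float = 0.8) -> bool:
    """Strings match when the coefficient is higher than the threshold."""
    return jaccard(a, b, q) > threshold
```

Because the q-grams are collected into a set per string, word order does not affect the coefficient, so "Kylix/Delphi" and "Delphi/Kylix" produce identical sets under the tokenization assumed here.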

Step 402, attribute labeling and tag path recording. By applying the string similarity matching algorithm of Step 401, every reproduction of a seed attribute value in the training pages is labeled, and the tag path of each reproduction is recorded.

For example, given the seed attribute value "Kylix/Delphi", the q-gram-based Jaccard string similarity algorithm hits the string tag node "Delphi/Kylix", which is regarded as one reproduction of the seed attribute value "Kylix/Delphi". The tool's tag path function shows that the tag path of "Delphi/Kylix" is "/HTML/BODY/DIV/SECTION/ASIDE/SECTION/SECTION/SECTION/A/text()", and this path is recorded.
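A sketch of Step 402 under stated assumptions: Python's standard `html.parser` stands in for the HTMLParser/DOM4J tools, tag paths are written as uppercase tag names from the root ending in `text()`, and a compact q-gram Jaccard matcher (splitting words on whitespace and "/", threshold 0.8) is assumed. `TextPathParser`, `similar`, and `label_page` are hypothetical names.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextPathParser(HTMLParser):
    """Collects (tag_path, text) for every non-blank text node."""

    def __init__(self):
        super().__init__()
        self.stack, self.nodes = [], []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:          # void elements never hold text
            self.stack.append(tag.upper())

    def handle_endtag(self, tag):
        name = tag.upper()
        if name in self.stack:            # tolerate unclosed child tags
            while self.stack.pop() != name:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/" + "/".join(self.stack) + "/text()", text))


def qgram_set(s, q=3):
    pad = "#" * (q - 1)
    return {(pad + t + pad)[i:i + q]
            for t in re.split(r"[\s/]+", s.lower()) if t
            for i in range(len(t) + q - 1)}


def similar(a, b, threshold=0.8):
    ga, gb = qgram_set(a), qgram_set(b)
    return bool(ga | gb) and len(ga & gb) / len(ga | gb) > threshold


def label_page(html, seeds):
    """Return (attribute, matched_text, tag_path) for each seed reproduction."""
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return [(attr, text, path)
            for path, text in parser.nodes
            for attr, values in seeds.items()
            if any(similar(text, v) for v in values)]
```

Here `seeds` maps each attribute name to its seed values, for instance `{"Programming Language": ["Kylix/Delphi", "Perl"]}`.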

Step 5, tag path selection. In the embodiment, for each attribute, the tag path with the highest support across all sourceforge.net pages (the one that occurs most often) is chosen as the extraction rule for the open source attributes of the other sourceforge.net pages. For example, the tag path "/HTML/BODY/DIV/SECTION/ASIDE/SECTION/SECTION/SECTION/A/text()" is matched many times by programming language attribute values in sourceforge.net (such as "C++" and "Perl") and has the highest support, so this path is selected as the extraction path for the programming language attribute.
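Choosing the most frequent path per attribute amounts to a frequency count. The sketch below assumes the labeling step has produced `(attribute, tag_path)` pairs; the function name `select_rules` is hypothetical.

```python
from collections import Counter, defaultdict


def select_rules(labeled):
    """Map each attribute to the tag path that occurs most often for it.

    `labeled` is an iterable of (attribute, tag_path) pairs, one per
    reproduction found in the training pages.
    """
    counts = defaultdict(Counter)
    for attribute, path in labeled:
        counts[attribute][path] += 1
    # most_common(1) returns the path with the highest count; on a tie,
    # Counter keeps the path that was encountered first.
    return {attr: c.most_common(1)[0][0] for attr, c in counts.items()}
```

The text does not say how ties between equally frequent paths are resolved, so the first-seen behavior here is simply what `Counter.most_common` does.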

Step 6, attribute location and extraction. Using the tag path obtained in Step 5, the HTML tree of each of the other pages in the target website is traversed from its root node, the tag that holds the attribute value is located, and the text string it contains is extracted.

For example, the selected tag path is used to traverse the HTML trees of the other open source project pages in sourceforge.net, locate the text tag that holds the attribute value, and extract the text. In this embodiment, for the programming language attribute "Programming Language", suppose Step 5 has produced the most frequent tag path "/HTML/BODY/DIV/SECTION/ASIDE/SECTION/SECTION/SECTION/A/text()". For the home page B.html of an open source project B to be extracted, this path is given as input to the parsing tool's tag path processing function, which directly returns the text tag nodes in B.html that match the path, "web-based" and "Perl". Because the tag path does not locate a node uniquely, the wrongly extracted value "web-based" is deleted in the next step.
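Extraction by tag path can be sketched with Python's standard `html.parser`, again as a stand-in for the parsing tools named in the text. The class `PathExtractor` and the function `extract` are hypothetical names, and the path syntax (uppercase tag names from the root, ending in `text()`) mirrors the paths quoted above.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathExtractor(HTMLParser):
    """Collects the text of every text node whose tag path equals target_path."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = target_path
        self.stack, self.values = [], []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:          # void elements have no end tag
            self.stack.append(tag.upper())

    def handle_endtag(self, tag):
        name = tag.upper()
        if name in self.stack:            # tolerate unclosed child tags
            while self.stack.pop() != name:
                pass

    def handle_data(self, data):
        text = data.strip()
        path = "/" + "/".join(self.stack) + "/text()"
        if text and path == self.target_path:
            self.values.append(text)


def extract(html, target_path):
    """Return all text strings located at target_path, in document order."""
    parser = PathExtractor(target_path)
    parser.feed(html)
    parser.close()
    return parser.values
```

Since the path carries no sibling positions, every node with the same chain of tag names is returned, which is why values of other attributes can appear among the results and must be filtered in Step 7.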

Step 7, attribute name verification and false value removal. Following the relative tag path between attribute name and attribute value, the attribute name corresponding to each attribute value obtained in Step 6 is retrieved and compared with the true attribute name by string matching; attribute values that have no attribute name or a wrong attribute name are deleted. The correct attribute values that remain are stored in the local database, completing the extraction of one page attribute.

In this embodiment, the relative tag path between attribute name and attribute value for the license attribute in sourceforge.net is "text () A ()". Along this path, the attribute names corresponding to the attribute values of open source project B obtained in Step 6, such as "Perl" and "web-based", are checked by string matching. "Perl" is found to have the correct attribute name "Programming Language" and is kept as an extraction result, whereas "web-based" does not correspond to the correct attribute name, so this attribute value is deleted. Finally, the correct triple of open source project page file, attribute name, and attribute value, (B.html, "Programming Language", "Perl"), is stored in the database, completing the extraction of the programming language attribute of B.html.
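The verification step can be sketched as follows, with several assumptions made only for illustration: the page fragment is well-formed and parsed with `xml.etree.ElementTree`, the relative path from value to name is encoded as "climb `levels_up` parents, then take the child `name_tag`", and names are compared case-insensitively. The function name `verify`, its parameters, and the page layout in the usage note are hypothetical and are not taken from sourceforge.net.

```python
import xml.etree.ElementTree as ET


def verify(root, value_path, levels_up, name_tag, expected_name):
    """Keep only values whose attribute name, found via the relative path, is correct."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parents = {child: parent for parent in root.iter() for child in parent}
    kept = []
    for value_elem in root.findall(value_path):
        node = value_elem
        for _ in range(levels_up):            # climb toward the shared container
            node = parents.get(node)
            if node is None:
                break
        name_elem = node.find(name_tag) if node is not None else None
        name = (name_elem.text or "").strip() if name_elem is not None else ""
        # Values with no attribute name or a wrong attribute name are dropped.
        if name and name.lower() == expected_name.lower():
            kept.append((value_elem.text or "").strip())
    return kept
```

For a made-up layout in which each `section` holds an `h3` name and an `a` value, `verify(root, "section/a", 1, "h3", "Programming Language")` keeps "Perl" and drops a value that sits under a different heading or under no heading at all.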

The above embodiment shows that the present invention can generate page templates more efficiently from reproduced attributes and tag paths, and can extract web page information by extracting the attributes on open source project home pages.

Finally, it should be noted that the above embodiment is intended only to illustrate, not to limit, the technical scheme of the present invention. Although the invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art should understand that the technical scheme may be modified or equivalently replaced without departing from its spirit and scope.

Claims

1. A web page extraction method based on attribute reproduction and tag paths, comprising the following steps:

Step 1, building the seed set: building an attribute value seed set by extracting the attribute value list pages in the target website or in other websites of the same domain, the set containing some of the values of each target attribute;

Step 2, extracting the relative tag path: obtaining a small number of sample pages from the target website and, using an HTML parsing tool with an attribute name and its attribute value as input, locating the corresponding tag node of each and extracting the relative tag path between attribute name and value for each attribute in the target website;

Step 3, building the training sample base: using a web crawler to download part of the pages in the target website, with the number of samples greater than a preset value, and storing the HTML source code obtained in a local database;

Step 4, attribute labeling: applying string similarity matching to the training sample base using the seed attribute values in the seed set, finding and labeling every reproduction of each seed attribute value in the training pages, and recording the tag path of each reproduction;

Step 5, tag path selection: for each attribute, taking the tag path with the highest web page support as the extraction rule for web pages outside the training samples;

Step 6, attribute location and extraction: using the tag path obtained, traversing the HTML tree of each of the other pages in the target website from its root node, locating the tag that holds the attribute value, and extracting the text string it contains;

Step 7, attribute name verification: following the relative tag path between attribute name and attribute value, retrieving the attribute name corresponding to each attribute value obtained in Step 6, comparing it with the true attribute name by string matching, deleting attribute values that have no attribute name or a wrong attribute name, and storing the correct attribute values in the local database, thereby completing the extraction of the page's attribute values.

2. The method of claim 1, wherein Step 4 further comprises:

Step 401, string similarity matching: converting the two strings to be matched to a uniform lowercase form and generating the q-gram set of each, where q is a positive integer; computing the Jaccard coefficient of the two q-gram sets as the degree of match between the two strings; and considering the strings to match if this value is higher than a predefined threshold;

Step 402, attribute labeling and tag path recording: according to the result of string similarity matching, labeling every reproduction of a seed attribute value in the training pages and recording the tag path of each reproduction.

3. The method of claim 1, wherein the highest web page support means that the same attribute occurs at that position the greatest number of times.