CN103605742A

CN103605742A - Method and device for recognizing network resource entity content page

Info

Publication number: CN103605742A
Application number: CN201310589670.7A
Authority: CN
Inventors: 崔华; 肖镜辉
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2013-11-20
Filing date: 2013-11-20
Publication date: 2014-02-26
Anticipated expiration: 2033-11-20
Also published as: CN103605742B

Abstract

The invention discloses a method and a device for recognizing the content page (catalogue page) of a network resource entity. The method includes: acquiring process information about the entity resource webpages, related to network resource entities, that a user opens by clicking while browsing; restoring, according to the process information, the entity access track along which the user accessed a specific network resource entity; and acquiring the starting-point webpage address on the entity access track and determining the content page of the specific network resource entity according to that address. The method improves the extensibility of content page recognition.

Description

Method and device for recognizing the catalogue page of a network resource entity

Technical field

The present invention relates to the field of webpage recognition technology, and in particular to a method and device for recognizing the catalogue page of a network resource entity.

Background technology

A web browser is software that displays files from a web server or file system and lets the user interact with them. It can display text, images, and other information from the World Wide Web or a local area network. The text or images may be hyperlinks to other addresses, and the user browses information by clicking these hyperlinks.

Among the many resources on the Internet there is a special class that is organized in units such as episodes, chapters, and sections, is continuous, and is updated periodically. For example, a TV series may release two episodes a day, and a comic may release one installment a week. For such a resource, each concrete entity generally has a corresponding catalogue page that shows the browsing entry for every unit of the entity. For example, for a comic named "the different energy of Area D field", the catalogue page shows a playback entry for each episode of the comic. Each entry is generally a hyperlink whose anchor text is "Episode 1", "Episode 2", and so on, and clicking an entry jumps to that episode. When the author later updates the comic and a new episode appears, its playback entry is added to the catalogue page. Normally the user has to watch and check the catalogue page actively to learn of the latest updates to the resource.

To reduce this effort, some browsers or browser plug-ins offer an update notification service for network resources. For example, the browser monitors the update status of a network resource in the background and, when an update appears, offers the user a hyperlink to the newest content. The user reaches the latest content by clicking that hyperlink directly, which reduces the number of steps needed to obtain updates. In this way the browser can proactively provide, for example, the latest TV episode or the latest comic chapter.

Obtaining the update status of a network resource requires monitoring the catalogue page of the network resource entity. For the monitoring application, the technical problem to solve is how a program can automatically identify the catalogue page of a network resource entity among a large number of webpages. In the prior art, catalogue pages are generally identified by analyzing the text content of a webpage against the text features of catalogue pages. For example, a catalogue page usually contains regular text such as "Episode **" or "Chapter **", so a webpage can be judged to be the catalogue page of a network resource by checking whether its text contains text matching these patterns. This text-based approach, however, requires rules to be set up in advance, and a webpage whose text does not match the preset rules is filtered out, although such a webpage may in fact still be a catalogue page. The extensibility of the prior art is therefore poor.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method and device for recognizing the catalogue page of a network resource entity that overcome the above problems or at least partly solve them, and that can improve the extensibility of catalogue page recognition.

According to one aspect of the present invention, a method for recognizing the catalogue page of a network resource entity is provided, comprising:

acquiring process information about the entity resource webpages, related to network resource entities, that a user opens by clicking while browsing;

restoring, according to the process information, the entity access track along which the user accessed a specific network resource entity;

acquiring the starting-point webpage address on the entity access track and determining the catalogue page of the specific network resource entity according to that address.

Optionally, the process information comprises the website to which each entity resource webpage belongs, the address of the entity resource webpage, and the address of the referring page (referer) from which the entity resource webpage was opened;

restoring, according to the process information, the entity access track along which the user accessed the specific network resource entity comprises:

dividing the entity resource webpages into a plurality of subsets according to the network resource entity corresponding to each entity resource webpage and the website to which it belongs, wherein each subset contains the entity resource webpages related to the same network resource entity on the same website;

within each subset, restoring, according to the address of each entity resource webpage and the address of its referer, the entity access track along which the user accessed the corresponding network resource entity on the corresponding website;

acquiring the starting-point webpage address on the entity access track comprises:

on one entity access track, comparing the referer address of a target entity resource webpage with the addresses of the other entity resource webpages on the track; if the referer address of the target entity resource webpage is identical to the address of any other entity resource webpage, determining the target entity resource webpage to be a non-starting-point webpage on the track and deleting it from the access track;

repeating the previous step until no entity resource webpage on the track has a referer address identical to the address of another entity resource webpage on the track;

determining the referers of the entity resource webpages remaining on the track to be the starting-point webpages of the entity access track.
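The elimination steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes a track is a list of (url, refer) pairs, and the function name find_start_pages and the example addresses are hypothetical.

```python
from typing import List, Set, Tuple

Visit = Tuple[str, str]  # (address of the page, address of its referer)


def find_start_pages(track: List[Visit]) -> Set[str]:
    """Return the referers left after repeatedly deleting non-start pages.

    Assumption: a page is a non-start page when its referer equals the
    address of another page that is still on the track.
    """
    remaining = list(track)
    changed = True
    while changed:
        changed = False
        for i, (_, refer) in enumerate(remaining):
            other_urls = {u for j, (u, _) in enumerate(remaining) if j != i}
            if refer in other_urls:
                # Opened from another page on the track: delete and rescan.
                del remaining[i]
                changed = True
                break
    # The referers of the surviving pages are the start pages.
    return {refer for _, refer in remaining}
```

For a track in which episode 2 was opened from episode 1 and episodes 1 and 3 were opened from a list page, only the list page survives as a start page. Because pages are deleted one at a time, the result on a track that contains a click loop can depend on the scan order.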

Optionally, dividing the entity resource webpages into a plurality of subsets according to the network resource entity corresponding to each entity resource webpage and the website to which it belongs comprises:

matching the titles of the entity resource webpages against the entity names of network resource entities obtained in advance, using the longest-match method, and dividing the entity resource webpages into a plurality of subsets according to the matching result.
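A minimal sketch of longest matching, assuming that "match" means the entity name occurs as a substring of the page title. The function name longest_match and the entity names are hypothetical.

```python
from typing import Iterable, Optional


def longest_match(title: str, entity_names: Iterable[str]) -> Optional[str]:
    """Return the longest known entity name contained in the title, if any."""
    hits = [name for name in entity_names if name and name in title]
    # Prefer the longest hit so that "Series X Season 2" wins over "Series X".
    return max(hits, key=len) if hits else None
```

Preferring the longest hit keeps a page about a sequel or spin-off from being assigned to the shorter name of the original work.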

Optionally, acquiring the starting-point webpage address on the entity access track and determining the catalogue page of the specific network resource entity according to that address comprises:

acquiring a plurality of starting-point webpages on two or more entity access tracks corresponding to the same network resource entity on the same website;

counting the number of occurrences of each starting-point webpage among the plurality of starting-point webpages, and determining the starting-point webpage whose number of occurrences meets a preset condition to be the catalogue page of the corresponding specific network resource entity on the corresponding website.
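The counting step can be sketched as below. The text does not specify the preset condition, so this sketch assumes one: the most frequent starting-point webpage is accepted if it occurs at least min_count times. The function name and the threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def pick_catalogue_page(start_pages: Iterable[str], min_count: int = 2) -> Optional[str]:
    """Pick the start page that occurs most often across many access tracks.

    Assumed condition: the top candidate must occur at least min_count times.
    """
    counts = Counter(start_pages)
    if not counts:
        return None
    url, occurrences = counts.most_common(1)[0]
    return url if occurrences >= min_count else None
```

Other conditions, such as a minimum share of all tracks, would fit the same structure.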

Optionally, the method further comprises:

after acquiring the plurality of starting-point webpages, judging whether each of them is related to the same network resource entity on the same website, and filtering out the unrelated starting-point webpages.

Optionally, acquiring process information about the entity resource webpages, related to network resource entities, that a user opens by clicking while browsing comprises:

acquiring the addresses of the webpages that the user opens by clicking while browsing, and the addresses of the referers corresponding to those webpages;

filtering the addresses of the clicked webpages and the addresses of the referers with entity names and/or entity resource addresses obtained in advance, to obtain the clicked addresses and referer addresses that match the entity names and/or the entity resource addresses.

Optionally, the method further comprises obtaining the entity resource addresses in advance in the following ways:

extracting the entity resource addresses according to the HTML (HyperText Markup Language) tag code of the hyperlinks in known navigation pages;

and/or,

obtaining, from the user's bookmarks (favorites), the addresses that contain a particular keyword as the entity resource addresses;

and/or,

judging whether the name of a folder in the user's bookmarks contains a particular keyword and, if so, extracting the addresses in that folder as the entity resource addresses;

and/or,

obtaining, as the entity resource addresses, the addresses of websites whose homepage title contains a particular keyword.

Optionally, the entity names are obtained in advance in the following way:

crawling the anchor text of the hyperlinks in known network resource entity index pages;

applying noise-reduction filtering to the anchor text and extracting the entity names from it.
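A minimal sketch of the anchor-text step using Python's standard html.parser module. The noise-reduction rule is an assumption made for illustration (collapse whitespace, drop very short text and a small hypothetical stop list); the text does not say how noise is removed.

```python
from html.parser import HTMLParser
from typing import List, Set

# Hypothetical stop list of navigation words that are not entity names.
NOISE = {"more", "home", "next", "login"}


class AnchorTextParser(HTMLParser):
    """Collects the text found inside <a> ... </a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._buffer: List[str] = []
        self.anchors: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.anchors.append("".join(self._buffer))
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)


def extract_entity_names(index_html: str) -> Set[str]:
    parser = AnchorTextParser()
    parser.feed(index_html)
    parser.close()
    names: Set[str] = set()
    for text in parser.anchors:
        name = " ".join(text.split())  # collapse runs of whitespace
        if len(name) >= 2 and name.lower() not in NOISE:
            names.add(name)  # the set also removes duplicates
    return names
```

A production crawler would need a stronger noise filter, since index pages also link to advertisements, categories, and pagination.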

According to another aspect of the present invention, a device for recognizing the catalogue page of a network resource entity is provided, comprising:

a process information acquiring unit, configured to acquire process information about the entity resource webpages, related to network resource entities, that a user opens by clicking while browsing;

an access track restoring unit, configured to restore, according to the process information, the entity access track along which the user accessed a specific network resource entity;

a catalogue page acquiring unit, configured to acquire the starting-point webpage address on the entity access track and to determine the catalogue page of the specific network resource entity according to that address.

The access track restoring unit comprises:

a subset division subunit, configured to divide the entity resource webpages into a plurality of subsets according to the network resource entity corresponding to each entity resource webpage and the website to which it belongs, wherein each subset contains the entity resource webpages related to the same network resource entity on the same website;

an access track restoring subunit, configured to restore, within each subset and according to the address of each entity resource webpage and the address of its referer, the entity access track along which the user accessed the corresponding network resource entity on the corresponding website;

The catalogue page acquiring unit comprises:

a comparison and deletion subunit, configured to compare, on one entity access track, the referer address of a target entity resource webpage with the addresses of the other entity resource webpages on the track and, if the referer address of the target entity resource webpage is identical to the address of any other entity resource webpage, to determine the target entity resource webpage to be a non-starting-point webpage on the track and delete it from the access track;

a loop control subunit, configured to control the comparison and deletion subunit to repeat until no entity resource webpage on the track has a referer address identical to the address of another entity resource webpage on the track;

a starting-point webpage determining subunit, configured to determine the referers of the entity resource webpages remaining on the track to be the starting-point webpages of the entity access track.

Optionally, the subset division subunit is specifically configured to:

Optionally, the catalogue page acquiring unit is specifically configured to:

Optionally, the device further comprises:

a starting-point webpage filtering unit, configured to judge, after the plurality of starting-point webpages are acquired, whether each of them is related to the same network resource entity on the same website, and to filter out the unrelated starting-point webpages.

Optionally, the process information acquiring unit comprises:

a user click address acquiring subunit, configured to acquire the addresses of the webpages that the user opens by clicking while browsing, and the addresses of the referers corresponding to those webpages;

a user click address filtering subunit, configured to filter the addresses of the clicked webpages and the addresses of the referers with entity names and/or entity resource addresses obtained in advance, to obtain the clicked addresses and referer addresses that match the entity names and/or the entity resource addresses.

Optionally, the device further comprises an entity resource address acquiring unit, configured to obtain the entity resource addresses in the following ways:

and/or,

Optionally, the device further comprises an entity name acquiring unit, configured to obtain the entity names in the following way:

According to the method and device of the embodiments of the present invention, process information about the entity resource webpages, related to network resource entities, that a user opens by clicking while browsing is acquired; the entity access track along which the user accessed a specific network resource entity is restored from this process information; the starting-point webpage address of the track is determined from it; and the catalogue page of the network resource entity is finally determined according to that address. The embodiments thus provide a practicable way for a program to automatically identify the catalogue page of a network resource entity among a large number of webpages. The recognition does not depend on text features of the catalogue page, identifies catalogue pages efficiently and accurately, and has strong extensibility.

Further, the process information about accessed entity resource webpages can be acquired from the browsing of users across the whole network. When the catalogue page of a specific network resource entity is determined from the starting-point webpage addresses on entity access tracks, the access statistics of users across the whole network can be taken into account: the plurality of starting-point webpages acquired for the same network resource entity on the same website are screened according to these statistics, and a starting-point webpage that meets the screening condition is determined to be the catalogue page of that entity on that website. This further improves the accuracy of catalogue page determination.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:

Fig. 1 is a flowchart of a method for recognizing the catalogue page of a network resource entity according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a device for recognizing the catalogue page of a network resource entity according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments fall within the scope of protection of the present invention.

In the embodiments of the present invention, the catalogue page of a network resource entity is discovered from the page-jump paths produced while users access webpages. The principle is as follows: when accessing a network resource entity, a user usually clicks from the catalogue page to a specific chapter page and seldom returns from a chapter page to the catalogue page. Based on this, the catalogue page of a network resource entity can be obtained by statistically analyzing users' access paths to that entity. The implementation is described in detail below.

Referring to Fig. 1, an embodiment of the present invention first provides a method for recognizing the catalogue page of a network resource entity. The method may comprise the following steps:

S110: acquire process information about the entity resource webpages, related to network resource entities, that a user opens by clicking while browsing;

Generally, while a user visits webpages with a browser, the browser or its plug-in can record the user's access to webpages, for example in a browsing log. Suppose the user opens the browser, first opens webpage A in a tab, and then opens webpage B by clicking a link in A. The browser or its plug-in can record information about each visited webpage, such as its address and title. For a webpage such as B that was opened through a link, it can also record which webpage the link jump came from (called the referer). In this way every webpage the user visited is recorded together with its URL, title, and referer. For convenience, each visited webpage can be regarded as a point. At this stage the points are discrete: it is known which webpages the user browsed and some information about each (address, title, referer address, and so on), but the relationship between the points, that is, the order in which the webpages were clicked, is not yet known. A later step restores the user's access track by connecting the points according to the click relationships and then finds the start page of the entity access track. Only some of the webpages a user browses are related to network resource entities, and only after this part is extracted can it be judged which webpage or webpages are catalogue pages of network resource entities.

Specifically, the webpages related to a particular network resource entity can be extracted in several ways. In one way, the entity names of some network resource entities are obtained in advance to form an entity name set. After the information about the webpages a user visited is acquired, the title of each point is checked: if the title contains text matching an entity name in the set, the point is related to accessing a network resource entity and is extracted. The webpages visited by one user may of course involve several different network resource entities. In that case the points corresponding to each entity name are extracted separately, and the entity access track is later restored separately for each entity.

For example, suppose a user visited webpages A, B, C, D, E, and F, and the titles of B, C, and D are found to contain a certain entity name; then B, C, and D are taken as the points related to that network resource entity. If the titles of E and F contain another entity name, E and F are taken as the points related to that other entity, and so on. In short, a set of points can be collected for each entity name. These points need not have been recorded in the same browsing session; they can be combined for joint statistics.
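The A to F example can be reproduced with a short sketch. It assumes each point is a (page, title) pair and that a title belongs to an entity when it contains the entity name; the function name, titles, and entity names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def group_points_by_entity(
    points: Iterable[Tuple[str, str]], entity_names: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each entity name to the pages whose title contains that name."""
    # Try longer names first so overlapping names resolve to the longest one.
    ordered = sorted(entity_names, key=len, reverse=True)
    groups: Dict[str, List[str]] = defaultdict(list)
    for page, title in points:
        for name in ordered:
            if name in title:
                groups[name].append(page)
                break  # a page is assigned to one entity at most
    return dict(groups)


visited = [
    ("A", "Portal home"),
    ("B", "Series X episode 1"),
    ("C", "Series X episode 2"),
    ("D", "Series X episode 3"),
    ("E", "Comic Y chapter 7"),
    ("F", "Comic Y chapter 8"),
]
groups = group_points_by_entity(visited, ["Series X", "Comic Y"])
```

Page A matches no entity name and is left out, which corresponds to discarding webpages unrelated to any network resource entity.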

To judge whether the title of a point contains an entity name, an entity name set containing a plurality of entity names can be built in advance. It can then be checked directly whether the title of each point contains text matching an entry in the set; if so, the title contains an entity name. One way to build the set is to crawl the anchor text of the hyperlinks in known network resource entity index pages, apply noise-reduction filtering to the anchor text, and extract entity names from it. The reason is that links to network resource entities on a webpage generally use the entity name as anchor text. For example, a navigation webpage may contain links to many TV series, and each link generally uses the name of the series as its anchor text so that users can recognize it; such a navigation webpage is commonly called the index page of network resource entities. After clicking the link of a TV series in the index page, the user enters the catalogue page of that series, which shows the playback entry for each episode. Entity names can therefore be obtained by extracting the anchor text of these links: the content of certain network resource entity index pages is crawled, the anchor text in it is extracted, and the extracted anchor text is collected as entity names. After a large number of entity names are collected and processed by deduplication and noise reduction, the entity name set is built and used to screen user data, or to screen the addresses on webpage access tracks. Specifically, the acquired entity names are matched against the titles of the webpages the user opened by clicking and the titles of their referers. If the title of a clicked webpage or of its referer contains an entity name, the address is considered related to a network resource entity and is retained; otherwise the address is filtered out.

When a user accesses entity resource webpages related to a network resource entity, the address of each entity resource webpage and the address of its referer can likewise be recorded. Information such as the website to which the entity resource webpage belongs can also be obtained, and all of this can be used to restore the process by which the user accessed the network resource entity. To acquire the process information, the addresses of the webpages the user opened by clicking while browsing, and the addresses of the corresponding referers, are acquired first. These addresses are then filtered with the entity names and/or entity resource addresses obtained in advance, which yields the clicked addresses and referer addresses that match the entity names and/or the entity resource addresses.

For example, the entity resource addresses related to network resource entities can be obtained in advance, by any one or a combination of the following ways:

(1) Extract entity resource addresses according to the HTML (HyperText Markup Language) tag code of the hyperlinks in known navigation pages. Pages with a navigation function generally sort webpages into categories, so the addresses of a particular category can be extracted from the HTML tags of a known navigation page of that category. For example, addresses of novel webpages can be obtained by crawling the HTML of "http://123.sogou.com/xiaoshuo/" (Sogou's novel navigation page) and taking all URLs under <a> tags as novel addresses. For the same category, addresses can also be crawled from several different navigation pages and deduplicated to obtain the final set of novel webpage addresses;

(2) From the user's bookmarks (favorites), obtain the addresses that contain a particular keyword as entity resource addresses. For example, if the title of a bookmarked address contains the word "comic", that address is taken as an entity resource address. The user's bookmarks may include local favorites and network (synced) favorites;

(3) Judge whether the name of a folder in the user's bookmarks contains a particular keyword and, if so, extract the addresses in that folder as entity resource addresses. For example, if a folder name contains the word "comic", the bookmarked addresses under that folder are taken as entity resource addresses. The user's bookmarks may include local favorites, network favorites, and so on. Some users sort their bookmarks by category, so the bookmarks contain several folders, each named after a category of webpage; addresses of a given category can therefore be extracted from the folder with the corresponding name. In practice the collection of addresses by category is generally carried out on the server, whereas a user's bookmarks may be stored only locally on the user's terminal device. The bookmark information can therefore be uploaded to the server for analysis, provided the user's permission is obtained. Users may also use a network favorites function to share bookmarks between terminal devices: the bookmark information is synchronized to the server, and the user synchronizes bookmarks on each device by logging in to their account. When a user uses this function, the server can read the bookmark information directly and extract and collect the addresses of a particular category.

(4) Obtain, as entity resource addresses, the addresses of websites whose homepage title contains a particular keyword. For example, if the title of a website's homepage contains a keyword such as "comic" or "novel", the address of the website is taken as an entity resource address. That is, a website that provides network resource entities and their catalogue pages generally includes a keyword for that kind of network resource entity in its homepage title, so the address of the website can be taken as an entity resource address.
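The four collection modes above can be sketched with the Python standard library. The data shapes are assumptions made for illustration: navigation pages arrive as HTML strings, bookmarks as a mapping from folder name to (title, url) pairs, and homepages as a mapping from site address to homepage title. All function names are hypothetical, and keyword matching is assumed to be case-insensitive substring matching.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable, List, Sequence, Set, Tuple


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def _has_keyword(text: str, keywords: Sequence[str]) -> bool:
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)


def from_navigation_pages(pages_html: Iterable[str]) -> Set[str]:
    """Mode (1): URLs under <a> tags, deduplicated across navigation pages."""
    found: Set[str] = set()
    for html in pages_html:
        collector = HrefCollector()
        collector.feed(html)
        collector.close()
        found.update(collector.hrefs)
    return found


def from_bookmarks(
    folders: Dict[str, List[Tuple[str, str]]], keywords: Sequence[str]
) -> Set[str]:
    """Modes (2) and (3): keyword in the bookmark title or in the folder name."""
    found: Set[str] = set()
    for folder_name, entries in folders.items():
        folder_hit = _has_keyword(folder_name, keywords)
        for title, url in entries:
            if folder_hit or _has_keyword(title, keywords):
                found.add(url)
    return found


def from_homepage_titles(homepages: Dict[str, str], keywords: Sequence[str]) -> Set[str]:
    """Mode (4): sites whose homepage title contains a keyword."""
    return {site for site, title in homepages.items() if _has_keyword(title, keywords)}
```

The union of the three result sets gives one combined address set, which matches the statement that the modes can be used alone or together.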

In practice, the addresses of the webpages the user opened by clicking and of their referers can be filtered with only the acquired entity names or only the entity resource addresses. They can also be filtered with both at once; this dual filtering gives a better result and reduces the amount of data handled in later steps.
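A minimal sketch of single and dual filtering. It rests on assumptions not stated in the text: an entity name matches when it occurs in the title of the clicked webpage or of its referer, an address matches when it starts with a known entity resource address, and dual filtering means both conditions must hold. The record type Click and the function filter_clicks are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Sequence


class Click(NamedTuple):
    url: str          # address of the clicked webpage
    title: str        # title of the clicked webpage
    refer: str        # address of the referer
    refer_title: str  # title of the referer


def filter_clicks(
    clicks: Iterable[Click],
    entity_names: Sequence[str] = (),
    resource_addresses: Sequence[str] = (),
    require_both: bool = False,
) -> List[Click]:
    """Keep the clicks that match entity names and/or entity resource addresses."""
    kept: List[Click] = []
    for c in clicks:
        name_hit = any(n in c.title or n in c.refer_title for n in entity_names)
        # Assumption: prefix comparison stands in for "the address matches".
        addr_hit = any(
            c.url.startswith(a) or c.refer.startswith(a) for a in resource_addresses
        )
        matched = (name_hit and addr_hit) if require_both else (name_hit or addr_hit)
        if matched:
            kept.append(c)
    return kept
```

With require_both=True fewer records survive, which is the trade-off described above: a cleaner input for track restoration at the cost of dropping pages that match only one criterion.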

S120: restore the entities access track that user accesses particular network resource entity according to described procedural information;

In getting user's browsing page process, point out after the procedural information of the actual resource webpage relevant to Internet resources entity, can according to procedural information also original subscriber access the entities access track of particular network resource entity.Wherein, procedural information can comprise the address of actual resource webpage, be equivalent to know in the webpage that user points out which webpage relevant to entity, but these relevant webpages are generally the concrete a certain concrete collection of drama of Internet resources entity or the webpage at a certain chapter content place.For example, user opens after the catalogue page A of certain entity, therefrom clicks the first collection of certain serial, has browsed the particular content of this entity the first collection in the webpage B pointing out, and now, webpage B also belongs to the webpage relevant to Internet resources entity.Therefore, next need the thing of doing to be exactly, according to these webpages relevant to Internet resources entity, determine the webpage of the catalogue page that may be Internet resources entity.For this reason, in embodiments of the present invention, entities access track in the time of can first accessing particular network resource entity to user reduces, also how judge each webpage relevant to Internet resources entity is clicked by user, and determine the starting point webpage network address on this track, the corresponding webpage of this starting point webpage network address may be just the catalogue page of Internet resources entity, certainly, may there is certain factors such as contingency in the data of unique user, last comprehensively other users' data gather judgement, therefore the catalogue page obtaining from a certain user can be temporarily as the alternative catalogue page of Internet resources entity.

The entity access track can be expressed as a set of tuples, for example as a set of 2-tuples:

{(url_1, refer_1), (url_2, refer_2), …, (url_i, refer_i), …}

Here url denotes the address of an entity resource webpage, and refer denotes the address of the referring page (referer) from which that entity resource webpage was opened. The subscripts of url and refer do not indicate the order in which the webpages were accessed.

In the embodiment of the present invention, to make it easier to obtain the catalogue page of a network resource entity and to identify the catalogue pages of different websites and different network resource entities, the entity access track can instead be expressed as a set of 4-tuples. Specifically, when the entity access track of the user accessing a specific network resource entity is restored, each discrete point can first be represented by a 4-tuple (url, refer, entity, site), where url denotes the address of the entity resource webpage; refer denotes the address of the referring page from which the entity resource webpage was opened; entity denotes the identifier of the network resource entity, which may be its name, such as the title of a TV series, a comic or a novel; and site denotes the website to which the entity resource webpage belongs. For example, suppose a user accessed a webpage B whose title contains an entity name M, and the user arrived at B by clicking a link in webpage A. In the 4-tuple of node B, url is the address of webpage B itself, refer is the address of webpage A, entity can be represented by the entity name M, such as "One Piece" or "Naruto", and site is the website to which webpage B belongs.

After each node has been represented by a 4-tuple in this way, the entity resource webpages can be divided into multiple subsets according to the network resource entity and the website to which each entity resource webpage corresponds, where each subset contains the entity resource webpages related to the same network resource entity under the same website. Within one subset, the entity access track of the user accessing the corresponding network resource entity under the corresponding website is restored from the address of each entity resource webpage and the address of its referring page. The entity access track of a user accessing the same network resource entity under the same website can be expressed as:

Set(same(entity, site)){(url_1, refer_1), (url_2, refer_2), …, (url_i, refer_i), …}

As before, the subscripts of url and refer carry no order information: the order of the subscripts does not represent the order in which the user accessed the entity resource webpages, and the points expressed at this stage may still be discrete.
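
The division into subsets described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Visit` type, the `partition_visits` function and the `example` addresses are hypothetical names introduced here, and the sketch assumes the entity and site fields have already been filled in for each visited page.

```python
from collections import defaultdict
from typing import NamedTuple


class Visit(NamedTuple):
    """One discrete point of the track, as a 4-tuple (hypothetical type)."""
    url: str     # address of the entity resource webpage
    refer: str   # address of the referring page it was opened from
    entity: str  # identifier (here: name) of the network resource entity
    site: str    # website the entity resource webpage belongs to


def partition_visits(visits):
    """Group visits into subsets keyed by (entity, site).

    Each subset keeps only the (url, refer) 2-tuples, in input order;
    the order carries no meaning for the later steps.
    """
    subsets = defaultdict(list)
    for v in visits:
        subsets[(v.entity, v.site)].append((v.url, v.refer))
    return dict(subsets)


# Illustrative data only; these addresses are placeholders.
visits = [
    Visit("http://site-a.example/ep1", "http://site-a.example/list", "Show M", "site-a.example"),
    Visit("http://site-b.example/c1", "http://site-b.example/toc", "Comic N", "site-b.example"),
    Visit("http://site-a.example/ep2", "http://site-a.example/ep1", "Show M", "site-a.example"),
]
subsets = partition_visits(visits)
```

Keying the dictionary on the (entity, site) pair keeps pages of the same entity on different websites apart, which matches the requirement that a catalogue page is determined per entity and per website.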

When the entity resource webpages are divided into subsets according to their network resource entity and website, the entity names of network resource entities obtained in advance can be matched against the titles of the entity resource webpages using longest matching, and the webpages are divided into subsets according to the match result. For example, suppose there are two entity names, "will like" and "love is carried through to the end", where in the original titles the first name is contained in the second. Without longest matching, matching with "will like" would hit titles carrying either name, which does not reflect the actual situation. Matching the titles with the longest-matching method distinguishes entity names that contain one another.
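
One simple way to realize longest matching is sketched below. The function name `longest_match` and the sample names and titles are assumptions made for illustration; the source does not specify the matching algorithm beyond preferring the longest entity name.

```python
def longest_match(title, entity_names):
    """Return the longest known entity name contained in the title, or None."""
    candidates = [name for name in entity_names if name in title]
    return max(candidates, key=len) if candidates else None


# Hypothetical entity names where one is a prefix of the other.
names = ["Love", "Love to the End"]

long_hit = longest_match("Love to the End, Episode 3", names)
short_hit = longest_match("Love, Episode 1", names)
no_hit = longest_match("Evening News", names)
```

Both names occur in the first title, and the longer one is chosen, so the page is assigned to a single entity instead of two.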

S130: obtain the starting-point webpage address on the entity access track, and determine the catalogue page of the specific network resource entity according to the starting-point webpage address on the entity access track.

When the starting-point webpage address on the entity access track is obtained and the catalogue page of the specific network resource entity is determined from it, the processing can specifically be as follows:

On an entity access track, the referer address corresponding to a target entity resource webpage is compared with the addresses of the other entity resource webpages on the track. If the referer address of the target entity resource webpage is identical to the address of any other entity resource webpage, the target entity resource webpage is determined to be a non-starting-point webpage on the entity access track and is deleted from the track;

The referring pages corresponding to the entity resource webpages remaining on the entity access track are determined to be the starting-point webpages on the entity access track. This process is described in detail below.

On an entity access track, whenever url_i = refer_j holds for any two 2-tuples, the 2-tuple containing refer_j is deleted. That is, suppose the webpages related to some entity are A, B, C and D, where A was opened from a search results page X, B was opened from a link on webpage A, C from a link on webpage B, and D from a link on webpage C. B, C and D all contain the same entity name, and these three webpages belong to the same website. We then have (url_1 = B, refer_1 = A), (url_2 = C, refer_2 = B), (url_3 = D, refer_3 = C), so that url_1 = refer_2 and url_2 = refer_3. B, C and D therefore lie on the same entity access track, and the track of this user accessing this network resource entity can be restored as A->B->C->D. The points C and D, whose referring pages are refer_2 and refer_3, can be deleted, and webpage A, the referring page of the remaining point B, can serve as a starting-point webpage on the entity access track and be taken as a candidate catalogue page, and so on. In this way, multiple candidate catalogue pages under the same website may be extracted from the webpages that the same user visited for the same entity.

That is, for a given user, all webpages visited during browsing form a full set I. After the set M of webpages related to network resource entities has been filtered out of it, the webpages in M can be divided by website into subsets M_i (i = 1, 2, 3, …, up to the number of websites contained in M), and each subset M_i can be further divided by entity into subsets M_ij (j = 1, 2, 3, …, up to the number of entities contained in M_i). Each subset M_ij then contains exactly the webpages related to one network resource entity under one website. Within each subset M_ij, each webpage is expressed as a 2-tuple as described above. If the referring page refer of one 2-tuple is identical to the url of another 2-tuple, that referring page cannot be a starting-point webpage on the entity access track, so the 2-tuple can be deleted. In the end, some 2-tuples remain in each subset M_ij, and the referring pages of these remaining 2-tuples can serve as candidate catalogue pages of the corresponding network resource entity under the corresponding website.
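
The deletion rule can be illustrated with a short sketch over one subset M_ij. The function name `start_pages` is hypothetical, and the sketch makes one assumption the text leaves open: every refer is compared against the urls of the original, undeleted subset, which reproduces the A->B->C->D example above.

```python
def start_pages(pairs):
    """Return candidate starting-point pages for one (entity, site) subset.

    pairs is a list of (url, refer) 2-tuples. A 2-tuple is dropped when its
    refer equals the url of some 2-tuple in the subset; the refer values of
    the 2-tuples that remain are the starting-point webpages.
    """
    urls = {url for url, _ in pairs}
    return [refer for _, refer in pairs if refer not in urls]


# The example from the text: B opened from A, C from B, D from C.
pairs = [("B", "A"), ("C", "B"), ("D", "C")]
starts = start_pages(pairs)

# Two separate visits to the same entity yield two candidates.
two_tracks = start_pages([("B", "A"), ("C", "B"), ("F", "E")])
```

Because the comparison uses set membership, the result does not depend on the order of the 2-tuples, which fits the statement that the subscripts carry no order information.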

Of course, in practical applications there may be other ways to restore the entity access path, for example by also using the access time of each webpage.

Through the above steps, the candidate catalogue pages of each network resource entity that a user accessed under each website can be collected for that user, and the same can be done for every other user. The candidate catalogue pages collected from these users can then be aggregated to finally determine the catalogue page of a network resource entity. Specifically, when aggregating, the multiple starting-point webpages on two or more entity access tracks corresponding to the same network resource entity of the same website can be obtained, and the candidate catalogue pages of each user for the same network resource entity under the same website can be represented by the following set of 2-tuples:

Set(same(entity, site)){user_1(url_i, refer_i), user_2(url_i, refer_i), user_3(url_i, refer_i), …, user_n(url_i, refer_i)}

Then the number of occurrences of each starting-point webpage among these starting-point webpages is counted, and the starting-point webpage with the most occurrences, or whose share of the total count meets a preset condition, is determined to be the catalogue page of the network resource entity under the website. For example, a candidate catalogue page whose occurrences account for more than 50% of the total is taken as the catalogue page of the network resource entity under the website.
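
The counting step can be sketched as below, using the 50% share from the example as the preset condition. The function name `pick_catalogue_page`, the `min_share` parameter and the placeholder addresses are assumptions for illustration; other preset conditions would work the same way.

```python
from collections import Counter


def pick_catalogue_page(candidates, min_share=0.5):
    """Pick the catalogue page for one (entity, site) from all users' candidates.

    candidates is the list of starting-point webpage addresses collected
    across users. The most frequent address is returned if its share of the
    total exceeds min_share; otherwise None is returned.
    """
    if not candidates:
        return None
    url, count = Counter(candidates).most_common(1)[0]
    return url if count / len(candidates) > min_share else None


# Placeholder data: one address appears 8 times out of 10.
candidates = ["http://site.example/comic/1/"] * 8 + [
    "http://site.example/comic/1/p2",
    "http://site.example/comic/1/p5",
]
chosen = pick_catalogue_page(candidates)
undecided = pick_catalogue_page(["http://a.example/", "http://b.example/"])
```

Returning None when no candidate clears the threshold leaves the decision open until more user data has been gathered, rather than forcing a choice on weak evidence.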

For example, suppose Table 1 below shows the webpages visited by one user taking part in a browser experience program:

Table 1

url	refer
http://www.dm5.com/m129008/	http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/m129008-p2/	http://www.dm5.com/m129008/
http://news.baidu.com/	http://www.baidu.com/
http://www.dm5.com/m129008-p3/	http://www.dm5.com/m129008-p2/
http://www.dm5.com/m129008-p4/	http://www.dm5.com/m129008-p3/
……	……

Here, http://news.baidu.com/ and http://www.baidu.com/ are webpages unrelated to any network resource entity and are therefore deleted from the webpage set. Among the remaining webpages, http://www.dm5.com/m129008/, http://www.dm5.com/m129008-p2/, http://www.dm5.com/m129008-p3/ and http://www.dm5.com/m129008-p4/ are webpages related to accessing the comic entity "the different energy of Area D field". The path along which this entity was accessed can be restored as:

http://www.dm5.com/manhua-area-d-yinenglingyu/

->http://www.dm5.com/m129008/

->http://www.dm5.com/m129008-p2/

->http://www.dm5.com/m129008-p3/

……

The starting-point webpage on this path, "http://www.dm5.com/manhua-area-d-yinenglingyu/", can then be taken as a candidate catalogue page. The webpage sets visited by other users can be processed in the same way, so that for the same network resource entity the candidate catalogue pages under each website can be collected from each user's webpage set. For example, from each user's access path to the comic entity "the different energy of Area D field", the candidate catalogue pages of this comic are obtained as shown in Table 2:

Table 2

http://www.dm5.com/m115189-p5/
http://www.dm5.com/m115725/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/list_66230.html
http://www.imanhua.com/comic/3401/list_66401.html?p=17

As can be seen, multiple candidate catalogue pages have been collected for this comic entity under each of the two websites www.dm5.com and www.imanhua.com. Under www.dm5.com, http://www.dm5.com/manhua-area-d-yinenglingyu/ occurs 8 times out of a total of 10 candidate catalogue pages under this website, so it can be determined to be the catalogue page of this comic entity under www.dm5.com. Similarly, http://www.imanhua.com/comic/3401/ can be determined to be the catalogue page of this comic entity under www.imanhua.com.

In addition, to further optimize the result, the candidate catalogue pages collected from the users can first be filtered. After the multiple starting-point webpages have been obtained, it is judged whether each of them is related to the same network resource entity of the same website, and unrelated starting-point webpages are filtered out. In a specific implementation, it can be judged whether a candidate catalogue page is an address related to a search website, the address of a website homepage, or the like; if so, the candidate catalogue page or pages can be determined to be unrelated to the network resource entity and are filtered out.
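
A possible form of this filter is sketched below. The host list `SEARCH_HOSTS`, the function names and the homepage test (empty path and no query string) are assumptions chosen for illustration; the source only states that search-related addresses and website homepages are to be removed.

```python
from urllib.parse import urlsplit

# Assumed list of search-site hosts; a real deployment would maintain its own.
SEARCH_HOSTS = {"www.baidu.com", "www.sogou.com"}


def is_irrelevant(url, search_hosts=SEARCH_HOSTS):
    """Return True for search-site addresses and bare website homepages."""
    parts = urlsplit(url)
    is_search = parts.hostname in search_hosts
    is_homepage = parts.path in ("", "/") and not parts.query
    return is_search or is_homepage


def filter_candidates(urls):
    """Keep only the candidate catalogue pages that pass the filter."""
    return [u for u in urls if not is_irrelevant(u)]


kept = filter_candidates([
    "http://www.baidu.com/s?wd=comic",
    "http://www.dm5.com/",
    "http://www.dm5.com/manhua-area-d-yinenglingyu/",
])
```

Only the third address survives: the first belongs to a search site and the second is a website homepage.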

In summary, in the embodiment of the present invention, the process information about the entity resource webpages related to network resource entities that the user clicks open while browsing can be obtained, the entity access track of the user accessing a specific network resource entity is restored from this process information, the starting-point webpage address of the entity access track is determined from it, and finally the catalogue page of the network resource entity is determined according to this starting-point webpage address. The embodiment of the present invention thus provides a practicable way for a program to automatically identify the catalogue page of a network resource entity among a large number of webpages. The identification does not depend on text features of the catalogue page or the like, identifies the catalogue page of a network resource entity efficiently and accurately, and is highly extensible.

Corresponding to the method for identifying a catalogue page of a network resource entity provided by the embodiment of the present invention, the embodiment of the present invention also provides a device for identifying a catalogue page of a network resource entity. Referring to Fig. 2, the device may specifically include:

Process information acquiring unit 210, configured to obtain the process information about entity resource webpages related to network resource entities that a user clicks open while browsing webpages;

Access track restoring unit 220, configured to restore, according to the process information, the entity access track of the user accessing a specific network resource entity;

Catalogue page acquiring unit 230, configured to obtain the starting-point webpage address on the entity access track and determine the catalogue page of the specific network resource entity according to the starting-point webpage address on the entity access track.

The process information about accessing the entity resource webpages related to network resource entities may include the website to which an entity resource webpage belongs, the address of the entity resource webpage, and the address of the referring page (referer) from which the entity resource webpage was opened;

Under this implementation, the access track restoring unit 220 may include:

Subset division subunit, configured to divide the entity resource webpages into multiple subsets according to the network resource entity and the website to which each entity resource webpage corresponds, where each subset contains multiple entity resource webpages related to the same network resource entity under the same website;

Access track restoring subunit, configured to restore, within one subset, the entity access track of the user accessing the corresponding network resource entity under the corresponding website according to the address of each entity resource webpage and the address of its referring page;

The catalogue page acquiring unit 230 may include:

Comparison and deletion subunit, configured to compare, on an entity access track, the referer address corresponding to a target entity resource webpage with the addresses of the other entity resource webpages on the track, and, if the referer address of the target entity resource webpage is identical to the address of any other entity resource webpage, to determine the target entity resource webpage to be a non-starting-point webpage on the entity access track and delete it from the track;

Loop control subunit, configured to control the comparison and deletion subunit to repeat its operation until no entity resource webpage on the entity access track has a referer address identical to the address of another entity resource webpage;

Starting-point webpage determining subunit, configured to determine the referring pages corresponding to the entity resource webpages remaining on the entity access track to be the starting-point webpages on the entity access track.

Under another implementation, the subset division subunit may specifically be configured to:

Match the entity names of network resource entities obtained in advance against the titles of the entity resource webpages using longest matching, and divide the entity resource webpages into multiple subsets according to the match result.

The catalogue page acquiring unit 230 may specifically be configured to:

Count the number of occurrences of each starting-point webpage among multiple starting-point webpages, and determine the starting-point webpage whose number of occurrences meets a preset condition to be the catalogue page of the corresponding specific network resource entity at the corresponding website.

Under this implementation, the device for identifying a catalogue page of a network resource entity may further include:

Starting-point webpage filtering unit, configured to judge, after the multiple starting-point webpages have been obtained, whether they are related to the same network resource entity of the same website, and to filter out the unrelated starting-point webpages.

In addition, the process information acquiring unit 210 may include:

User click address filtering subunit, configured to filter the addresses of the webpages clicked open by the user and the addresses of their referring pages using entity names and/or entity resource addresses obtained in advance, so as to obtain, among those addresses, the addresses that match the entity names and/or the entity resource addresses.

Under this implementation, the device may further include an entity resource address acquiring unit, configured to obtain entity resource addresses in the following ways:

Extracting entity resource addresses according to the HTML (HyperText Markup Language) tag code of the hyperlinks in a known navigation page;

And/or,

Obtaining, from the user's webpage favorites folder, addresses containing particular keywords as entity resource addresses;

And/or,

Judging whether the directory name of a folder in the user's webpage favorites contains a particular keyword, and if so, extracting the addresses in that folder as entity resource addresses;

And/or,

Obtaining, as entity resource addresses, the addresses of websites whose homepage title contains a particular keyword.
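
The first of the options above, extracting addresses from the HTML tag code of hyperlinks in a navigation page, might look like the following sketch built on Python's standard `html.parser` module. The class name `LinkExtractor` and the sample markup are hypothetical; which links of a real navigation page count as entity resource addresses is not specified in the text and is left out here.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []    # collected (href, anchor text) pairs
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Placeholder navigation-page markup.
page = (
    '<ul><li><a href="http://comics.example/">Comics</a></li>'
    '<li><a href="http://video.example/">Video</a></li></ul>'
)
parser = LinkExtractor()
parser.feed(page)
links = parser.links
```

Keeping the anchor text next to each address is a convenience: the same pass could then feed a later step that derives entity names from anchor text.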

In addition, under another implementation, the device for identifying a catalogue page of a network resource entity may further include an entity name acquiring unit, configured to obtain entity names in the following way:

Performing noise-reduction filtering on anchor text and extracting entity names from the anchor text.

With the above device of the embodiment of the present invention, the process information about the entity resource webpages related to network resource entities that the user clicks open while browsing can be obtained, the entity access track of the user accessing a specific network resource entity is restored from this process information, the starting-point webpage address of the entity access track is determined from it, and finally the catalogue page of the network resource entity is determined according to this starting-point webpage address. The embodiment of the present invention thus provides a practicable way for a program to automatically identify the catalogue page of a network resource entity among a large number of webpages. The identification does not depend on text features of the catalogue page or the like, so the catalogue page of a network resource entity can be identified efficiently and accurately.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.

Each embodiment in this specification is described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device or system embodiments are described relatively briefly because they are substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

The method and device for identifying a catalogue page of a network resource entity provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and embodiments of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific embodiments and the scope of application according to the idea of the present invention. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. A method for identifying a catalogue page of a network resource entity, characterized by comprising:

2. The method according to claim 1, characterized in that the process information comprises the website to which the entity resource webpage belongs, the address of the entity resource webpage, and the address of the referring page (referer) from which the entity resource webpage was opened;

3. The method according to claim 2, characterized in that dividing the entity resource webpages into multiple subsets according to the network resource entity and the website to which each entity resource webpage corresponds comprises:

4. The method according to claim 1, characterized in that obtaining the starting-point webpage address on the entity access track and determining the catalogue page of the specific network resource entity according to the starting-point webpage address on the entity access track comprises:

5. The method according to claim 4, characterized by further comprising:

6. The method according to any one of claims 1-5, characterized in that obtaining the process information about entity resource webpages related to network resource entities that a user clicks open while browsing webpages comprises:

7. The method according to claim 6, characterized by further comprising obtaining the entity resource address in advance in the following ways:

And/or,

8. The method according to claim 6, characterized in that the entity name is obtained in advance in the following way:

9. A device for identifying a catalogue page of a network resource entity, characterized by comprising:

10. The device according to claim 9, characterized in that the process information comprises the website to which the entity resource webpage belongs, the address of the entity resource webpage, and the address of the referring page (referer) from which the entity resource webpage was opened;

The access track restoring unit comprises:

The catalogue page acquiring unit comprises:

11. The device according to claim 10, characterized in that the subset division subunit is specifically configured to:

12. The device according to claim 9, characterized in that the catalogue page acquiring unit is specifically configured to:

13. The device according to claim 12, characterized by further comprising:

14. The device according to any one of claims 9-13, characterized in that the process information acquiring unit comprises:

15. The device according to claim 14, characterized by further comprising an entity resource address acquiring unit, configured to obtain the entity resource address in the following ways:

And/or,

16. The device according to claim 14, characterized by further comprising an entity name acquiring unit, configured to obtain the entity name in the following way: