CN103605742B - Method and device for identifying catalog pages of network resource entities

Info

Publication number: CN103605742B
Application number: CN201310589670.7A
Authority: CN
Inventors: 崔华; 肖镜辉
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2013-11-20
Filing date: 2013-11-20
Publication date: 2017-07-04
Anticipated expiration: 2033-11-20
Also published as: CN103605742A

Abstract

The invention discloses a method and device for identifying the catalog page of a network resource entity. The method includes: obtaining process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing the web; restoring, from the process information, the entity access track along which the user accessed a particular network resource entity; and obtaining the starting-point webpage address on the entity access track and determining the catalog page of the particular network resource entity from that address. The invention can improve the scalability of catalog page identification.

Description

Method and device for identifying catalog pages of network resource entities

Technical field

The present invention relates to the field of webpage identification technology, and in particular to a method and device for identifying the catalog page of a network resource entity.

Background art

A web browser is software that displays files from a web server or file system and lets the user interact with them. It can display text, images, and other information from the World Wide Web or a local area network. The text and images can carry hyperlinks to other addresses, so the user can browse all kinds of information by clicking hyperlinks.

Among the many network resources available, there is a special class that is organized in units such as episodes, chapters, or sections, is continuous, and is updated periodically. For example, a TV series may release two episodes a day, and a comic may release one installment a week. For this kind of resource, each specific entity generally has a corresponding catalog page that shows the entry point for browsing each unit of the entity. For example, if an entity is a comic titled "the different energy fields of Area D", its catalog page shows a playback entry for each episode of the comic. These entries usually take the form of hyperlinks with anchor text such as "Episode 1" and "Episode 2", and the user can click an entry to jump to that episode and play it. If the author later updates the comic with new episodes, the catalog page shows entries for the new episodes. The user usually has to watch and check the catalog page actively to learn about the latest updates.

To reduce the effort required of the user, some browsers or browser plug-ins offer an update notification service for network resources. For example, the browser can monitor the update status of a network resource in the background and, when an update occurs, give the user a hyperlink to the latest resource, so that the user can reach the latest content directly by clicking it. This reduces the number of steps the user needs to get updates. For example, the latest episode of a TV series or the latest chapter of a comic can be offered to the user proactively.

Obtaining the update status of network resources in this way requires monitoring the catalog page of the network resource entity for updates. For the monitoring application, how to have a program automatically identify the catalog page of a network resource entity among a large number of webpages is a technical problem that must be solved. In the prior art, catalog pages are generally identified by analyzing the text content of a webpage against the textual features of catalog pages. For example, a catalog page usually contains regular text such as "Episode ××" or "Chapter ××", so whether a webpage is the catalog page of some network resource can be judged by checking whether its text matches these patterns. However, this text-based approach requires rules to be established in advance, and a webpage whose text does not satisfy the preset rules is filtered out, even though such a webpage may in fact be a catalog page. The scalability of the prior art is therefore poor.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method and device for identifying the catalog page of a network resource entity that overcome the above problems or at least partly solve them, and that can improve the scalability of catalog page identification.

According to one aspect of the present invention, a method for identifying the catalog page of a network resource entity is provided, comprising:

Obtaining process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing the web;

Restoring, from the process information, the entity access track along which the user accessed a particular network resource entity;

Obtaining the starting-point webpage address on the entity access track, and determining the catalog page of the particular network resource entity from the starting-point webpage address on the entity access track.

Optionally, the process information includes the website to which the entity resource webpage belongs, the address of the entity resource webpage, and the address of the referring page from which the entity resource webpage was clicked through to;

Restoring, from the process information, the entity access track along which the user accessed a particular network resource entity includes:

Dividing the entity resource webpages into multiple subsets according to the network resource entity each entity resource webpage corresponds to and the website it belongs to, where each subset contains multiple entity resource webpages related to the same network resource entity under the same website;

Within each subset, restoring, from the address of each entity resource webpage and the address of its referring page, the entity access track along which the user accessed the corresponding network resource entity under the corresponding website;

Obtaining the starting-point webpage address on the entity access track includes:

On an entity access track, comparing the referring page address of a target entity resource webpage with the addresses of the other entity resource webpages on the track; if the referring page address of the target entity resource webpage is identical to the address of any other entity resource webpage, determining the target entity resource webpage to be a non-starting-point webpage on the entity access track and deleting it from the track;

Repeating the previous step until no entity resource webpage on the entity access track has a referring page address identical to the address of another entity resource webpage;

Determining the referring pages of the entity resource webpages remaining on the entity access track to be the starting-point webpages on that track.
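The subset division and the repeated comparison-and-deletion steps above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the record layout (address, referring page address, entity, site), the function names, and the example.com-style addresses are all assumptions made for the example.

```python
from collections import defaultdict

def split_into_subsets(pages):
    """Group (url, refer, entity, site) records by (site, entity)."""
    subsets = defaultdict(list)
    for url, refer, entity, site in pages:
        subsets[(site, entity)].append((url, refer))
    return dict(subsets)

def find_start_pages(track):
    """Delete pages whose referring page is another page on the track,
    repeat until none is left, then return the referring pages of the
    pages that remain."""
    remaining = list(track)
    changed = True
    while changed:
        changed = False
        for i, (_, refer) in enumerate(remaining):
            others = {u for j, (u, _) in enumerate(remaining) if j != i}
            if refer in others:
                del remaining[i]      # non-starting-point page
                changed = True
                break                 # restart the comparison
    return sorted({refer for _, refer in remaining})

# Hypothetical browsing data: two chapters opened from a catalog page,
# and one chapter reached through a "next chapter" link.
pages = [
    ("http://comics.example/op/1", "http://comics.example/op/", "One Piece", "comics.example"),
    ("http://comics.example/op/2", "http://comics.example/op/1", "One Piece", "comics.example"),
    ("http://comics.example/op/5", "http://comics.example/op/", "One Piece", "comics.example"),
    ("http://manga.example/n/1", "http://manga.example/n/", "Naruto", "manga.example"),
]
subsets = split_into_subsets(pages)
starts = find_start_pages(subsets[("comics.example", "One Piece")])
```

In this example the second chapter is removed because its referring page is the first chapter, so the only starting-point webpage left for that subset is the page from which the chapters were originally opened.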

Optionally, dividing the entity resource webpages into multiple subsets according to the network resource entity each entity resource webpage corresponds to and the website it belongs to includes:

Matching the titles of the entity resource webpages against previously obtained entity names of network resource entities using a longest-match method, and dividing the entity resource webpages into multiple subsets according to the matching results.
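One possible reading of the longest-match step is sketched below: when several known entity names occur in a page title, the longest one is taken. The source does not spell out the matching procedure, so this interpretation, the function name, and the sample names are assumptions.

```python
def match_entity(title, entity_names):
    """Return the longest known entity name contained in the title, or None."""
    matches = [name for name in entity_names if name in title]
    return max(matches, key=len) if matches else None

# Hypothetical entity name set in which one name is a prefix of another.
entity_names = {"Naruto", "Naruto Shippuden", "One Piece"}
best = match_entity("Naruto Shippuden Episode 5", entity_names)
```

Preferring the longest name keeps a page about a sequel from being assigned to the shorter name of the original work.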

Optionally, obtaining the starting-point webpage address on the entity access track and determining the catalog page of the particular network resource entity from it includes:

Obtaining the multiple starting-point webpages on two or more entity access tracks that correspond to the same network resource entity of the same website;

Counting how many times each of the multiple starting-point webpages occurs, and determining the starting-point webpages whose occurrence count satisfies a preset condition to be the catalog page of the corresponding network resource entity on the corresponding website.
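The counting step can be sketched as below. The source only says that the occurrence count must satisfy a preset condition; the share-of-total threshold used here (`min_share`) is an assumed condition chosen for illustration, as are the function name and addresses.

```python
from collections import Counter

def pick_catalog_pages(start_pages, min_share=0.5):
    """Keep starting-point pages that account for at least min_share
    of all observed starting points (assumed preset condition)."""
    counts = Counter(start_pages)
    total = sum(counts.values())
    return [url for url, n in counts.most_common() if n / total >= min_share]

# Hypothetical starting points gathered from many access tracks for the
# same entity on the same site.
observed = ["http://comics.example/op/"] * 8 + ["http://search.example/?q=op"] * 2
catalog = pick_catalog_pages(observed)
```

Here the page that started eight of the ten tracks is kept, while the search result page that started only two is discarded.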

Optionally, the method further includes:

After the multiple starting-point webpages are obtained, judging whether each of them is related to the same network resource entity of the same website, and filtering out the unrelated starting-point webpages.

Optionally, obtaining process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing the web includes:

Obtaining the addresses of the webpages the user clicks through to while browsing, and the addresses of the referring pages corresponding to those webpages;

Filtering the addresses of the clicked webpages and of the referring pages using previously obtained entity names and/or entity resource addresses, to obtain those addresses that match the entity names and/or the entity resource addresses.

Optionally, the method further includes obtaining the entity resource addresses in advance in one or more of the following ways:

Extracting the entity resource addresses from the HyperText Markup Language (HTML) tag code of the hyperlinks in a known navigation page;

And/or,

Obtaining addresses containing particular keywords from the user's favorites folder as the entity resource addresses;

And/or,

Judging whether the name of a directory in the user's favorites folder contains a particular keyword and, if so, extracting the addresses in that directory as the entity resource addresses;

And/or,

Obtaining the addresses of websites whose homepage title contains a particular keyword as the entity resource addresses.

Optionally, the entity names are obtained in advance as follows:

Crawling the anchor text of the hyperlinks in a known network resource entity index page;

Applying noise-reduction filtering to the anchor text and extracting the entity names from it.

According to another aspect of the present invention, a device for identifying the catalog page of a network resource entity is provided, comprising:

A process information acquiring unit, configured to obtain process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing the web;

An access track restoring unit, configured to restore, from the process information, the entity access track along which the user accessed a particular network resource entity;

A catalog page acquiring unit, configured to obtain the starting-point webpage address on the entity access track and to determine the catalog page of the particular network resource entity from that address.

The access track restoring unit includes:

A subset division subunit, configured to divide the entity resource webpages into multiple subsets according to the network resource entity each entity resource webpage corresponds to and the website it belongs to, where each subset contains multiple entity resource webpages related to the same network resource entity under the same website;

An access track restoring subunit, configured to restore, within each subset, from the address of each entity resource webpage and the address of its referring page, the entity access track along which the user accessed the corresponding network resource entity under the corresponding website;

The catalog page acquiring unit includes:

A comparison and deletion subunit, configured to compare, on an entity access track, the referring page address of a target entity resource webpage with the addresses of the other entity resource webpages on the track and, if the referring page address of the target entity resource webpage is identical to the address of any other entity resource webpage, to determine the target entity resource webpage to be a non-starting-point webpage on the entity access track and delete it from the track;

A loop control subunit, configured to make the comparison and deletion subunit repeat its operation until no entity resource webpage on the entity access track has a referring page address identical to the address of another entity resource webpage;

A starting-point webpage determination subunit, configured to determine the referring pages of the entity resource webpages remaining on the entity access track to be the starting-point webpages on that track.

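Optionally, the subset division subunit is specifically configured to perform the longest-match division of entity resource webpages described for the method above.

Optionally, the catalog page acquiring unit is specifically configured to perform the counting and selection of starting-point webpages described for the method above.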

Optionally, the device further includes:

A starting-point webpage filtering unit, configured to judge, after the multiple starting-point webpages are obtained, whether each of them is related to the same network resource entity of the same website, and to filter out the unrelated starting-point webpages.

Optionally, the process information acquiring unit includes:

A user click address acquiring subunit, configured to obtain the addresses of the webpages the user clicks through to while browsing, and the addresses of the referring pages corresponding to those webpages;

A user click address filtering subunit, configured to filter the addresses of the clicked webpages and of the referring pages using previously obtained entity names and/or entity resource addresses, to obtain those addresses that match the entity names and/or the entity resource addresses.

Optionally, the device further includes an entity resource address acquiring unit, configured to obtain the entity resource addresses in one or more of the ways described for the method above.

Optionally, the device further includes an entity name acquiring unit, configured to obtain the entity names in the way described for the method above.

The method and device for identifying the catalog page of a network resource entity according to embodiments of the present invention obtain, while a user browses the web, process information about the entity resource webpages related to network resource entities that the user clicks through to; restore from this process information the entity access track along which the user accessed a particular network resource entity; determine the starting-point webpage address of that track; and finally determine the catalog page of the network resource entity from the starting-point webpage address. The embodiments thus provide a practicable scheme for having a program automatically identify the catalog page of a network resource entity among a large number of webpages. The identification does not depend on the textual features of the catalog page, identifies catalog pages efficiently and accurately, and has stronger scalability.

Further, the process information about accessed entity resource webpages can be collected from the browsing of users across the whole network. When the catalog page of a particular network resource entity is determined from the starting-point webpage addresses on entity access tracks, the access statistics of all these users can be taken into account: the multiple starting-point webpages obtained for the same network resource entity of the same website are screened according to network-wide statistics, and the starting-point webpages that satisfy the screening condition are determined to be the catalog page of that network resource entity on that website. This further improves the accuracy of catalog page determination.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and practiced according to the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:

Fig. 1 is a flowchart of a method for identifying the catalog page of a network resource entity according to an embodiment of the invention;

Fig. 2 is a schematic diagram of a device for identifying the catalog page of a network resource entity according to an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments fall within the scope of protection of the present invention.

In the embodiments of the present invention, the catalog page of a network resource entity is found from the page-to-page jump paths produced while users access webpages. The principle is this: when accessing a network resource entity, a user usually clicks from the catalog page through to a specific chapter page, and seldom returns from a chapter page to the catalog page. Given this behavior, the catalog page of a network resource entity can be obtained by statistically analyzing users' access paths to that entity. The implementation is described in detail below.

Referring to Fig. 1, an embodiment of the present invention first provides a method for identifying the catalog page of a network resource entity. The method may include the following steps:

S110: Obtain process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing the web;

Generally, while a user visits webpages with a browser, the browser or its plug-ins can record the user's visits in a form such as a browsing log. For example, after opening the browser, the user first opens a webpage A in a tab, then clicks a link in webpage A to open a webpage B, and so on. The browser or its plug-in can record, for each visited webpage, information such as its address and title. In addition, for a webpage such as B that was opened by following a link, it can record which webpage (called the referring page) contained the link. In this way, every webpage the user has visited is recorded together with its URL, title, referring page, and similar information. For ease of description, each visited webpage can be regarded as a point. At this stage the points are discrete: it is known which webpages the user browsed and some information about each (address, title, referring page address, and so on), but the relations between the points, that is, the order in which the webpages were clicked, are not yet known. The later steps restore the user's access track by linking the points according to the click relations, and then find the starting page of the entity access track. Only some of the webpages the user browsed are related to network resource entities, and only this portion needs to be extracted in order to judge which webpage or webpages are the catalog page of a network resource entity.
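As a rough illustration of what such a browsing record might hold, the sketch below models each visited webpage as a discrete point carrying its address, title, and referring page address. The class name, field names, and sample addresses are hypothetical; the source does not prescribe a log format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Visit:
    url: str                 # address of the visited webpage
    title: str               # title of the visited webpage
    referer: Optional[str]   # referring page address, None if opened directly

# Hypothetical log: page A opened directly in a tab, page B opened by
# clicking a link in A.
log = [
    Visit("http://comics.example/op/", "One Piece - all chapters", None),
    Visit("http://comics.example/op/1", "One Piece Chapter 1",
          "http://comics.example/op/"),
]

# Only pages reached through a link carry a referring page address.
clicked = [v for v in log if v.referer is not None]
```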

There are several ways to extract the webpages related to a particular network resource entity. In one of them, the entity names of some network resource entities are obtained in advance to form an entity name set. After the information about the webpages the user visited is obtained, the title of the webpage at each point is taken out; if the title contains text matching an entity name in the set, the point is related to accessing a network resource entity and can be extracted. The webpages a user visited may involve several different network resource entities. In that case, the points corresponding to each entity name are extracted separately, and the entity access tracks are later restored separately for each entity.

For example, suppose the user visited webpages A, B, C, D, E, and F. If the titles of B, C, and D all contain a certain entity name, then B, C, and D are taken as the points related to that network resource entity. If the titles of E and F contain another entity name, then E and F are taken as the points related to that other network resource entity, and so on. In short, a set of points can be collected for each entity name. These points need not have been recorded in a single browsing session, but they can be merged and analyzed together.

To judge whether the title of each point contains an entity name, an entity name set containing multiple entity names can be built in advance. It is then possible to check directly whether the title of a point contains text matching an entry in the set; if it does, the title contains an entity name. The entity name set can be built by crawling the anchor text of the hyperlinks in known network resource entity index pages, applying noise-reduction filtering to the anchor text, and extracting the entity names from it. This works because, in a webpage, a link to a network resource entity generally uses the entity name as its anchor text. For example, a navigation website may contain links to many TV series, and each link generally uses the name of the TV series directly as its anchor text so that users can recognize it. Such a navigation page is commonly called an index page of network resource entities. After the user clicks the link of a TV series in the index page, the catalog page of that series opens, showing the playback entry for each episode. Extracting the anchor text of such links therefore yields entity names. In other words, the content of certain network resource entity index pages can be crawled, the anchor text they contain extracted, and the extracted anchor text collected as the entity names of network resource entities. After a large number of entity names have been collected and processed by deduplication, noise reduction, and the like, the entity name set can be built and used later to screen user data, or to screen the addresses on webpage access tracks. Specifically, the obtained entity names are matched against the titles of the webpages the user clicked through to and the titles of the referring pages. If the title of a clicked webpage or of a referring page contains an entity name, the address is considered related to a network resource entity and is kept; otherwise the address is filtered out.
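A minimal sketch of building an entity name set from the anchor text of an index page is given below, using Python's standard html.parser module. The noise word list, the class and function names, and the sample HTML are assumptions; the source does not say which noise-reduction rules are applied.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect the text found inside <a>...</a> elements."""
    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self._buf = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._buf = []

    def handle_data(self, data):
        if self._in_anchor:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append("".join(self._buf).strip())
            self._in_anchor = False

# Assumed noise list: generic link labels that are not entity names.
NOISE = {"more", "home", "login"}

def build_entity_names(index_html):
    parser = AnchorTextParser()
    parser.feed(index_html)
    # The set removes duplicates; empty and noisy anchor text is dropped.
    return {t for t in parser.anchors if t and t.lower() not in NOISE}

index_html = (
    '<ul><li><a href="/op/">One Piece</a></li>'
    '<li><a href="/n/">Naruto</a></li>'
    '<li><a href="/op2/"> One Piece </a></li>'
    '<li><a href="/more">More</a></li></ul>'
)
names = build_entity_names(index_html)
```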

When the user visits entity resource webpages related to a network resource entity, the addresses of those entity resource webpages and of the corresponding referring pages can be recorded in the same way. Information such as the website to which each entity resource webpage belongs can also be obtained, and this information can then be used to restore the process by which the user accessed the network resource entity. To obtain the process information about the entity resource webpages the user clicked through to, the addresses of the webpages clicked while browsing and of their referring pages are obtained first; these addresses are then filtered using previously obtained entity names and/or entity resource addresses, yielding the clicked addresses and referring page addresses that match the entity names and/or the entity resource addresses.

For example, the entity resource addresses related to network resource entities can be obtained in advance, using any one or a combination of the following ways:

(1) Extract the entity resource addresses from the HyperText Markup Language (HTML) tag code of the hyperlinks in a known navigation page. Because some pages with a navigation function classify and organize webpages, the addresses of a particular category can be extracted from the HTML tags of a known navigation page for that category. For example, the addresses of novel webpages can be obtained by crawling the HTML of "http://123.sogou.com/xiaoshuo/" (Sogou's novel navigation page) and taking all URLs under <a> tags as addresses of the novel category. For the same category, addresses can also be crawled from several different navigation pages and deduplicated to give the final set of novel webpage addresses;

(2) Obtain addresses containing particular keywords from the user's favorites folder as entity resource addresses. For example, if the title of an address saved in the user's favorites contains the word "comic", that address is taken as an entity resource address. The user's favorites may include local favorites and network favorites;

(3) Judge whether the name of a directory in the user's favorites contains a particular keyword and, if so, extract the addresses in that directory as entity resource addresses. For example, if the name of a directory in the user's favorites contains the word "comic", the addresses saved under that directory are taken as entity resource addresses. The user's favorites may include local favorites, network favorites, and so on. Some users sort their saved webpages into categories when using the browser's favorites, so the favorites generally contain several directories, each named after a category of webpage; addresses of the corresponding category can therefore be extracted from directories with particular names. In practice, webpage addresses of the various categories are usually collected on the server side, whereas a user's favorites may be stored only on the user's local terminal device. With the user's permission, the information in the user's favorites can therefore be uploaded to the server for analysis. Also, in order to use their favorites on different terminal devices, users may use a network favorites function, in which the favorites are synchronized to the server, so that when the user uses the browser on a different terminal device, the favorites can be synchronized by signing in to the user's account. Where the user uses network favorites, the server can read the information in each user's favorites directly and then extract and collect the addresses of a particular category.

(4) Obtain the addresses of websites whose homepage title contains a particular keyword as entity resource addresses. For example, if the title of a website's homepage contains a keyword such as "comic" or "novel", the address of the website can be taken as an entity resource address. That is, if a website provides network resource entities and their catalog pages, the title of its homepage generally contains keywords for that kind of network resource entity, so the address of such a website can be obtained as an entity resource address.

In practice, the addresses of the webpages the user clicked through to and of the corresponding referring pages can be filtered using only the entity names, or only the entity resource addresses. They can also be filtered with both at once, which gives a better filtering result and reduces the amount of data the later steps must process.

S120: Restore, from the process information, the entity access track along which the user accessed a particular network resource entity;

After the process information about the entity resource webpages that the user clicked through to while browsing is obtained, the entity access track along which the user accessed a particular network resource entity can be restored from it. The process information can include the addresses of the entity resource webpages, which indicates which of the webpages the user clicked through to are related to an entity. These related webpages, however, are usually the pages that hold a specific episode or chapter of the network resource entity. For example, after opening the catalog page A of an entity, the user clicks the first episode of a TV series and browses its content in the resulting webpage B; webpage B is then also a webpage related to the network resource entity. What remains is to determine, from these related webpages, which webpage is likely to be the catalog page of the network resource entity. In the embodiments of the present invention, the entity access track of the user's access to a particular network resource entity is therefore restored first, that is, it is worked out how the user clicked through each related webpage, and the starting-point webpage address on the track is determined. The webpage at this starting-point address may be the catalog page of the network resource entity. Because the data of a single user may be affected by chance, the data of other users can also be aggregated before a final judgment is made, so the catalog page obtained from one user can be treated provisionally as a candidate catalog page of the network resource entity.
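The single and dual filtering options can be sketched as below. The source does not define how an address "matches" an entity resource address; comparing host names via urllib.parse is an assumption, as are the function name, the `require_both` flag, and the sample data.

```python
from urllib.parse import urlparse

def is_entity_related(title, url, entity_names, resource_hosts,
                      require_both=False):
    """Filter one page by entity name, by resource address, or by both."""
    name_hit = any(name in title for name in entity_names)
    host_hit = urlparse(url).netloc in resource_hosts   # assumed match rule
    if require_both:
        return name_hit and host_hit     # dual filtering
    return name_hit or host_hit          # either criterion alone

entity_names = {"One Piece"}
resource_hosts = {"comics.example"}

kept = is_entity_related("One Piece Chapter 1",
                         "http://comics.example/op/1",
                         entity_names, resource_hosts, require_both=True)
```

Dual filtering discards, for instance, a blog post whose title mentions an entity name but which is hosted on a site that is not a known entity resource address.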

The entity access track can be expressed as a set of tuples, for example as a set of 2-tuples:

{(url₁, refer₁), (url₂, refer₂), …, (url_i, refer_i), …}

Here url denotes the address of an entity resource webpage, and refer denotes the address of the referrer page from which that entity resource webpage was opened. The subscripts on url and refer do not indicate the order in which the webpages were accessed. In this embodiment of the invention, to make it easier to obtain the catalogue page of a network resource entity and to distinguish the catalogue pages of different network resource entities on different websites, the entity access track can be expressed as a set of 4-tuples. When the entity access track for a particular network resource entity is restored, each discrete point is first represented as a 4-tuple, for example (url, refer, entity, site), where url denotes the address of the entity resource webpage; refer denotes the address of the referrer page from which the entity resource webpage was opened; entity denotes the identifier of the network resource entity, which may be its name, such as the title of a serial, a comic, or a novel; and site denotes the website to which the entity resource webpage belongs. For example, suppose a user accessed webpage B, the title of webpage B contains the entity name M, and the user reached B by clicking a link in webpage A. Then in the 4-tuple for node B, url is the address of webpage B itself, refer is the address of webpage A, entity is the entity name M (for example "One Piece" or "Naruto"), and site is the website to which webpage B belongs.
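The 4-tuple described above could be modelled roughly as follows. This is only a minimal sketch: the names `Visit` and `make_visit` are hypothetical, and identifying the site by the host part of the URL is an assumption, since the text does not say how the site is derived.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class Visit(NamedTuple):
    """One discrete point on an access track: (url, refer, entity, site)."""
    url: str     # address of the entity resource webpage
    refer: str   # address of the referrer page it was opened from
    entity: str  # identifier (here: the name) of the network resource entity
    site: str    # website the entity resource webpage belongs to


def make_visit(url: str, refer: str, entity: str) -> Visit:
    # Assumption: the site is taken to be the host part of the URL.
    return Visit(url, refer, entity, urlsplit(url).netloc)


b = make_visit(
    "http://www.dm5.com/m129008/",
    "http://www.dm5.com/manhua-area-d-yinenglingyu/",
    "Area D",
)
print(b.site)
```

Keeping the point as a named tuple means it can still be unpacked positionally as (url, refer, entity, site), matching the notation used in the text.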

After each node has been represented as a 4-tuple, the entity resource webpages can be divided into multiple subsets according to the network resource entity each webpage corresponds to and the website it belongs to, so that each subset contains the entity resource webpages related to the same network resource entity on the same website. Within each subset, the entity access track along which the user accessed the corresponding network resource entity on the corresponding website is restored from the address of each entity resource webpage and the address of its referrer page. The entity access track of a user accessing the same network resource entity on the same website can be expressed as:

Set(same(entity, site)){(url₁, refer₁), (url₂, refer₂), …, (url_i, refer_i), …}

The subscripts on url and refer above carry no order information: the order of the subscripts does not represent the order in which the user accessed the entity resource webpages, and at this stage the points may still be discrete.
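The division into subsets keyed by (entity, site) can be sketched as a simple grouping step. The function name `split_into_subsets` and the sample values are illustrative only and do not come from the source.

```python
from collections import defaultdict


def split_into_subsets(visits):
    """Group (url, refer, entity, site) points by (entity, site).

    Each resulting subset holds the (url, refer) pairs of one network
    resource entity on one website; the pairs are still unordered.
    """
    subsets = defaultdict(list)
    for url, refer, entity, site in visits:
        subsets[(entity, site)].append((url, refer))
    return dict(subsets)


visits = [
    ("u1", "r1", "E1", "s1"),
    ("u2", "u1", "E1", "s1"),
    ("u3", "r3", "E2", "s1"),
    ("u4", "r4", "E1", "s2"),
]
subsets = split_into_subsets(visits)
print(sorted(subsets))
```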

When the entity resource webpages are divided into subsets according to their network resource entity and website, the title of each entity resource webpage can be matched against the previously obtained entity names of network resource entities using a longest-match method, and the webpages are divided into subsets according to the match result. For example, take two entity names, rendered here as "by liking" and "love is carried through to the end", where in the original language the characters of the first name are contained in the second. Without longest matching, a title containing the longer name would also match the shorter name "by liking", so both entity names would be matched, which does not correspond to the actual situation. Matching the titles of the entity resource webpages with the longest-match method distinguishes entity names that stand in such a character-inclusion relation.
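One simple reading of the longest-match rule is to pick, among all known entity names that occur in a page title, the longest one. The sketch below assumes plain substring matching, which the source does not specify; the names and titles are made up for illustration.

```python
def longest_match(title, entity_names):
    """Return the longest known entity name contained in title, or None."""
    hits = [name for name in entity_names if name in title]
    return max(hits, key=len) if hits else None


# Hypothetical entity names, one contained in the other.
names = ["Area", "Area D"]
print(longest_match("Area D chapter 12", names))
print(longest_match("Area news", names))
```

With this rule a title containing the longer name is assigned only to the longer name, even though the shorter name also occurs in it.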

S130: Obtain the starting-point webpage address on the entity access track and, according to that address, determine the catalogue page of the particular network resource entity.

To obtain the starting-point webpage address on the entity access track and determine the catalogue page of the particular network resource entity from it, the following processing can be performed:

On one entity access track, compare the referrer page address of a target entity resource webpage with the addresses of the other entity resource webpages on that track. If the referrer page address of the target entity resource webpage is identical to the address of any other entity resource webpage, the target entity resource webpage is determined to be a non-starting-point webpage on the entity access track and is deleted from the track;

The referrer page of each entity resource webpage remaining on the entity access track is determined to be a starting-point webpage on the track. This process is described in detail below.

On one entity access track, whenever url_i = refer_j holds for any two 2-tuples, the 2-tuple containing refer_j is deleted. For example, suppose the webpages related to some entity are A, B, C, and D, where A was opened from a search results page X, B was opened from a link in webpage A, C from a link in webpage B, and D from a link in webpage C. B, C, and D all contain the same entity name and belong to the same website, so (url₁ = B, refer₁ = A), (url₂ = C, refer₂ = B), and (url₃ = D, refer₃ = C). Since url₁ = refer₂ and url₂ = refer₃, B, C, and D are webpages on the same entity access track, and the track along which the user accessed the network resource entity can be restored as A->B->C->D. The points C and D, whose source pages are refer₂ and refer₃, can be deleted; the webpage A, which is the refer of the retained point B, can serve as the starting-point webpage on the entity access track and hence as a candidate catalogue page, and so on. In this way, for one user, multiple candidate catalogue pages may be extracted from the webpages related to the same entity on the same website.

That is, for one user, all webpages visited during browsing form a complete set I. After the set M of webpages related to network resource entities has been filtered out of I, the webpages in M can be divided by website into subsets Mi (where i = 1, 2, 3, …, up to the total number of websites contained in M), and each subset Mi can then be further divided by entity into subsets Mij (where j = 1, 2, 3, …, up to the number of entities contained in Mi). Each subset Mij thus contains the webpages related to the same network resource entity on the same website. Within each subset Mij, every webpage is represented as a 2-tuple of the form described above. If the source page refer of one 2-tuple is identical to the url of another 2-tuple, that source page cannot be the starting-point webpage on the entity access track, so the 2-tuple with that refer can be deleted. In the end a portion of the 2-tuples is retained in each subset Mij, and the referrer pages of these retained 2-tuples can serve as candidate catalogue pages of the corresponding network resource entity on the corresponding website.
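The deletion rule can be sketched in a few lines for one subset Mij. One assumption is made explicit here: each refer is compared against the url set of the original subset in a single pass, which reproduces the A, B, C, D example; the text itself does not fix the order in which deletions are applied. The function name is hypothetical.

```python
def candidate_catalogue_pages(pairs):
    """pairs: (url, refer) 2-tuples of one (entity, site) subset."""
    pairs = list(pairs)
    urls = {url for url, _ in pairs}
    # A pair whose refer is itself an entity resource page in the subset is
    # not a starting point; keep only pairs whose refer lies outside it.
    kept = [(url, refer) for url, refer in pairs if refer not in urls]
    # The referrer pages of the retained pairs are the candidates.
    return [refer for _, refer in kept]


track = [("B", "A"), ("C", "B"), ("D", "C")]
print(candidate_catalogue_pages(track))  # ['A']
```

If a user entered the same entity through two different pages, two pairs survive and both referrers are returned, which matches the remark that several candidate catalogue pages may be extracted for one user.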

Of course, in practice other methods of restoring the entity access path can also be used, for example restoring it from the access-time information of each webpage.

Through the above steps, the candidate catalogue pages on each website can be collected, for one user, for each network resource entity that the user accessed. Similarly, for other users, the candidate catalogue pages on each website can be collected for each network resource entity they accessed. The candidate catalogue pages collected from these users can then be aggregated to finally determine the catalogue page of the network resource entity. For example, when aggregating, the multiple starting-point webpages on two or more entity access tracks corresponding to the same network resource entity on the same website can be obtained, and the candidate catalogue pages of each user for the same network resource entity on the same website can be represented by the following set of 2-tuples:

Set(same(entity, site)){user₁(url_i, refer_i), user₂(url_i, refer_i), user₃(url_i, refer_i), …, user_n(url_i, refer_i)}

The number of occurrences of each starting-point webpage among the multiple starting-point webpages is then counted, and the starting-point webpage with the highest count, or whose share of the total count meets a preset condition, is determined to be the catalogue page of the network resource entity on that website. For example, a candidate catalogue page whose occurrences account for more than 50% of the total is taken as the catalogue page of the network resource entity on that website.
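The counting rule for one (entity, site) pair might look like the sketch below. The function name and the `min_share` parameter are hypothetical; the 50% default follows the example given in the text, and returning `None` when no page passes the threshold is a choice made here, not something the source prescribes.

```python
from collections import Counter


def pick_catalogue_page(candidates, min_share=0.5):
    """Pick the most frequent candidate if its share exceeds min_share."""
    candidates = list(candidates)
    if not candidates:
        return None
    page, n = Counter(candidates).most_common(1)[0]
    return page if n / len(candidates) > min_share else None


print(pick_catalogue_page(["A", "A", "A", "B"]))
print(pick_catalogue_page(["A", "B"]))
```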

For example, suppose Table 1 below shows the webpages accessed by one user in a browser experience program:

Table 1

url	refer
http://www.dm5.com/m129008/	http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/m129008-p2/	http://www.dm5.com/m129008/
http://news.baidu.com/	http://www.baidu.com/
http://www.dm5.com/m129008-p3/	http://www.dm5.com/m129008-p2/
http://www.dm5.com/m129008-p4/	http://www.dm5.com/m129008-p3/
……	……

Here http://news.baidu.com/ and http://www.baidu.com/ are webpages unrelated to any network resource entity, so they are removed from the webpage set. Among the remaining webpages, http://www.dm5.com/m129008/, http://www.dm5.com/m129008-p2/, http://www.dm5.com/m129008-p3/, and http://www.dm5.com/m129008-p4/ are webpages related to accessing the comic entity "Area D 异能领域". The path along which the entity was accessed can be restored as:

http://www.dm5.com/manhua-area-d-yinenglingyu/

->http://www.dm5.com/m129008/

->http://www.dm5.com/m129008-p2/

->http://www.dm5.com/m129008-p3/

……
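The ordered path shown above can be recovered from the unordered (url, refer) rows of Table 1 by following referrer links. The sketch below assumes a single linear chain within the subset and that unrelated rows have already been removed; `restore_path` is a hypothetical name.

```python
def restore_path(pairs):
    """Order the (url, refer) pairs of one subset into a click path."""
    pairs = list(pairs)
    urls = {url for url, _ in pairs}
    next_of = {refer: url for url, refer in pairs}  # referrer -> page opened from it
    # Starting point: a referrer that is not itself an entity resource page.
    starts = [refer for _, refer in pairs if refer not in urls]
    if not starts:
        return []
    path = [starts[0]]
    # Follow the chain; the second condition guards against cycles.
    while path[-1] in next_of and next_of[path[-1]] not in path:
        path.append(next_of[path[-1]])
    return path


SITE = "http://www.dm5.com/"
table_1 = [
    (SITE + "m129008/", SITE + "manhua-area-d-yinenglingyu/"),
    (SITE + "m129008-p2/", SITE + "m129008/"),
    (SITE + "m129008-p3/", SITE + "m129008-p2/"),
    (SITE + "m129008-p4/", SITE + "m129008-p3/"),
]
for page in restore_path(table_1):
    print(page)
```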

The starting-point webpage on this path, http://www.dm5.com/manhua-area-d-yinenglingyu/, can then be taken as a candidate catalogue page. Likewise, the webpage sets accessed by other users can be processed in the same way, so that for the same network resource entity the candidate catalogue pages on each website can be collected from the webpage set of each user. For example, from the access paths of the individual users to the comic entity "Area D 异能领域", the candidate catalogue pages of the comic are obtained as shown in Table 2:

Table 2

http://www.dm5.com/m115189-p5/
http://www.dm5.com/m115725/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.dm5.com/manhua-area-d-yinenglingyu/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/
http://www.imanhua.com/comic/3401/list_66230.html
http://www.imanhua.com/comic/3401/list_66401.html?p=17

It can be seen that, for this comic entity, multiple candidate catalogue pages have been collected on each of the two websites www.dm5.com and www.imanhua.com. On the website www.dm5.com, http://www.dm5.com/manhua-area-d-yinenglingyu/ occurs 8 times, while the total number of candidate catalogue pages on that website is 10; therefore http://www.dm5.com/manhua-area-d-yinenglingyu/ can be determined to be the catalogue page of the comic entity on the website www.dm5.com. Similarly, http://www.imanhua.com/comic/3401/ can be determined to be the catalogue page of the comic entity on the website www.imanhua.com.
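The figures quoted for Table 2 can be checked with a short per-site tally. Taking the host part of each URL as the site is an assumption of this sketch; the URLs themselves are the 18 entries of Table 2.

```python
from collections import Counter
from urllib.parse import urlsplit

DM5 = "http://www.dm5.com/"
IMH = "http://www.imanhua.com/comic/3401/"
candidates = (
    [DM5 + "m115189-p5/", DM5 + "m115725/"]
    + [DM5 + "manhua-area-d-yinenglingyu/"] * 8
    + [IMH] * 6
    + [IMH + "list_66230.html", IMH + "list_66401.html?p=17"]
)

# Tally candidate pages separately for each website.
per_site = {}
for url in candidates:
    per_site.setdefault(urlsplit(url).netloc, Counter())[url] += 1

shares = {}
for site, counts in per_site.items():
    page, n = counts.most_common(1)[0]
    total = sum(counts.values())
    shares[site] = (page, n, total)
    print(site, page, f"{n}/{total}")
```

Both leading pages exceed the 50% share used as an example threshold: 8 of 10 on www.dm5.com and 6 of 8 on www.imanhua.com.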

In addition, to further optimize the result, the candidate catalogue pages obtained from the statistics of each user can first be filtered. That is, after the multiple starting-point webpages are obtained, it is judged whether each of them is related to the same network resource entity on the same website, and unrelated starting-point webpages are filtered out. In a specific implementation, it can be judged whether a candidate catalogue page is an address of a search website, a website homepage address, or the like; if so, that candidate catalogue page is determined to be unrelated to the network resource entity and is filtered out.
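A possible form of this filter is sketched below. The list of search-site hosts is a hypothetical placeholder, and treating any URL with an empty or root path and no query string as a homepage is an assumption; the source names only the two categories of page to be filtered.

```python
from urllib.parse import urlsplit

# Hypothetical list of search-site hosts; not taken from the source.
SEARCH_HOSTS = {"www.baidu.com", "www.sogou.com"}


def is_unrelated(url):
    """True for search-site addresses and website homepages."""
    parts = urlsplit(url)
    is_search = parts.netloc in SEARCH_HOSTS
    is_homepage = parts.path in ("", "/") and not parts.query
    return is_search or is_homepage


def filter_candidates(urls):
    return [u for u in urls if not is_unrelated(u)]


kept = filter_candidates([
    "http://www.dm5.com/",
    "http://www.baidu.com/s?wd=x",
    "http://www.dm5.com/manhua-area-d-yinenglingyu/",
])
print(kept)
```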

In summary, in this embodiment of the invention, the process information of the entity resource webpages, related to network resource entities, that the user clicks through to while browsing is obtained; from this process information the entity access track along which the user accessed a particular network resource entity is restored, and the starting-point webpage address of the track is determined; finally the catalogue page of the network resource entity is determined from this starting-point webpage address. The embodiment thus provides a practicable way for a program to identify automatically the catalogue page of a network resource entity among many webpages. Because the identification does not depend on the text features of the catalogue page, the catalogue page of a network resource entity can be identified efficiently and accurately, and scalability is better.

Corresponding to the method for identifying the catalogue page of a network resource entity provided by the embodiment of the invention, the embodiment of the invention also provides a device for identifying the catalogue page of a network resource entity. Referring to Fig. 2, the device may include:

A process information acquiring unit 210, configured to obtain process information of the entity resource webpages, related to network resource entities, that a user clicks through to while browsing webpages;

An access track restoring unit 220, configured to restore, according to the process information, the entity access track along which the user accessed a particular network resource entity;

A catalogue page acquiring unit 230, configured to obtain the starting-point webpage address on the entity access track and, according to the starting-point webpage address on the entity access track, determine the catalogue page of the particular network resource entity.

The process information of accessing the entity resource webpages related to a network resource entity may include the website to which each entity resource webpage belongs, the address of the entity resource webpage, and the address of the referrer page from which the entity resource webpage was opened;

In this implementation, the access track restoring unit 220 may include:

A subset dividing subunit, configured to divide the entity resource webpages into multiple subsets according to the network resource entity each webpage corresponds to and the website it belongs to, where each subset contains multiple entity resource webpages related to the same network resource entity on the same website;

An access track restoring subunit, configured to restore, within one subset and according to the address of each entity resource webpage and the address of its referrer page, the entity access track along which the user accessed the corresponding network resource entity on the corresponding website;

The catalogue page acquiring unit 230 may include:

A compare-and-delete subunit, configured to compare, on one entity access track, the referrer page address of a target entity resource webpage with the addresses of the other entity resource webpages on that track and, if the referrer page address of the target entity resource webpage is identical to the address of any other entity resource webpage, to determine the target entity resource webpage to be a non-starting-point webpage on the entity access track and delete it from the track;

A loop control subunit, configured to control the compare-and-delete subunit to repeat its operation until no entity resource webpage on the entity access track has a referrer page address identical to the address of another entity resource webpage;

A starting-point webpage determining subunit, configured to determine the referrer page of each entity resource webpage remaining on the entity access track to be a starting-point webpage on the track.

In another implementation, the subset dividing subunit may be specifically configured to:

Match the titles of the entity resource webpages against the previously obtained entity names of network resource entities using a longest-match method, and divide the entity resource webpages into multiple subsets according to the match result.

The catalogue page acquiring unit 230 may be specifically configured to:

Count the number of occurrences of each starting-point webpage among the multiple starting-point webpages, and determine the starting-point webpage whose number of occurrences meets a preset condition to be the catalogue page of the corresponding particular network resource entity on the corresponding website.

In this implementation, the device for identifying the catalogue page of a network resource entity may also include:

A starting-point webpage filtering unit, configured to judge, after the multiple starting-point webpages are obtained, whether each of them is related to the same network resource entity on the same website, and to filter out unrelated starting-point webpages.

In addition, the process information acquiring unit 210 may include:

A user click address filtering subunit, configured to filter the addresses of the webpages the user clicked through to, and the addresses of their referrer pages, against previously obtained entity names and/or entity resource addresses, so as to obtain those addresses that match an entity name and/or an entity resource address.

In this implementation, the device may also include an entity resource address acquiring unit, configured to obtain entity resource addresses in the following ways:

Extracting entity resource addresses from the HTML tag code of the hyperlinks in a known navigation page;

And/or,

Obtaining, from the user's webpage favorites folder, addresses that contain particular keywords as entity resource addresses;

And/or,

Judging whether a directory name in the user's webpage favorites contains particular keywords and, if so, extracting the addresses in that directory as entity resource addresses;

And/or,

Obtaining, as entity resource addresses, the addresses of websites whose homepage title contains particular keywords.
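The first of these ways, reading hyperlinks out of a navigation page, could be sketched with the Python standard library as follows. The class and function names and the sample HTML are illustrative; which navigation pages are "known" and how their links are vetted is left open by the source.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up navigation page fragment.
nav_html = (
    '<ul><li><a href="http://www.dm5.com/">DM5 comics</a></li>'
    '<li><a href="http://www.imanhua.com/">imanhua</a></li></ul>'
)
print(extract_links(nav_html))
```

The anchor text is kept alongside each address because it can also feed the entity-name extraction described for the entity name acquiring unit.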

In addition, in another implementation, the device for identifying the catalogue page of a network resource entity may also include an entity name acquiring unit, configured to obtain entity names in the following way:

Performing noise-reduction filtering on anchor text and extracting the entity name from the anchor text.
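The source does not say what counts as noise in anchor text. As one possible illustration, the sketch below strips episode or chapter markers, a few promotional words, and brackets; all of these patterns are assumptions made for the example, as is the function name.

```python
import re

# Hypothetical noise patterns; the source does not list them.
NOISE = re.compile(
    r"\b(episode|chapter|vol\.?)\s*\d+\b"   # unit markers such as "Chapter 12"
    r"|\b(latest|updated|hd|online)\b"      # promotional words
    r"|[\[\](){}|]",                        # brackets and separators
    re.IGNORECASE,
)


def entity_name_from_anchor(text):
    """Remove noise from anchor text and collapse whitespace."""
    cleaned = NOISE.sub(" ", text)
    return " ".join(cleaned.split())


print(entity_name_from_anchor("Area D Chapter 12 [latest]"))
print(entity_name_from_anchor("One Piece Episode 5 HD"))
```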

With the above device of the embodiment of the invention, the process information of the entity resource webpages, related to network resource entities, that the user clicks through to while browsing can be obtained; from this process information the entity access track along which the user accessed a particular network resource entity is restored, and the starting-point webpage address of the track is determined; finally the catalogue page of the network resource entity is determined from this starting-point webpage address. The embodiment thus provides a practicable way for a program to identify automatically the catalogue page of a network resource entity among many webpages. Because the identification does not depend on the text features of the catalogue page, the catalogue page of a network resource entity can be identified efficiently and accurately.

From the above description of the embodiments, those skilled in the art can clearly understand that the invention can be implemented by software plus the necessary general-purpose hardware platform. On this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the invention or in certain parts of the embodiments.

The embodiments in this specification are described progressively. For parts that are the same or similar across embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, because the device and system embodiments are substantially similar to the method embodiments, they are described briefly; for related details, see the description of the method embodiments. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

The method and device for identifying the catalogue page of a network resource entity provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims

1. A method for identifying the catalogue page of a network resource entity, characterized by comprising:

Obtaining process information of the entity resource webpages, related to network resource entities, that a user clicks through to while browsing webpages;

Obtaining the starting-point webpage address on the entity access track and, according to the starting-point webpage address on the entity access track, determining the catalogue page of the particular network resource entity, specifically comprising:

Obtaining multiple starting-point webpages on two or more entity access tracks corresponding to the same network resource entity on the same website;

Counting the number of occurrences of each starting-point webpage among the multiple starting-point webpages, and determining the starting-point webpage whose number of occurrences meets a preset condition to be the catalogue page of the corresponding particular network resource entity on the corresponding website.

2. The method according to claim 1, characterized in that the process information includes the website to which the entity resource webpage belongs, the address of the entity resource webpage, and the address of the referrer page from which the entity resource webpage was opened;

Restoring, according to the process information, the entity access track along which the user accessed a particular network resource entity comprises:

Dividing the entity resource webpages into multiple subsets according to the network resource entity each entity resource webpage corresponds to and the website it belongs to, where each subset contains multiple entity resource webpages related to the same network resource entity on the same website;

Restoring, within one subset and according to the address of each entity resource webpage and the address of its referrer page, the entity access track along which the user accessed the corresponding network resource entity on the corresponding website;

On one entity access track, comparing the referrer page address of a target entity resource webpage with the addresses of the other entity resource webpages on that track and, if the referrer page address of the target entity resource webpage is identical to the address of any other entity resource webpage, determining the target entity resource webpage to be a non-starting-point webpage on the entity access track and deleting it from the access track;

Repeating the previous step until no entity resource webpage on the entity access track has a referrer page address identical to the address of another entity resource webpage;

Determining the referrer page of each entity resource webpage remaining on the entity access track to be a starting-point webpage on the entity access track.

3. The method according to claim 2, characterized in that dividing the entity resource webpages into multiple subsets according to the network resource entity each entity resource webpage corresponds to and the website it belongs to comprises:

Matching the titles of the entity resource webpages against the previously obtained entity names of network resource entities using a longest-match method, and dividing the entity resource webpages into multiple subsets according to the match result.

4. The method according to claim 1, characterized by further comprising:

After the multiple starting-point webpages are obtained, judging whether each of the multiple starting-point webpages is related to the same network resource entity on the same website, and filtering out unrelated starting-point webpages.

5. The method according to any one of claims 1-4, characterized in that obtaining process information of the entity resource webpages, related to network resource entities, that a user clicks through to while browsing webpages comprises:

Obtaining the addresses of the webpages the user clicks through to while browsing, and the addresses of the referrer pages corresponding to those webpages;

Filtering the addresses of the webpages the user clicked through to, and the addresses of the referrer pages, against previously obtained entity names and/or entity resource addresses, so as to obtain those addresses that match the entity names and/or the entity resource addresses.

6. The method according to claim 5, characterized by further comprising obtaining the entity resource addresses in advance in the following ways:

Extracting the entity resource addresses from the HTML tag code of the hyperlinks in a known navigation page;

And/or,

Judging whether a directory name in the user's webpage favorites contains particular keywords and, if so, extracting the addresses in that directory as the entity resource addresses;

And/or,

7. The method according to claim 5, characterized in that the entity names are obtained in advance in the following way:

8. A device for identifying the catalogue page of a network resource entity, characterized by comprising:

A process information acquiring unit, configured to obtain process information of the entity resource webpages, related to network resource entities, that a user clicks through to while browsing webpages;

An access track restoring unit, configured to restore, according to the process information, the entity access track along which the user accessed a particular network resource entity;

A catalogue page acquiring unit, configured to obtain the starting-point webpage address on the entity access track and, according to the starting-point webpage address on the entity access track, determine the catalogue page of the particular network resource entity, and specifically configured to:

9. The device according to claim 8, characterized in that the process information includes the website to which the entity resource webpage belongs, the address of the entity resource webpage, and the address of the referrer page from which the entity resource webpage was opened;

The access track restoring unit comprises:

A subset dividing subunit, configured to divide the entity resource webpages into multiple subsets according to the network resource entity each entity resource webpage corresponds to and the website it belongs to, where each subset contains multiple entity resource webpages related to the same network resource entity on the same website;

An access track restoring subunit, configured to restore, within one subset and according to the address of each entity resource webpage and the address of its referrer page, the entity access track along which the user accessed the corresponding network resource entity on the corresponding website;

The catalogue page acquiring unit comprises:

A compare-and-delete subunit, configured to compare, on one entity access track, the referrer page address of a target entity resource webpage with the addresses of the other entity resource webpages on that track and, if the referrer page address of the target entity resource webpage is identical to the address of any other entity resource webpage, to determine the target entity resource webpage to be a non-starting-point webpage on the entity access track and delete it from the access track;

A loop control subunit, configured to control the compare-and-delete subunit to repeat its operation until no entity resource webpage on the entity access track has a referrer page address identical to the address of another entity resource webpage;

A starting-point webpage determining subunit, configured to determine the referrer page of each entity resource webpage remaining on the entity access track to be a starting-point webpage on the entity access track.

10. The device according to claim 9, characterized in that the subset dividing subunit is specifically configured to:

11. The device according to claim 8, characterized by further comprising:

A starting-point webpage filtering unit, configured to judge, after the multiple starting-point webpages are obtained, whether each of the multiple starting-point webpages is related to the same network resource entity on the same website, and to filter out unrelated starting-point webpages.

12. The device according to any one of claims 8-11, characterized in that the process information acquiring unit comprises:

A user click address filtering subunit, configured to filter the addresses of the webpages the user clicked through to, and the addresses of the referrer pages, against previously obtained entity names and/or entity resource addresses, so as to obtain those addresses that match the entity names and/or the entity resource addresses.

13. The device according to claim 12, characterized by further comprising an entity resource address acquiring unit, configured to obtain the entity resource addresses in the following ways:

And/or,

14. The device according to claim 12, characterized by further comprising an entity name acquiring unit, configured to obtain the entity names in the following way: