CN103605742B - Method and device for identifying the catalogue page of a network resource entity - Google Patents
- Publication number
- CN103605742B CN201310589670.7A
- Authority
- CN
- China
- Prior art keywords
- webpage
- entity
- address
- actual resource
- resource
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method and device for identifying the catalogue page of a network resource entity. The method includes: obtaining process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing webpages; restoring, according to the process information, the entity access track along which the user accessed a particular network resource entity; and obtaining the starting-point webpage address on the entity access track and determining the catalogue page of the particular network resource entity according to that address. The invention can improve the scalability of catalogue page identification.
Description
Technical field
The present invention relates to the field of webpage identification technology, and in particular to a method and device for identifying the catalogue page of a network resource entity.
Background art
A web browser is software that displays files from a web server or file system and lets the user interact with them. It can display text, images, and other information on the World Wide Web or a local area network. The text or images may be hyperlinks to other addresses, so the user can browse all kinds of information by clicking hyperlinks.
Among the many available network resources there is a special class that is organized in units such as episodes, chapters, or sections, is continuous, and is updated periodically. For example, a TV series may release two episodes per day, and a comic one instalment per week. For this kind of resource, each specific entity generally corresponds to a catalogue page that shows a browsing entry for each unit of the entity. For example, if an entity is a comic titled "the different energy fields of Area D", the catalogue page of the comic shows a playback entry for each of its episodes. These entries generally take the form of hyperlinks with anchor text such as "Episode 1" and "Episode 2"; by clicking an entry, the user jumps to that episode and plays it. If the author later updates the comic and produces new episodes, playback entries for the new episodes appear on the catalogue page. Usually the user has to actively watch and check the catalogue page for updates in order to obtain the latest content of the network resource.
To save effort for the user, some browsers or browser plug-ins provide an update notification service for network resources. For example, the browser can monitor the update status of a network resource in the background and, when an update occurs, offer the user a hyperlink to the latest resource. The user obtains the latest content by clicking the hyperlink directly, which reduces the number of steps needed to obtain resource updates. For example, the newest TV episode or the newest comic chapter is proactively offered to the user.
Obtaining the update status of a network resource in this way requires monitoring the update status of the catalogue page of the network resource entity. For the monitoring application, how to have a program automatically identify the catalogue page of a network resource entity among a large number of webpages is a technical problem that must be solved in practice. In the prior art, catalogue pages are generally identified by analysing the text content of a webpage against the text features of catalogue pages. For example, a catalogue page usually contains text that follows certain patterns, such as "Episode ××" or "Chapter ××", so whether a webpage is the catalogue page of a network resource can be judged by checking whether its text content contains text matching those patterns. However, this text-based judgment requires rules to be established in advance, and a webpage whose text does not satisfy the preset rules is filtered out, even though such a webpage may in fact be a catalogue page. The scalability of the prior art is therefore poor.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for identifying the catalogue page of a network resource entity that overcome, or at least partially solve, the above problems and that can improve the scalability of catalogue page identification.
According to one aspect of the present invention, a method for identifying the catalogue page of a network resource entity is provided, characterized by comprising:
obtaining process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing webpages;
restoring, according to the process information, the entity access track along which the user accessed a particular network resource entity;
obtaining the starting-point webpage address on the entity access track, and determining the catalogue page of the particular network resource entity according to the starting-point webpage address on the entity access track.
Optionally, the process information includes the site to which the entity resource webpage belongs, the address of the entity resource webpage, and the address of the referring page from which the entity resource webpage was clicked through to;
restoring, according to the process information, the entity access track along which the user accessed a particular network resource entity comprises:
dividing the entity resource webpages into multiple subsets according to the network resource entity to which each entity resource webpage corresponds and the site to which it belongs, wherein each subset contains multiple entity resource webpages related to the same network resource entity under the same site;
within one subset, restoring, according to the address of each entity resource webpage and the address of its referring page, the entity access track along which the user accessed the corresponding network resource entity under the corresponding site;
obtaining the starting-point webpage address on the entity access track comprises:
on one entity access track, comparing the referring-page address of a target entity resource webpage with the addresses of the other entity resource webpages on the track; if the referring-page address of the target entity resource webpage is identical to the address of any other entity resource webpage, determining that the target entity resource webpage is a non-starting-point webpage on the entity access track, and deleting it from the track;
repeating the previous step until no entity resource webpage on the entity access track has a referring-page address identical to the address of another entity resource webpage;
determining the referring pages of the entity resource webpages remaining on the entity access track as the starting-point webpages on the entity access track.
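The comparison-and-deletion steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Visit` type and the function name are hypothetical, and each referring page is compared against all addresses originally on the track (rather than against a shrinking track), an interpretation chosen so that the result does not depend on the order in which pages are deleted.

```python
from typing import List, NamedTuple, Set


class Visit(NamedTuple):
    """One entity resource webpage on an access track (hypothetical structure)."""
    url: str       # address of the entity resource webpage that was opened
    referrer: str  # address of the referring page whose link was clicked


def starting_pages(track: List[Visit]) -> Set[str]:
    """Return the referring pages of the visits that survive deletion."""
    urls = {v.url for v in track}
    # A visit reached from another page on the same track is a
    # non-starting-point page and is dropped from the track.
    remaining = [v for v in track if v.referrer not in urls - {v.url}]
    # The referring pages of what remains are the starting-point pages.
    return {v.referrer for v in remaining}
```

For a track in which a directory page leads to chapter 1, chapter 1 to chapter 2, and chapter 2 to chapter 3, only chapter 1 survives, and its referring page (the directory page) is returned as the single starting point.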
Optionally, dividing the entity resource webpages into multiple subsets according to the network resource entity to which each entity resource webpage corresponds and the site to which it belongs comprises:
matching the entity names of network resource entities obtained in advance against the titles of the entity resource webpages using a longest-match method, and dividing the entity resource webpages into multiple subsets according to the matching result.
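A minimal sketch of longest-match grouping follows, assuming that "matching" means plain substring containment and that a subset is keyed by site and entity name. The function names and the `(site, title, url)` tuple layout are hypothetical and not taken from the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


def longest_entity_match(title: str, entity_names: Set[str]) -> Optional[str]:
    """Return the longest known entity name contained in the title, if any."""
    matches = [name for name in entity_names if name in title]
    return max(matches, key=len) if matches else None


def partition(pages: Iterable[Tuple[str, str, str]],
              entity_names: Set[str]) -> Dict[Tuple[str, str], List[str]]:
    """Group (site, title, url) records into subsets keyed by (site, entity)."""
    subsets: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for site, title, url in pages:
        entity = longest_entity_match(title, entity_names)
        if entity is not None:  # pages matching no entity name are skipped
            subsets[(site, entity)].append(url)
    return dict(subsets)
```

Preferring the longest match keeps a page titled with a longer name (for example a sequel) out of the subset of a shorter name that happens to be its prefix.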
Optionally, obtaining the starting-point webpage address on the entity access track, and determining the catalogue page of the particular network resource entity according to the starting-point webpage address on the entity access track, comprises:
obtaining multiple starting-point webpages on two or more entity access tracks corresponding to the same network resource entity of the same site;
counting the number of occurrences of each starting-point webpage among the multiple starting-point webpages, and determining a starting-point webpage whose number of occurrences meets a preset condition as the catalogue page of the corresponding particular network resource entity on the corresponding site.
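The counting step might look like the sketch below. The text does not specify the preset condition, so the minimum-count threshold `min_count` used here is purely an assumption for illustration, as is the function name.

```python
from collections import Counter
from typing import Iterable, List


def pick_catalogue_pages(start_pages: Iterable[str], min_count: int = 2) -> List[str]:
    """Keep starting-point pages seen at least min_count times (assumed condition).

    start_pages holds the starting-point page of every access track collected
    for one network resource entity on one site, with repeats preserved.
    """
    counts = Counter(start_pages)
    # most_common() orders candidates from most to least frequent.
    return [url for url, n in counts.most_common() if n >= min_count]
```

Other conditions, such as keeping only the single most frequent page or requiring a minimum share of all tracks, would fit the same structure.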
Optionally, the method further comprises:
after the multiple starting-point webpages are obtained, judging whether each of them is related to the same network resource entity of the same site, and filtering out unrelated starting-point webpages.
Optionally, obtaining process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing webpages comprises:
obtaining the addresses of the webpages the user clicks through to while browsing, and the addresses of the referring pages corresponding to those webpages;
filtering the addresses of the clicked webpages and of the referring pages with entity names and/or entity resource addresses obtained in advance, to obtain those addresses, among the clicked addresses and the referring-page addresses, that match the entity names and/or the entity resource addresses.
Optionally, the method further comprises obtaining the entity resource addresses in advance in one or more of the following ways:
extracting the entity resource addresses from the HTML (Hypertext Markup Language) tag code of hyperlinks in known navigation pages;
and/or,
obtaining addresses containing particular keywords from the user's favorites folder as the entity resource addresses;
and/or,
judging whether the directory names in the user's favorites contain particular keywords and, if so, extracting the addresses in those directories as the entity resource addresses;
and/or,
obtaining the addresses of sites whose homepage titles contain particular keywords as the entity resource addresses.
Optionally, the entity names are obtained in advance in the following way:
crawling the anchor text of hyperlinks in known network resource entity index pages;
applying noise-reduction filtering to the anchor text, and extracting the entity names from the anchor text.
According to another aspect of the present invention, a device for identifying the catalogue page of a network resource entity is provided, characterized by comprising:
a process information acquiring unit, configured to obtain process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing webpages;
an access track restoring unit, configured to restore, according to the process information, the entity access track along which the user accessed a particular network resource entity;
a catalogue page acquiring unit, configured to obtain the starting-point webpage address on the entity access track and to determine the catalogue page of the particular network resource entity according to the starting-point webpage address on the entity access track.
Optionally, the process information includes the site to which the entity resource webpage belongs, the address of the entity resource webpage, and the address of the referring page from which the entity resource webpage was clicked through to;
the access track restoring unit comprises:
a subset dividing subunit, configured to divide the entity resource webpages into multiple subsets according to the network resource entity to which each entity resource webpage corresponds and the site to which it belongs, wherein each subset contains multiple entity resource webpages related to the same network resource entity under the same site;
an access track restoring subunit, configured to restore, within one subset and according to the address of each entity resource webpage and the address of its referring page, the entity access track along which the user accessed the corresponding network resource entity under the corresponding site;
the catalogue page acquiring unit comprises:
a compare-and-delete subunit, configured to compare, on one entity access track, the referring-page address of a target entity resource webpage with the addresses of the other entity resource webpages on the track and, if the referring-page address of the target entity resource webpage is identical to the address of any other entity resource webpage, to determine that the target entity resource webpage is a non-starting-point webpage on the entity access track and delete it from the track;
a loop control subunit, configured to control the compare-and-delete subunit to repeat its operation until no entity resource webpage on the entity access track has a referring-page address identical to the address of another entity resource webpage;
a starting-point webpage determining subunit, configured to determine the referring pages of the entity resource webpages remaining on the entity access track as the starting-point webpages on the entity access track.
Optionally, the subset dividing subunit is specifically configured to:
match the entity names of network resource entities obtained in advance against the titles of the entity resource webpages using a longest-match method, and divide the entity resource webpages into multiple subsets according to the matching result.
Optionally, the catalogue page acquiring unit is specifically configured to:
obtain multiple starting-point webpages on two or more entity access tracks corresponding to the same network resource entity of the same site;
count the number of occurrences of each starting-point webpage among the multiple starting-point webpages, and determine a starting-point webpage whose number of occurrences meets a preset condition as the catalogue page of the corresponding particular network resource entity on the corresponding site.
Optionally, the device further comprises:
a starting-point webpage filtering unit, configured to judge, after the multiple starting-point webpages are obtained, whether each of them is related to the same network resource entity of the same site, and to filter out unrelated starting-point webpages.
Optionally, the process information acquiring unit comprises:
a user click address acquiring subunit, configured to obtain the addresses of the webpages the user clicks through to while browsing, and the addresses of the referring pages corresponding to those webpages;
a user click address filtering subunit, configured to filter the addresses of the clicked webpages and of the referring pages with entity names and/or entity resource addresses obtained in advance, to obtain those addresses, among the clicked addresses and the referring-page addresses, that match the entity names and/or the entity resource addresses.
Optionally, the device further comprises an entity resource address acquiring unit, configured to obtain the entity resource addresses in one or more of the following ways:
extracting the entity resource addresses from the HTML (Hypertext Markup Language) tag code of hyperlinks in known navigation pages;
and/or,
obtaining addresses containing particular keywords from the user's favorites folder as the entity resource addresses;
and/or,
judging whether the directory names in the user's favorites contain particular keywords and, if so, extracting the addresses in those directories as the entity resource addresses;
and/or,
obtaining the addresses of sites whose homepage titles contain particular keywords as the entity resource addresses.
Optionally, the device further comprises an entity name acquiring unit, configured to obtain the entity names in the following way:
crawling the anchor text of hyperlinks in known network resource entity index pages;
applying noise-reduction filtering to the anchor text, and extracting the entity names from the anchor text.
With the method and device for identifying the catalogue page of a network resource entity according to the embodiments of the present invention, process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing can be obtained; the entity access track along which the user accessed a particular network resource entity is restored from this process information; the starting-point webpage address of the entity access track is determined from the track; and the catalogue page of the network resource entity is finally determined from this starting-point webpage address. The embodiments thus provide a practicable way for a program to automatically identify the catalogue page of a network resource entity among a large number of webpages. The identification process does not depend on the text features of catalogue pages; it identifies the catalogue pages of network resource entities efficiently and accurately and has better scalability.
Further, the process information about accessed entity resource webpages can be collected from the browsing of users across the whole network. When the catalogue page of a particular network resource entity is then determined from the starting-point webpage addresses on the entity access tracks, the access statistics of all these users can be taken into account: the multiple starting-point webpages obtained for the same network resource entity of the same site are screened according to the network-wide statistics, and the starting-point webpages that satisfy the screening condition are determined as the catalogue page of the corresponding particular network resource entity on the corresponding site. This further improves the accuracy of catalogue page determination.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a flowchart of a method for identifying the catalogue page of a network resource entity according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a device for identifying the catalogue page of a network resource entity according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention fall within the scope of protection of the invention.
In the embodiments of the present invention, the catalogue page of a network resource entity is found from the page-jump paths produced while users visit webpages. The principle is as follows: when accessing a network resource entity, a user usually clicks from the catalogue page out to a specific chapter page and seldom returns from a chapter page to the catalogue page. Based on this characteristic, the catalogue page of a network resource entity can be obtained by statistically analysing users' access paths to that entity. The specific implementation is described in detail below.
Referring to Fig. 1, an embodiment of the present invention first provides a method for identifying the catalogue page of a network resource entity. The method may include the following steps:
S110: Obtaining process information about the entity resource webpages, related to network resource entities, that a user clicks through to while browsing webpages;
Generally, while a user visits webpages with a browser, the browser or a plug-in can record the user's visits in a form such as a browsing log. For example, after opening the browser, the user first opens webpage A in a tab and then opens webpage B by clicking a link in webpage A, and so on. The browser or its plug-in can record information about each visited webpage, such as its address and title. For a webpage such as B that is opened by following a link, it can also record which webpage (called the referring page) contained the link. In this way, every webpage visited by the user can be recorded together with information such as its URL, title, and referring page. For ease of description, each visited webpage can be regarded as a point. At this stage the points are discrete: it is known which webpages the user browsed and some information about each (its address, title, referring-page address, and so on), but the relationships between the points, that is, the order in which the webpages were clicked, are not yet known. The subsequent steps restore the user's access track, connecting the points into a track according to the click relationships, and find the starting page on the entity access track. Only some of the webpages browsed by the user are related to network resource entities, and only after these are extracted can it be judged which webpage or webpages are the catalogue page of a network resource entity.
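The per-page information described above could be held in a record such as the following. This is a hypothetical sketch: the field names are not from the patent, and deriving the site from the host part of the URL is an assumption.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ClickRecord:
    """One visited webpage, i.e. one discrete "point" (hypothetical layout)."""
    url: str                 # address of the visited webpage
    title: str               # title of the visited webpage
    referrer: Optional[str]  # referring-page address; None if opened directly

    @property
    def site(self) -> str:
        # Assumption: the site is identified by the host part of the address.
        return urlsplit(self.url).netloc


record = ClickRecord(
    url="http://video.example/show/ep2",
    title="Show Episode 2",
    referrer="http://video.example/show/",
)
```

A record carries no ordering information by itself; the order of clicks is recovered later by linking each record's referring page to the address of another record.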
The webpages related to a particular network resource entity can be extracted in several ways. In one of them, the entity names of some network resource entities are obtained in advance to form an entity name set. After the information about the webpages visited by the user is obtained, the title of the webpage at each point is examined; if the title contains information matching an entity name in the set, the point is related to accessing a network resource entity and can be extracted. The webpages visited by one user may of course involve several different network resource entities. In that case, the points corresponding to each entity name are extracted separately, and the entity access tracks are later restored separately for each entity.
For example, suppose the user has visited webpages A, B, C, D, E, and F, and it is found that the titles of B, C, and D all contain a certain entity name; B, C, and D can then be taken as the points related to that network resource entity. If the titles of E and F contain another entity name, E and F can be taken as the points related to that other network resource entity, and so on. In short, for each entity name a set of points can be collected. These points need not have been recorded in a single browsing session, but they can be gathered together for statistics.
To judge whether the title of each point contains an entity name, an entity name set containing multiple entity names can be established in advance. It is then possible to check directly whether the title of a point contains content matching an entry in the set; if it does, the title of the point contains an entity name. One way to build the entity name set is to crawl the anchor text of hyperlinks in known network resource entity index pages, apply noise-reduction filtering to the anchor text, and extract the entity names from it. The reason is that, in a webpage, the link to a network resource entity generally uses the entity name as its anchor text. For example, a navigation site may contain links to many TV series, and each link generally uses the name of the series as its anchor text so that users can recognize it. Such a navigation site is commonly called an index page of network resource entities. After clicking the link of a TV series in the index page, the user enters the catalogue page of that series, which shows the playback entry for each episode. Entity names can therefore be obtained by extracting the anchor text of such links. That is, the content of certain network resource entity index pages is crawled, the anchor text it contains is extracted, and the crawled anchor text is collected as entity names of network resource entities. After a large number of entity names have been collected and processed by deduplication, noise reduction, and so on, the entity name set is established; it is used later to screen user data or to screen the addresses on a webpage access track. Specifically, the obtained entity names are matched against the titles of the webpages the user clicks through to and the titles of their referring pages. If the title of a clicked webpage or of its referring page contains an entity name, the address is considered related to a network resource entity and is retained; otherwise the address is filtered out.
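As a rough sketch of collecting entity names from anchor text, the example below uses Python's standard `html.parser` module. The text does not describe the noise-reduction rules, so the `noise` word list is a stand-in assumption, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Set


class AnchorTextCollector(HTMLParser):
    """Collects the text inside every <a>...</a> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_anchor = False
        self._buffer: List[str] = []
        self.anchor_texts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_anchor:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor = False
            self.anchor_texts.append("".join(self._buffer))


def entity_names_from_index(html: str,
                            noise: Iterable[str] = ("more", "next")) -> Set[str]:
    """Extract candidate entity names from an index page (assumed noise rules)."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    noise_words = {word.lower() for word in noise}
    names: Set[str] = set()  # a set removes duplicate names
    for text in parser.anchor_texts:
        cleaned = " ".join(text.split())  # collapse stray whitespace
        if cleaned and cleaned.lower() not in noise_words:
            names.add(cleaned)
    return names
```

In practice the noise filter would need to be tuned to the index pages actually crawled, since navigation links and advertisements also carry anchor text.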
Likewise, when the user visits entity resource webpages related to a network resource entity, the addresses of those entity resource webpages and of the corresponding referring pages can be recorded, and information such as the site to which each entity resource webpage belongs can also be obtained; this information can then be used to restore the process by which the user accessed the network resource entity. To obtain the process information about the entity resource webpages the user clicks through to, the addresses of the webpages clicked while browsing, and of the corresponding referring pages, are first obtained. These addresses are then filtered with the entity names and/or entity resource addresses obtained in advance, yielding those addresses, among the clicked addresses and the referring-page addresses, that match the entity names and/or the entity resource addresses.
Such as, the actual resource address related to Internet resources entity can in advance be obtained.The specific method for obtaining can be with
It is any one or a few the combination in the following manner:
(1) the HTML html tag code of the hyperlink in known navigation page, extracts entity money
Various webpages can typically be carried out taxonomic revision, accordingly, it is possible to root by source address in having the page of navigation feature due to some
According to the HTML of the navigation page of known particular category(Hypertext Markup Language, HTML)Label,
Extract the banner of particular category.For example, the address of the webpage for novel class, can be by capturing " http://
123.sogou.com/xiaoshuo/”(The novel class navigation page of search dog)Html in, label<a>Under all url conducts
The address of novel class.Certainly, for same category, can also be captured from multiple different navigation pages, and gone
Process again, finally give the address of novel class webpage;
(2) address comprising particular keywords is obtained as actual resource address from the web page storage folder of user;As used
The network address collected in the collection of family, its title contains " caricature " one word, then using the network address as an actual resource address;
The web page storage folder of user can include local collection, network profile;
(3) Judge whether a folder name in the user's bookmarks contains a specific keyword and, if so, extract the addresses in that folder as entity resource addresses. For example, if a folder name in the user's bookmarks contains the word "comic", the URLs saved under that folder are taken as entity resource addresses. The user's bookmarks may include local bookmarks, network bookmarks, and so on. When bookmarking pages in a browser, some users sort them by category, so the bookmarks usually contain several folders, each named after a category of webpage; the webpage addresses of a category can therefore also be extracted from the folders with specific names in the user's bookmarks. In a concrete implementation, the collection of webpage addresses of each category is usually done on the server side, whereas a user's bookmarks may be stored only locally on the user's terminal device; therefore, with the user's permission, the information in the user's bookmarks can be uploaded to the server for analysis. In practice, a user who wants the same bookmarks on different terminal devices may use a network bookmark function, which synchronizes and saves the bookmark information on the server side, so that when using the browser on different terminal devices the user can synchronize the bookmarks by signing in to his or her own account. When the user uses the network bookmark function, the server side can therefore obtain the information in each user's bookmarks directly and then extract and collect the webpage addresses of a specific category.
(4) Obtain the addresses of websites whose homepage title contains a specific keyword as entity resource addresses. For example, if the title of a website's homepage contains a keyword such as "comic" or "novel", the address of that website can be taken as an entity resource address. That is, if a website provides Internet resource entities and their catalogue pages, the title of its homepage will generally contain keywords for the specific kind of Internet resource entity, so the address of such a website can be obtained as an entity resource address.
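Ways (1) to (3) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the names `LinkCollector`, `links_in_navigation_page` and `addresses_from_bookmarks`, and the `(folder, title, url)` layout of a bookmark record, are hypothetical. Only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order, without repeats."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip anchors without an href and drop duplicates.
            if href and href not in self.links:
                self.links.append(href)


def links_in_navigation_page(html):
    """Way (1): take the URLs under <a> tags of a known navigation page."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def addresses_from_bookmarks(bookmarks, keywords):
    """Ways (2) and (3): keep bookmarks whose title or folder name has a keyword.

    `bookmarks` is an iterable of (folder, title, url) triples (assumed layout).
    """
    found = []
    for folder, title, url in bookmarks:
        hit = any(k in title or k in folder for k in keywords)
        if hit and url not in found:
            found.append(url)
    return found
```

Addresses gathered from several navigation pages or several users can be merged with the same duplicate check, which corresponds to the deduplication step mentioned for way (1).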
In practice, the addresses of the webpages the user opened, and the addresses of the corresponding referring pages (referers), can be filtered using only the obtained entity names or only the entity resource addresses; they can also be filtered by both the entity names and the entity resource addresses, to obtain a better filtering result and to reduce the amount of data processed in subsequent steps.
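The single and dual filtering described above might look like the following sketch. It assumes that each click record is a `(url, referer, title)` triple, that an entity name is matched against the page title, and that an entity resource address is matched by host name; these choices and the function name `filter_clicks` are illustrative assumptions, since the text does not fix the matching rule.

```python
from urllib.parse import urlparse


def filter_clicks(clicks, entity_names=(), resource_hosts=(), dual=False):
    """Keep the (url, referer, title) records that look entity-related.

    dual=False keeps a record matched by entity name OR resource address;
    dual=True keeps it only when BOTH match (the "dual filtering" case).
    """
    kept = []
    for url, referer, title in clicks:
        by_name = any(name in title for name in entity_names)
        by_addr = urlparse(url).netloc in resource_hosts
        keep = (by_name and by_addr) if dual else (by_name or by_addr)
        if keep:
            kept.append((url, referer, title))
    return kept
```

Dual filtering returns a subset of what either single filter returns, which is why it reduces the data passed to later steps.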
S120: Restore, according to the process information, the entity access track along which the user accessed a specific Internet resource entity;
After obtaining the process information of the entity resource webpages related to Internet resource entities that the user opened while browsing, the entity access track along which the user accessed a specific Internet resource entity can be restored from that process information. The process information can include the addresses of the entity resource webpages, which shows which of the webpages the user opened are related to an entity; however, these related webpages are usually the pages holding a particular episode or chapter of the Internet resource entity. For example, after opening the catalogue page A of some entity, the user clicks the first episode of a series and browses its content in the opened webpage B; webpage B is then also a webpage related to the Internet resource entity. What needs to be done next is therefore to determine, from these related webpages, which webpages may be the catalogue page of the Internet resource entity. To this end, in the embodiments of the present invention, the entity access track along which the user accessed the specific Internet resource entity is first restored, that is, it is determined how the user clicked through to each webpage related to the Internet resource entity, and the address of the starting-point webpage on the track is determined; the webpage at that starting-point address may be the catalogue page of the Internet resource entity. Since the data of a single user may be affected by chance, the data of other users can be aggregated for a final judgment, so a catalogue page obtained from one user is treated for the time being as a candidate catalogue page of the Internet resource entity.
The entity access track can be expressed as a set of tuples, for example as a set of two-tuples:
{(url_1, refer_1), (url_2, refer_2), …, (url_i, refer_i), …}
where url denotes the address of an entity resource webpage and refer denotes the address of the referring page from which that entity resource webpage was opened; the subscripts of url and refer do not indicate the order in which the webpages were visited. In the embodiments of the present invention, to make it easier to obtain the catalogue page of an Internet resource entity and to recognize the catalogue pages of different Internet resource entities on different websites, the entity access track can be expressed as a set of four-tuples. When restoring the entity access track along which the user accessed a specific Internet resource entity, each discrete point is first represented as a four-tuple, for example (url, refer, entity, site), where url denotes the address of the entity resource webpage; refer denotes the address of the referring page from which the entity resource webpage was opened; entity denotes the identifier of the Internet resource entity, which can be its name, such as the title of a series, a comic, or a novel; and site denotes the website to which the entity resource webpage belongs. For example, for a webpage B visited by a user, if the title of webpage B contains an entity name M and B was reached by the user clicking a link in webpage A, then in the four-tuple of node B, url is the address of webpage B itself, refer is the address of webpage A, entity can be represented by the entity name M, such as "One Piece" or "Naruto", and site is the website to which webpage B belongs.
Once every node is represented as a four-tuple, the entity resource webpages can be divided into multiple subsets according to the Internet resource entity each webpage corresponds to and the website it belongs to, so that each subset contains the entity resource webpages related to the same Internet resource entity on the same website. Within one subset, the entity access track along which the user accessed that Internet resource entity on that website is restored from the address of each entity resource webpage and the address of its referring page. The entity access track of a user accessing the same Internet resource entity on the same website can be expressed as:
Set(same(entity, site)){(url_1, refer_1), (url_2, refer_2), …, (url_i, refer_i), …}
The subscripts of url and refer above carry no order information: the order of the subscripts does not represent the order in which the user visited the entity resource webpages, so the points represented at this stage may still be discrete.
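The four-tuple representation and the division into Set(same(entity, site)) subsets can be illustrated with the short Python sketch below. The names `Visit`, `make_visit` and `split_into_subsets` are hypothetical, and taking the host part of the URL as the site is an assumption made for the example.

```python
from collections import defaultdict
from typing import NamedTuple
from urllib.parse import urlparse


class Visit(NamedTuple):
    """One discrete point of the track: (url, refer, entity, site)."""
    url: str
    refer: str
    entity: str
    site: str


def make_visit(url, refer, entity):
    # Assumption: the site is identified by the host name of the page URL.
    return Visit(url, refer, entity, urlparse(url).netloc)


def split_into_subsets(visits):
    """Group visits by (entity, site); each subset holds (url, refer) pairs."""
    subsets = defaultdict(list)
    for v in visits:
        subsets[(v.entity, v.site)].append((v.url, v.refer))
    return dict(subsets)
```

Each value of the returned dictionary is one unordered set of two-tuples for a single entity on a single website, matching the expression given above.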
When dividing the entity resource webpages into subsets by Internet resource entity and website, the titles of the entity resource webpages can be matched against the previously obtained entity names using longest matching, and the webpages divided into subsets according to the matching result. For example, take two entity names, "by liking" and "love is carried through to the end", where the characters of the first are contained in the second in the original text. Without longest matching, matching on "by liking" would hit pages of both entities, which does not reflect the actual situation; matching the titles of the entity resource webpages by longest matching cleanly separates entity names that contain one another.
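Longest matching can be sketched in a few lines of Python. The function name `longest_match` and the sample names in the usage note are hypothetical; substring containment is assumed as the matching test.

```python
def longest_match(title, entity_names):
    """Return the longest entity name contained in the title, or None."""
    best = None
    for name in entity_names:
        if name in title and (best is None or len(name) > len(best)):
            best = name
    return best
```

With the hypothetical names "Area" and "Area D", the title "Area D chapter 3" contains both, but only the longer name "Area D" is returned, so the page is assigned to a single entity.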
S130: Obtain the starting-point webpage address on the entity access track and, according to the starting-point webpage address on the entity access track, determine the catalogue page of the specific Internet resource entity.
To obtain the starting-point webpage address on the entity access track and determine the catalogue page of the specific Internet resource entity from it, the following processing can be performed:
On one entity access track, compare the referring-page address of a target entity resource webpage with the addresses of the other entity resource webpages on the track; if the referring-page address of the target entity resource webpage is identical to the address of any other entity resource webpage, the target entity resource webpage is determined to be a non-starting-point webpage on the entity access track and is deleted from the track;
Repeat the previous step until no entity resource webpage on the track has a referring-page address identical to the address of another entity resource webpage;
The referring pages of the entity resource webpages remaining on the entity access track are determined to be the starting-point webpages on the track. This process is described in detail below.
On one entity access track, whenever url_i = refer_j for any two two-tuples, the two-tuple containing refer_j is deleted. Suppose the webpages related to some entity are A, B, C and D, where A was opened from a search results page X, B from a link in webpage A, C from a link in webpage B, and D from a link in webpage C. B, C and D all contain the same entity name and belong to the same website, giving (url_1 = B, refer_1 = A), (url_2 = C, refer_2 = B), (url_3 = D, refer_3 = C). Since url_1 = refer_2 and url_2 = refer_3, B, C and D are webpages on the same entity access track, and the track along which the user accessed the Internet resource entity can be restored as A->B->C->D. The points C and D, whose source pages are refer_2 and refer_3, can be deleted; webpage A, the refer of the retained point B, then serves as the starting-point webpage on the entity access track and as a candidate catalogue page, and so on. In this way, for one user, several candidate catalogue pages may be extracted from the webpages related to the same entity on the same website.
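The deletion rule can be illustrated with the following Python sketch. It is one possible reading, not the definitive procedure: instead of deleting pairs one at a time, it collects every url of the subset up front and keeps only the pairs whose refer is not among them, so the result does not depend on the order in which pairs are examined. The function name `start_pages` is hypothetical.

```python
def start_pages(pairs):
    """Return the refer of every (url, refer) pair that survives deletion.

    A pair is deleted when its refer equals the url of a pair in the same
    subset, because the page was then reached from another entity page.
    """
    urls = {url for url, _ in pairs}
    return [refer for url, refer in pairs if refer not in urls]
```

For the pairs (B, A), (C, B), (D, C) of the example, the pairs for C and D are dropped because their refer values B and C are themselves url values, and A is returned as the only starting-point webpage.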
That is, for one user, the webpages visited make up a complete set I. After the set M of webpages related to Internet resource entities is filtered out of I, the webpages in M can be divided by website into subsets M_i (i = 1, 2, 3, …, up to the total number of websites in M), and each subset M_i can be further divided by entity into subsets M_ij (j = 1, 2, 3, …, up to the number of entities in M_i). Each subset M_ij thus contains the webpages related to one Internet resource entity on one website. Within each M_ij, every webpage is represented as a two-tuple as described above; if the source page refer of one two-tuple is identical to the url of another two-tuple, the webpage corresponding to that refer is not the starting-point webpage on the entity access track, so that two-tuple can be deleted. In the end a subset of the two-tuples remains in each M_ij, and the refer of each remaining two-tuple can serve as a candidate catalogue page of the corresponding Internet resource entity on the corresponding website.
Of course, in practice there are other ways of restoring the entity access path, for example by using the access time of each webpage.
Through the above steps, the candidate catalogue pages of each Internet resource entity that a user accessed can be collected for each website; likewise, for other users, the candidate catalogue pages of each Internet resource entity they accessed can be collected for each website. The candidate catalogue pages collected from these users can then be aggregated to finally determine the catalogue page of the Internet resource entity. Specifically, when aggregating, the multiple starting-point webpages on two or more entity access tracks corresponding to the same Internet resource entity on the same website are obtained, and each user's candidate catalogue pages for that Internet resource entity on that website are represented as the following group of two-tuples:
Set(same(entity, site)){user_1(url_i, refer_i), user_2(url_i, refer_i), user_3(url_i, refer_i), …, user_n(url_i, refer_i)}
The number of occurrences of each starting-point webpage among these starting-point webpages is then counted, and the starting-point webpage with the most occurrences, or whose share of the total count meets a preset condition, is determined to be the catalogue page of the Internet resource entity on that website. For example, a candidate catalogue page whose occurrences account for more than 50% of the total is taken as the catalogue page of the Internet resource entity on that website.
For example, suppose Table 1 below shows the webpages a user visited while using the browser:
Table 1
url | Refer |
http://www.dm5.com/m129008/ | http://www.dm5.com/manhua-area-d-yinenglingyu/ |
http://www.dm5.com/m129008-p2/ | http://www.dm5.com/m129008/ |
http://news.baidu.com/ | http://www.baidu.com/ |
http://www.dm5.com/m129008-p3/ | http://www.dm5.com/m129008-p2/ |
http://www.dm5.com/m129008-p4/ | http://www.dm5.com/m129008-p3/ |
…… | …… |
Here, http://news.baidu.com/ and http://www.baidu.com/ are webpages unrelated to any Internet resource entity and are therefore deleted from the set of webpages. Among the remaining webpages, http://www.dm5.com/m129008/, http://www.dm5.com/m129008-p2/, http://www.dm5.com/m129008-p3/ and http://www.dm5.com/m129008-p4/ are webpages related to accessing the comic entity "Area D". The path along which the entity was accessed can be restored as:
http://www.dm5.com/manhua-area-d-yinenglingyu/
->http://www.dm5.com/m129008/
->http://www.dm5.com/m129008-p2/
->http://www.dm5.com/m129008-p3/
……
The starting-point webpage on this path, "http://www.dm5.com/manhua-area-d-yinenglingyu/", can then be taken as a candidate catalogue page. The sets of webpages visited by other users can be processed in the same way, so that for one Internet resource entity the candidate catalogue pages on each website are collected from each user's set of webpages. For example, from each user's access path to the comic entity "Area D", the candidate catalogue pages of the comic are obtained as shown in Table 2:
Table 2
http://www.dm5.com/m115189-p5/ |
http://www.dm5.com/m115725/ |
http://www.dm5.com/manhua-area-d-yinenglingyu/ |
http://www.dm5.com/manhua-area-d-yinenglingyu/ |
http://www.dm5.com/manhua-area-d-yinenglingyu/ |
http://www.dm5.com/manhua-area-d-yinenglingyu/ |
http://www.dm5.com/manhua-area-d-yinenglingyu/ |
http://www.dm5.com/manhua-area-d-yinenglingyu/ |
http://www.dm5.com/manhua-area-d-yinenglingyu/ |
http://www.dm5.com/manhua-area-d-yinenglingyu/ |
http://www.imanhua.com/comic/3401/ |
http://www.imanhua.com/comic/3401/ |
http://www.imanhua.com/comic/3401/ |
http://www.imanhua.com/comic/3401/ |
http://www.imanhua.com/comic/3401/ |
http://www.imanhua.com/comic/3401/ |
http://www.imanhua.com/comic/3401/list_66230.html |
http://www.imanhua.com/comic/3401/list_66401.html?p=17 |
It can be seen that several candidate catalogue pages were collected for this comic entity on each of the two websites www.dm5.com and www.imanhua.com. On www.dm5.com, http://www.dm5.com/manhua-area-d-yinenglingyu/ occurs 8 times out of a total of 10 candidate catalogue pages on that website, so it can be determined to be the catalogue page of the comic entity on www.dm5.com. Similarly, http://www.imanhua.com/comic/3401/ can be determined to be the catalogue page of the comic entity on www.imanhua.com.
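The counting step can be reproduced on the Table 2 data with the Python sketch below. The function name `pick_catalogue_pages`, the use of the URL host as the website, and the strict "more than 50%" threshold (taken from the example condition mentioned earlier) are assumptions for illustration; the text allows other preset conditions.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

DM5 = "http://www.dm5.com/"
IMH = "http://www.imanhua.com/comic/3401/"

# The 18 candidate catalogue pages listed in Table 2.
TABLE_2 = (
    [DM5 + "m115189-p5/", DM5 + "m115725/"]
    + [DM5 + "manhua-area-d-yinenglingyu/"] * 8
    + [IMH] * 6
    + [IMH + "list_66230.html", IMH + "list_66401.html?p=17"]
)


def pick_catalogue_pages(candidates, min_share=0.5):
    """Per website, pick the most frequent candidate if its share exceeds min_share."""
    by_site = defaultdict(Counter)
    for url in candidates:
        by_site[urlparse(url).netloc][url] += 1
    result = {}
    for site, counts in by_site.items():
        url, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) > min_share:
            result[site] = url
    return result
```

On this data the shares are 8 of 10 for www.dm5.com and 6 of 8 for www.imanhua.com, so both websites yield a catalogue page under the 50% condition.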
In addition, to further optimize the result, the candidate catalogue pages obtained from each user's data can first be filtered. That is, after the multiple starting-point webpages are obtained, it is judged whether each of them is related to the same Internet resource entity on the same website, and unrelated starting-point webpages are filtered out. In a concrete implementation, it can be judged whether a candidate catalogue page is the address of a search website, the address of a website homepage, or the like; if so, the candidate catalogue page is determined to be unrelated to the specific Internet resource entity and is filtered out.
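A minimal sketch of this filter follows. How a search page or a homepage is recognized is not specified in the text, so both tests here are assumptions: a homepage is taken to be a URL with an empty or "/" path and no query string, and search pages are recognized by a hypothetical host list `SEARCH_HOSTS`.

```python
from urllib.parse import urlparse

# Hypothetical list of search-engine hosts; a real list would be configured.
SEARCH_HOSTS = {"www.baidu.com", "www.sogou.com"}


def is_homepage(url):
    """Assumed test: no path beyond "/" and no query string."""
    parts = urlparse(url)
    return parts.path in ("", "/") and not parts.query


def is_search_page(url, search_hosts=SEARCH_HOSTS):
    """Assumed test: the page is served by a known search host."""
    return urlparse(url).netloc in search_hosts


def filter_candidates(candidates):
    """Drop candidates that are website homepages or search-site pages."""
    return [u for u in candidates
            if not is_homepage(u) and not is_search_page(u)]
```

Applying the filter before counting keeps search results pages and homepages, which are common entry points for many entities, from being counted as catalogue pages.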
In summary, in the embodiments of the present invention, the process information of the entity resource webpages related to Internet resource entities that a user opened while browsing is obtained; from this process information the entity access track along which the user accessed a specific Internet resource entity is restored and the starting-point webpage address of the track is determined; and finally the catalogue page of the Internet resource entity is determined from this starting-point webpage address. The embodiments thus provide a practicable way for a program to automatically identify the catalogue page of an Internet resource entity from among many webpages. The identification does not depend on text features of the catalogue page, identifies catalogue pages efficiently and accurately, and scales well.
Corresponding to the method for identifying the catalogue page of an Internet resource entity provided by the embodiments of the present invention, the embodiments also provide a device for identifying the catalogue page of an Internet resource entity. Referring to Fig. 2, the device can include:
A process information acquiring unit 210, for obtaining the process information of the entity resource webpages related to Internet resource entities that a user opened while browsing;
An access track restoring unit 220, for restoring, according to the process information, the entity access track along which the user accessed a specific Internet resource entity;
A catalogue page acquiring unit 230, for obtaining the starting-point webpage address on the entity access track and determining, according to the starting-point webpage address on the entity access track, the catalogue page of the specific Internet resource entity.
The process information of accessing the entity resource webpages related to Internet resource entities can include the website to which an entity resource webpage belongs, the address of the entity resource webpage, and the address of the referring page from which the entity resource webpage was opened;
In this implementation, the access track restoring unit 220 can include:
A subset dividing subunit, for dividing the entity resource webpages into multiple subsets according to the Internet resource entity each webpage corresponds to and the website it belongs to, where each subset contains the entity resource webpages related to the same Internet resource entity on the same website;
An access track restoring subunit, for restoring, within one subset and according to the address of each entity resource webpage and the address of its referring page, the entity access track along which the user accessed the corresponding Internet resource entity on the corresponding website;
The catalogue page acquiring unit 230 can include:
A comparing and deleting subunit, for comparing, on one entity access track, the referring-page address of a target entity resource webpage with the addresses of the other entity resource webpages on the track and, if the referring-page address of the target entity resource webpage is identical to the address of any other entity resource webpage, determining the target entity resource webpage to be a non-starting-point webpage on the entity access track and deleting it from the track;
A loop control subunit, for controlling the comparing and deleting subunit to repeat its operation until no entity resource webpage on the track has a referring-page address identical to the address of another entity resource webpage;
A starting-point webpage determining subunit, for determining the referring pages of the entity resource webpages remaining on the entity access track to be the starting-point webpages on the track.
In another implementation, the subset dividing subunit can specifically be used for:
Matching the titles of the entity resource webpages against the previously obtained entity names of Internet resource entities using longest matching, and dividing the entity resource webpages into multiple subsets according to the matching result.
The catalogue page acquiring unit 230 can specifically be used for:
Obtaining the multiple starting-point webpages on two or more entity access tracks corresponding to the same Internet resource entity on the same website;
Counting the number of occurrences of each starting-point webpage among the multiple starting-point webpages, and determining the starting-point webpage whose number of occurrences meets a preset condition to be the catalogue page of the corresponding specific Internet resource entity on the corresponding website.
In this implementation, the device for identifying the catalogue page of an Internet resource entity can also include:
A starting-point webpage filtering unit, for judging, after the multiple starting-point webpages are obtained, whether they are related to the same Internet resource entity on the same website, and filtering out unrelated starting-point webpages.
In addition, the process information acquiring unit 210 can include:
A user click address acquiring subunit, for obtaining the addresses of the webpages the user opened while browsing, and the addresses of the referring pages corresponding to the opened webpages;
A user click address filtering subunit, for filtering the addresses of the webpages the user opened, and the addresses of the referring pages, using the previously obtained entity names and/or entity resource addresses, so as to obtain, from the addresses the user opened and the addresses of the referring pages, the addresses that match the entity names and/or the entity resource addresses.
In this implementation, the device can also include an entity resource address acquiring unit, for obtaining the entity resource addresses in the following ways:
Extracting the entity resource addresses from the HTML tag code of the hyperlinks in known navigation pages;
And/or,
Obtaining addresses containing specific keywords from the user's bookmarks as entity resource addresses;
And/or,
Judging whether a folder name in the user's bookmarks contains a specific keyword and, if so, extracting the addresses in that folder as entity resource addresses;
And/or,
Obtaining the addresses of websites whose homepage title contains a specific keyword as entity resource addresses.
In addition, in another implementation, the device for identifying the catalogue page of an Internet resource entity can also include an entity name acquiring unit, for obtaining the entity names in the following way:
Crawling the anchor text of the hyperlinks in known Internet resource entity index pages;
Performing noise-reduction filtering on the anchor text and extracting the entity names from the anchor text.
In the above device of the embodiments of the present invention, the process information of the entity resource webpages related to Internet resource entities that a user opened while browsing is obtained; from this process information the entity access track along which the user accessed a specific Internet resource entity is restored and the starting-point webpage address of the track is determined; and finally the catalogue page of the Internet resource entity is determined from this starting-point webpage address. The embodiments thus provide a practicable way for a program to automatically identify the catalogue page of an Internet resource entity from among many webpages. The identification does not depend on text features of the catalogue page, so the catalogue page of an Internet resource entity can be identified efficiently and accurately.
From the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The embodiments in this specification are described progressively; identical or similar parts of the embodiments can be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device or system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant details, refer to the description of the method embodiments. The device and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The method and device for identifying the catalogue page of an Internet resource entity provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. Those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (14)
1. A method for identifying the catalogue page of an Internet resource entity, characterized by comprising:
Obtaining the process information of the entity resource webpages related to Internet resource entities that a user opened while browsing;
Restoring, according to the process information, the entity access track along which the user accessed a specific Internet resource entity;
Obtaining the starting-point webpage address on the entity access track and determining, according to the starting-point webpage address on the entity access track, the catalogue page of the specific Internet resource entity, which specifically comprises:
Obtaining the multiple starting-point webpages on two or more entity access tracks corresponding to the same Internet resource entity on the same website;
Counting the number of occurrences of each starting-point webpage among the multiple starting-point webpages, and determining the starting-point webpage whose number of occurrences meets a preset condition to be the catalogue page of the corresponding specific Internet resource entity on the corresponding website.
2. The method according to claim 1, characterized in that the process information includes the website to which the entity resource webpage belongs, the address of the entity resource webpage, and the address of the referring page from which the entity resource webpage was opened;
The restoring, according to the process information, of the entity access track along which the user accessed a specific Internet resource entity comprises:
Dividing the entity resource webpages into multiple subsets according to the Internet resource entity each entity resource webpage corresponds to and the website it belongs to, where each subset contains the entity resource webpages related to the same Internet resource entity on the same website;
Within one subset, restoring, according to the address of each entity resource webpage and the address of its referring page, the entity access track along which the user accessed the corresponding Internet resource entity on the corresponding website;
The obtaining of the starting-point webpage address on the entity access track comprises:
On one entity access track, comparing the referring-page address of a target entity resource webpage with the addresses of the other entity resource webpages on the track and, if the referring-page address of the target entity resource webpage is identical to the address of any other entity resource webpage, determining the target entity resource webpage to be a non-starting-point webpage on the entity access track and deleting it from the track;
Repeating the previous step until no entity resource webpage on the track has a referring-page address identical to the address of another entity resource webpage;
Determining the referring pages of the entity resource webpages remaining on the entity access track to be the starting-point webpages on the track.
3. The method according to claim 2, characterized in that the dividing of the entity resource webpages into multiple subsets according to the Internet resource entity each entity resource webpage corresponds to and the website it belongs to comprises:
Matching the titles of the entity resource webpages against the previously obtained entity names of Internet resource entities using longest matching, and dividing the entity resource webpages into multiple subsets according to the matching result.
4. The method according to claim 1, characterized by further comprising:
After the multiple starting-point webpages are obtained, judging whether the multiple starting-point webpages are related to the same Internet resource entity on the same website, and filtering out unrelated starting-point webpages.
5. The method according to any one of claims 1-4, characterized in that obtaining the procedural information of the actual resource webpages, related to network resource entities,
that are clicked through to while the user browses webpages comprises:
obtaining the addresses of the webpages clicked through to while the user browses webpages, and the addresses of the referrer pages corresponding to those webpages;
filtering the addresses of the clicked webpages and the addresses of the referrer pages, using entity names and/or actual resource addresses obtained in advance,
to obtain, from among the clicked addresses and the referrer page addresses, the addresses that match the entity names and/or
the actual resource addresses.
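The filtering step in claim 5 might look like the Python sketch below. The claim does not define what it means for an address to match, so the matching rule here is an assumption: an address matches when it starts with a known actual resource address or contains a known entity name as a substring. The function name `filter_clicks` and the `(url, referrer)` record shape are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def filter_clicks(
    clicks: Iterable[Tuple[str, str]],
    entity_names: Sequence[str] = (),
    resource_addresses: Sequence[str] = (),
) -> List[Tuple[str, str]]:
    """Keep click records whose page or referrer address matches known data."""

    def matches(address: str) -> bool:
        # Assumed rule: prefix match on resource addresses,
        # substring match on entity names.
        return any(address.startswith(prefix) for prefix in resource_addresses) or any(
            name in address for name in entity_names
        )

    return [(url, referrer) for url, referrer in clicks if matches(url) or matches(referrer)]
```

A record is kept when either of its two addresses matches, which is one possible reading of the claim; a stricter variant could require both addresses to match.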
6. The method according to claim 5, characterized by further comprising obtaining the actual resource addresses in advance
in the following ways:
extracting the actual resource addresses from the HyperText Markup Language (HTML) tag code of the hyperlinks
in a known navigation page;
and/or,
obtaining, from the user's web page favorites folder, addresses containing particular keywords as the actual resource addresses;
and/or,
judging whether a directory name in the user's web page favorites contains particular keywords and, if so, extracting the addresses in that directory
as the actual resource addresses;
and/or,
obtaining, as the actual resource addresses, the addresses of websites whose home page titles contain particular keywords.
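The first option in claim 6, pulling addresses out of the hyperlink tags of a known navigation page, can be illustrated with Python's standard `html.parser` module. This is only a sketch: the class name `LinkExtractor` and the function `extract_resource_addresses` are hypothetical, and the sketch assumes the navigation page's HTML has already been downloaded into a string.

```python
from html.parser import HTMLParser
from typing import List


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Skip anchors without a usable href attribute.
                if name == "href" and value:
                    self.links.append(value)


def extract_resource_addresses(navigation_html: str) -> List[str]:
    """Return the hyperlink addresses found in a navigation page's HTML."""
    parser = LinkExtractor()
    parser.feed(navigation_html)
    parser.close()
    return parser.links
```

The other options in the claim (favorites entries, favorites directory names, home page titles) reduce to keyword checks on strings and are not shown.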
7. The method according to claim 5, characterized in that the entity names are obtained in advance in the following way:
crawling the anchor text of the hyperlinks in a known network resource entity index page;
performing noise-reduction filtering on the anchor text, and extracting the entity names from the anchor text.
8. A device for identifying the catalogue page of a network resource entity, characterized by comprising:
a procedural information acquiring unit, configured to obtain the procedural information of the actual resource webpages, related to network resource entities,
that are clicked through to while the user browses webpages;
an access track restoration unit, configured to restore, according to the procedural information, the entity access track along which the user accessed
a particular network resource entity;
a catalogue page acquiring unit, configured to obtain the starting point web page address on the entity access track and to determine, according to the
starting point web page address on the entity access track, the catalogue page of the particular network resource entity, and specifically configured to:
obtain the multiple starting point webpages on two or more entity access tracks corresponding to the same network resource entity of the same website;
count the number of occurrences of each starting point webpage among the multiple starting point webpages, and determine a starting point webpage whose number of occurrences
meets a preset condition to be the catalogue page of the corresponding particular network resource entity in the corresponding website.
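The counting step at the end of claim 8 can be sketched with `collections.Counter`. The claim only says that the number of occurrences must meet a preset condition, so the condition used here, a minimum count given by the hypothetical parameter `min_count`, is an assumption, as is the function name `pick_catalogue_pages`.

```python
from collections import Counter
from typing import Iterable, List


def pick_catalogue_pages(start_pages: Iterable[str], min_count: int = 2) -> List[str]:
    """Return starting point pages seen at least min_count times, most frequent first."""
    counts = Counter(start_pages)
    return [page for page, seen in counts.most_common() if seen >= min_count]
```

The input is expected to hold the starting point webpages collected from all access tracks of one network resource entity on one website; a page that starts many independent tracks is a likely catalogue page, while a page that appears once is more likely an incidental entry point.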
9. The device according to claim 8, characterized in that the procedural information includes the website to which each actual resource webpage belongs,
the address of the actual resource webpage, and the address of the referrer page from which the actual resource webpage was clicked through to;
the access track restoration unit comprises:
a subset division subunit, configured to divide the actual resource webpages into multiple subsets according to the network resource entity corresponding to each actual resource webpage
and the website to which it belongs, wherein each subset contains multiple actual resource webpages under the same website that are related to the same
network resource entity;
an access track restoration subunit, configured to restore, within the same subset and according to the address of each actual resource webpage and the address of its
referrer page, the entity access track along which the user accessed the corresponding network resource entity under the corresponding website;
the catalogue page acquiring unit comprises:
a comparison and deletion subunit, configured to compare, for an entity access track, the referrer page address corresponding to a target actual resource webpage
with the addresses of the other actual resource webpages on the entity access track and, if the referrer page address corresponding to the target actual resource webpage
is identical to the address of any other actual resource webpage, to determine the target actual resource webpage to be
a non-starting-point webpage on the entity access track and to delete it from the access track;
a loop control subunit, configured to control the comparison and deletion subunit to repeat until no actual resource webpage on the entity access track
has a referrer page address that is identical to the address of another actual resource webpage on the track;
a starting point webpage determination subunit, configured to determine the referrer pages corresponding to the actual resource webpages remaining on the entity access track
to be the starting point webpages on the entity access track.
10. The device according to claim 9, characterized in that the subset division subunit is specifically configured to:
match the titles of the actual resource webpages against the entity names of network resource entities obtained in advance, using a longest-match method,
and divide the actual resource webpages into multiple subsets according to the matching result.
11. The device according to claim 8, characterized by further comprising:
a starting point webpage filtering unit, configured to judge, after the multiple starting point webpages are obtained, whether each of the multiple starting point webpages
is related to the same network resource entity of the same website, and to filter out the unrelated starting point webpages.
12. The device according to any one of claims 8-11, characterized in that the procedural information acquiring unit comprises:
a user click address acquisition subunit, configured to obtain the addresses of the webpages clicked through to while the user browses webpages, and
the addresses of the referrer pages corresponding to those webpages;
a user click address filtering subunit, configured to filter the addresses of the clicked webpages and the addresses of the referrer pages,
using entity names and/or actual resource addresses obtained in advance, to obtain, from among the clicked addresses and the referrer page
addresses, the addresses that match the entity names and/or the actual resource addresses.
13. The device according to claim 12, characterized by further comprising an actual resource address acquisition unit, configured to
obtain the actual resource addresses in the following ways:
extracting the actual resource addresses from the HyperText Markup Language (HTML) tag code of the hyperlinks
in a known navigation page;
and/or,
obtaining, from the user's web page favorites folder, addresses containing particular keywords as the actual resource addresses;
and/or,
judging whether a directory name in the user's web page favorites contains particular keywords and, if so, extracting the addresses in that directory
as the actual resource addresses;
and/or,
obtaining, as the actual resource addresses, the addresses of websites whose home page titles contain particular keywords.
14. The device according to claim 12, characterized by further comprising an entity name acquiring unit, configured to
obtain the entity names in the following way:
crawling the anchor text of the hyperlinks in a known network resource entity index page;
performing noise-reduction filtering on the anchor text, and extracting the entity names from the anchor text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310589670.7A CN103605742B (en) | 2013-11-20 | 2013-11-20 | Recognize the method and device of Internet resources entity catalogue page |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310589670.7A CN103605742B (en) | 2013-11-20 | 2013-11-20 | Recognize the method and device of Internet resources entity catalogue page |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103605742A CN103605742A (en) | 2014-02-26 |
CN103605742B true CN103605742B (en) | 2017-07-04 |
Family
ID=50123964
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310589670.7A Active CN103605742B (en) | 2013-11-20 | 2013-11-20 | Recognize the method and device of Internet resources entity catalogue page |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103605742B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106611008B (en) * | 2015-10-26 | 2020-06-12 | 中国移动通信集团公司 | Internet content label management method and device |
CN106897196B (en) * | 2015-12-17 | 2019-10-25 | 北京国双科技有限公司 | The determination method and device of access path between Website page |
CN110020064A (en) * | 2017-07-19 | 2019-07-16 | 北京国双科技有限公司 | The crawling method and device of webpage |
CN111177619B (en) * | 2019-12-19 | 2022-09-09 | 山石网科通信技术股份有限公司 | Webpage identification method and device, storage medium and processor |
CN111901450B (en) * | 2020-07-15 | 2023-04-18 | 安徽淘云科技股份有限公司 | Entity address determination method, device, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101329687A (en) * | 2008-07-31 | 2008-12-24 | 清华大学 | Method for positioning news web page |
CN101996193A (en) * | 2009-08-21 | 2011-03-30 | 北京搜狗科技发展有限公司 | Processing method and system for expressing network resource link and internet terminal |
CN103268352A (en) * | 2013-06-03 | 2013-08-28 | 贝壳网际(北京)安全技术有限公司 | Label page display method and device and browser device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7774366B2 (en) * | 2005-03-08 | 2010-08-10 | Salesforce.Com, Inc. | Systems and methods for implementing multi-application tabs and tab sets |
Also Published As
Publication number | Publication date |
---|---|
CN103605742A (en) | 2014-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103914478B (en) | Webpage training method and system, webpage Forecasting Methodology and system | |
CN102831199B (en) | Method and device for establishing interest model | |
CN102622445B (en) | User interest perception based webpage push system and webpage push method | |
CN103605742B (en) | Recognize the method and device of Internet resources entity catalogue page | |
CN102831248B (en) | Network focus method for digging and device | |
CN102354315B (en) | Generation method of site navigation page and device thereof | |
JP4489994B2 (en) | Topic extraction apparatus, method, program, and recording medium for recording the program | |
CN103218431B (en) | A kind ofly can identify the system that info web gathers automatically | |
CN101369276B (en) | Evidence obtaining method for Web browser caching data | |
CN101542482B (en) | Bookmarks and ranking | |
CN102930059A (en) | Method for designing focused crawler | |
CN102521251A (en) | Method for directly realizing personalized search, device for realizing method, and search server | |
CN101996193A (en) | Processing method and system for expressing network resource link and internet terminal | |
CN103294692B (en) | A kind of information recommendation method and system | |
CN104063454A (en) | Search push method and device for mining user demands | |
CN103064880B (en) | A kind of methods, devices and systems providing a user with website selection based on search information | |
CN103279567A (en) | Web data collection method and system both based on AJAX (asynchronous javascript and extensible markup language) | |
CN109242553A (en) | A kind of user behavior data recommended method, server and computer-readable medium | |
CN102270331A (en) | Network shopping navigating method based on visual search | |
CN103530364B (en) | The method and system of download link are provided | |
CN101630330A (en) | Method for webpage classification | |
CN106446115A (en) | Mobile Internet user classification method and device | |
CN103177036A (en) | Method and system for label automatic extraction | |
JP2003076715A (en) | Method and system for retrieving web pages, program and recording medium | |
CN102811207A (en) | Network information pushing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||