CN103778217A

CN103778217A - Method and system for recommendation based on a current web page list

Info

Publication number: CN103778217A
Application number: CN201410024821.9A
Authority: CN
Inventors: 崔晶晶; 林佳婕; 吴鹏; 马占国; 李春华; 刘立娜
Original assignee: BEIJING GEO POLYMERIZATION TECHNOLOGY Co Ltd
Current assignee: BEIJING GEO POLYMERIZATION TECHNOLOGY Co Ltd
Priority date: 2014-01-20
Filing date: 2014-01-20
Publication date: 2014-05-07

Abstract

The invention relates to the technical field of internet applications and provides a method and system for recommendation based on a current web page list. The method comprises using a Bloom Filter algorithm to identify an acquired URL (uniform resource locator) and judging whether the URL already exists in the list of web pages collected so far. With this method and system, the address information of web pages to be collected can be obtained efficiently and accurately, web pages to be collected and web pages already collected can be distinguished in real time, resources are fully used, and a service is provided for precise content recommendation and advertisement placement.

Description

Method and system for recommendation based on a current web page list

Technical field

The present invention relates to the technical field of internet applications, and in particular to a method and system for recommendation based on a current web page list.

Background art

In the field of web content recommendation and advertisement placement, web information must be analyzed, and this involves analyzing a massive number of URLs. To avoid collecting the same URL repeatedly, URLs that have already been collected must be distinguished from those that have not. Because the number of URLs is huge, making this distinction costs a certain amount of space and time. The prior art can deduplicate URLs effectively, and the techniques currently in use are the following:

1. Storage and deduplication using a linked list or tree:

URLs are stored in a linked list or tree and deduplicated by comparison. This scheme is fairly simple to implement: search and comparison functions are used to judge whether a URL is a duplicate.

2. Storage and deduplication using a HashTable:

A suitable Hash function is chosen, and the plaintext URL is mapped by the Hash function to a point in a bit array, so that it can be judged quickly whether a given URL has been crawled.

However, the prior art has the following defects:

With linked-list or tree storage, reasonable time and space efficiency can be maintained while the number of URLs is small. As the number of URLs keeps growing, however, both the retrieval time and the space efficiency degrade (a lookup costs O(n) in a linked list and O(log n) in a tree). Given the massive number of web pages on the internet today, the time and space efficiency of storing and retrieving URLs in this way cannot meet production needs.

With HashTable storage, time efficiency stays at O(1), but hashing produces collisions, and lowering the collision rate requires limiting the number of elements the HashTable may hold. Suppose the Hash function is good and the bit array has m positions: to reduce the collision rate to, for example, 1%, the HashTable can hold only m/100 elements, which clearly reduces space efficiency.
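The m/100 figure can be checked with a short calculation. The derivation below is an illustrative sketch, not part of the original text; it assumes an ideal, uniformly distributed Hash function, and the symbol n (the number of URLs already stored) is introduced here for the purpose.

```latex
% Probability that a new, unseen URL hashes to an already occupied position
% (n distinct URLs stored, m positions, ideal uniform hash):
P_{\text{collision}} = 1 - \left(1 - \frac{1}{m}\right)^{n} \approx \frac{n}{m}
\qquad (n \ll m)

% Requiring a collision rate of at most 1% gives
\frac{n}{m} \le 0.01 \;\Longrightarrow\; n \le \frac{m}{100}
```

Under these assumptions, 99 of every 100 positions must stay empty to keep the collision rate at 1%, which is the space inefficiency described above.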

Summary of the invention

(1) Technical problem to be solved by the invention:

This scheme uses a Bloom Filter, which is better than the basic linked-list approach in both space and time efficiency. Because multiple Hash functions are used, collisions are reduced, so the scheme is also better than the HashTable approach in space utilization and collision probability.

(2) Technical solution

To achieve the above object, the present invention proposes a method and system for recommendation based on a current web page list. A Bloom Filter is used to identify an acquired URL and judge whether it already exists in the existing web page list. In this way, the addresses of websites to be collected can be obtained efficiently and accurately, web pages to be collected and web pages already collected can be distinguished in real time, resources are fully used, and a service is provided for precise content recommendation and advertisement placement.

A Bloom Filter has good space and time efficiency and is used to test whether an element is a member of a set. It is a randomized storage structure with high space utilization that represents a set with a bit array and can judge whether an element belongs to that set. The test can err in only one direction: each query returns either "in the set (possibly wrong)" or "not in the set (definitely not in the set)". In other words, when judging whether an element belongs to a set, a Bloom Filter may mistake an element that does not belong to the set for one that does (False Positive), but it never mistakes an element that belongs to the set for one that does not. In URL deduplication, False Positives are acceptable as long as their rate stays within a certain bound, which makes this a well-suited application for the Bloom Filter.

Specifically, in one aspect, the invention provides a method for recommendation based on a current web page list, characterized in that the method comprises the steps of:

S1: obtaining the currently accessed URL;

S2: judging whether this URL has been collected, wherein a Bloom Filter algorithm is used in this step to identify whether the obtained URL has been collected; if yes, go to S3; if no, go to S4;

S3: querying the data related to the URL, then go to S6;

S4: adding the URL to the to-be-collected queue, that is, reporting the URL to the uncrawled-URL queue to wait for collection by the crawler tool;

S5: obtaining the information related to the URL;

S6: recommending content / placing advertisements.
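The control flow of steps S1 to S6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a plain Python set stands in for the Bloom Filter, the crawl in S5 runs synchronously, and the names `handle`, `crawl_and_analyze`, `recommend`, `context_db` and `to_collect` are all hypothetical.

```python
from collections import deque

collected = set()      # stand-in for the Bloom Filter (assumption)
context_db = {}        # URL -> related data such as the site category
to_collect = deque()   # queue of URLs that have not been crawled yet


def crawl_and_analyze(url):
    """Hypothetical crawler + analyzer stub; returns labels for the URL."""
    return {"category": "news" if "news" in url else "other"}


def recommend(info):
    """Hypothetical selection rule keyed on the site category."""
    return f"content for {info['category']} sites"


def handle(url):
    """Process one reported URL (S1) and return a recommendation (S6)."""
    if url in collected:                    # S2: collected before?
        # S3: query related data. With a real Bloom Filter a false positive
        # can land here for an uncollected URL, hence the default value.
        info = context_db.get(url, {"category": "unknown"})
    else:
        to_collect.append(url)              # S4: add to the to-be-collected queue
        pending = to_collect.popleft()      # S5: the crawler takes it from the queue
        info = crawl_and_analyze(pending)
        context_db[pending] = info
        collected.add(pending)
    return recommend(info)                  # S6: recommend content / place ads


print(handle("http://news.example.com/story"))  # first visit: S4 -> S5 -> S6
print(handle("http://news.example.com/story"))  # second visit: S3 -> S6
```

In a deployed system the crawler would more plausibly drain the queue in the background, so the first request for an unseen URL might be served a default recommendation; the source does not specify this.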

Preferably, in step S2, after the server receives the URL reported in step S1, the Bloom Filter algorithm is used to identify whether this web page has been collected before. After a URL is obtained, the k corresponding bits are each checked for 0 or 1, where k is the number of hash functions used by the algorithm. If all k corresponding positions are 1, the URL is considered to already exist in the existing set, that is, the web page has been collected. If any corresponding position is not 1, the URL is considered not to be in the existing set, that is, the web page has not been collected.

Preferably, in step S3, the URL is sent to a context database, and information such as the website category is obtained from the context database.

Preferably, in step S5, the crawler tool takes the URL from the uncollected-URL queue and crawls the content of the corresponding web page or website, submits it to a data analyzer for content analysis, and attaches the relevant labels to the analyzed web page or website.

Preferably, in step S6, the server selects suitable content according to the category of the website and other information, and recommends and delivers it as media or advertisements.

In another aspect, the invention provides a system for recommendation based on a current web page list, characterized in that the system comprises the following modules:

M1: for obtaining the currently accessed URL;

M2: for judging whether this URL has been collected, wherein a Bloom Filter algorithm is used in this module to identify whether the obtained URL has been collected; if yes, go to M3; if no, go to M4;

M3: for querying the data related to the URL, then go to M6;

M4: for adding the URL to the to-be-collected queue, that is, reporting the URL to the uncrawled-URL queue to wait for collection by the crawler tool;

M5: for obtaining the information related to the URL;

M6: for recommending content / placing advertisements.

Preferably, in module M2, after the server receives the URL reported by module M1, the Bloom Filter algorithm is used to identify whether this web page has been collected before. After a URL is obtained, the k corresponding bits are each checked for 0 or 1, where k is the number of hash functions used by the algorithm. If all k corresponding positions are 1, the URL is considered to already exist in the existing set, that is, the web page has been collected. If any corresponding position is not 1, the URL is considered not to be in the existing set, that is, the web page has not been collected.

Preferably, in module M3, the URL is sent to a context database, and information such as the website category is obtained from the context database.

Preferably, in module M5, the crawler tool takes the URL from the uncollected-URL queue and crawls the content of the corresponding web page or website, submits it to a data analyzer for content analysis, and attaches the relevant labels to the analyzed web page or website.

Preferably, in module M6, the server selects suitable content according to the category of the website and other information, and recommends and delivers it as media or advertisements.

(3) Technical effects

The present invention is used to obtain a web page list and can effectively deduplicate web pages.

The present invention uses the Bloom Filter algorithm to obtain the current web page list, so the addresses of websites to be collected can be obtained efficiently and accurately.

The present invention can distinguish web pages to be collected from web pages already collected in real time, makes full use of resources, and serves precise content recommendation and advertisement placement.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method for recommendation based on a current web page list according to the present invention;

Fig. 2 is a schematic structural diagram of the system for recommendation based on a current web page list according to the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

To overcome the above defects of the prior art, the invention provides a method and system for recommendation based on a current web page list. A Bloom Filter algorithm is used to classify the URL of the currently obtained web page, so that URLs belonging to different sets are processed differently, thereby achieving higher time efficiency and better space efficiency.

A Bloom Filter is a data structure built on a binary vector. It has good space and time efficiency and is used to test whether an element is a member of a set. The test can err in only one direction: each query returns either "in the set (possibly wrong)" or "not in the set (definitely not in the set)". To judge whether an element is in a set, the usual approach is to store all elements and determine membership by comparison; linked lists and trees are both based on this idea. As the number of elements in the set grows, the space and time required grow linearly, and retrieval becomes slower and slower. A Bloom Filter instead uses hash functions: an element is mapped to a point in an array of length m, and if that point is 1 the element is in the set, otherwise it is not. To address the collision problem of ordinary hashing, the Bloom Filter algorithm uses k hash functions that map an element to k points: if all of these points are 1, the element is in the set; if any of them is 0, the element is not in the set. The advantages of the Bloom Filter are that its insertion and query times are both constant and that it looks up elements without storing the elements themselves, which gives it good security. Its shortcoming is equally apparent: the more elements are inserted, the higher the probability of wrongly reporting "in the set" (False Positive). In URL deduplication, however, False Positives are acceptable as long as their rate stays within a certain bound, which makes this a well-suited application for the Bloom Filter.
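The structure described above (an m-bit array probed at k hashed positions) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the invention: the source does not name its hash functions, so deriving the k positions from one SHA-256 digest by double hashing is an assumption, as are the class and method names.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        if m <= 0 or k <= 0:
            raise ValueError("m and k must be positive")
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, url: str) -> list:
        # Derive k positions from one SHA-256 digest (double hashing).
        # This hashing scheme is an illustrative assumption.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url: str) -> None:
        # Insertion sets all k bits to 1.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # "In the set" only if every one of the k bits is 1.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


collected = BloomFilter(m=1 << 16, k=7)
collected.add("http://example.com/a")
print("http://example.com/a" in collected)  # True: inserted URLs are always found
print("http://example.com/b" in collected)  # almost certainly False (could be a false positive)
```

Both operations touch exactly k bits regardless of how many URLs have been inserted, which is the constant insertion and query time mentioned above, and the filter never stores the URL text itself.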

On this basis, in one embodiment of the present invention, as shown in Fig. 1, the method for recommendation based on a current web page list mainly comprises the steps of:

S1: Obtain the currently accessed URL.

When a user accesses a web page, the browser reports the page URL to the server. The URL of the current page can be obtained by means such as an installed plug-in or the access log; existing URL acquisition techniques can be used in this step.

S2: Judge whether this URL has been collected. If so, go to step S3; otherwise go to step S4.

After the server receives the URL reported in step S1, the Bloom Filter algorithm is used to identify whether this web page has been collected before.

Specifically, after a URL is obtained, and assuming k hash functions are used, the k corresponding bits are each checked for 0 or 1. If all k corresponding positions are 1, the URL is considered to already exist in the existing set, that is, the web page has been collected. If any corresponding position is not 1, the URL is considered not to be in the existing set, that is, the web page has not been collected. As elements are inserted, more bits in the Bloom Filter are set and collisions become more likely. When a new element arrives and satisfies the in-set condition, that is, all of its corresponding bits are 1, there are two possibilities: either the element really is in the set and there is no misjudgment, or a Hash collision has occurred and the element is not actually in the set, which is a misjudgment. The probability of misjudgment therefore rises as the filter fills. Compared with existing algorithms that use a single Hash function, however, this approach greatly reduces collisions and misjudgments and effectively improves space efficiency. In URL deduplication, this misjudgment rate is acceptable as long as it stays within a certain bound.
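How the misjudgment rate grows with the number of inserted URLs, and how k hash functions compare with a single one, can be estimated numerically. The sketch below uses the standard textbook approximation for Bloom filters, which the source does not state; it assumes independent, uniformly distributed hash functions, and the function name and the sample values of m, n and k are illustrative.

```python
def false_positive_rate(m: int, n: int, k: int) -> float:
    """Probability that a URL never inserted finds all k of its bits set.

    m: number of bits, n: URLs inserted so far, k: number of hash functions.
    Assumes independent, uniformly distributed hash functions.
    """
    bit_still_zero = (1.0 - 1.0 / m) ** (k * n)  # a given bit untouched after k*n writes
    return (1.0 - bit_still_zero) ** k


m = 10_000
for n in (250, 500, 1_000):
    single = false_positive_rate(m, n, k=1)  # single-hash bit array
    multi = false_positive_rate(m, n, k=7)   # Bloom filter with 7 hash functions
    print(f"n={n:5d}  k=1: {single:.4f}  k=7: {multi:.6f}")
```

For these lightly loaded cases (at least 10 bits per URL) the multi-hash rate is well below the single-hash rate and both rise with n. The advantage does not hold unconditionally: once the array is nearly full, additional hash functions set bits faster and make misjudgments more likely, so m and k have to be sized for the expected number of URLs.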

S3: Query the data related to the URL. The URL is sent to a context database, information such as the website category is obtained from the context database, and the method goes to step S6.

S4: Add the URL to the to-be-collected queue. The URL is reported to the uncrawled-URL queue to wait for collection by the crawler tool.

S5: Obtain the information related to the URL. The crawler tool takes the URL from the uncollected-URL queue and crawls the content of the corresponding web page or website, submits it to a data analyzer for content analysis, and attaches the relevant labels to the analyzed web page or website, for example by classifying the website as a shopping website, a consulting website, a news website, or another website type.
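Step S5 can be pictured as a worker that drains the queue, fetches each page, and stores the labels produced by the analyzer. The sketch below is hypothetical throughout: the source names the website types but not the analysis method, so the keyword rules, the function names `label_page` and `run_crawler`, and the in-memory `pages` dictionary that replaces real HTTP fetching are all assumptions.

```python
from collections import deque

# Hypothetical keyword rules; the source does not describe the analyzer.
RULES = {
    "shopping": ("cart", "checkout", "price"),
    "news": ("breaking", "reporter", "headline"),
}


def label_page(text: str) -> list:
    """Return the labels whose keywords occur in the page text."""
    lowered = text.lower()
    labels = [
        label
        for label, keywords in RULES.items()
        if any(word in lowered for word in keywords)
    ]
    return labels or ["unclassified"]


def run_crawler(queue, fetch, context_db) -> None:
    """Drain the uncollected-URL queue, labeling each fetched page."""
    while queue:
        url = queue.popleft()
        context_db[url] = label_page(fetch(url))


# Usage with an in-memory stand-in for real page fetching.
pages = {
    "http://shop.example/": "Add to cart and checkout",
    "http://daily.example/": "Breaking headline",
}
queue = deque(pages)
context_db = {}
run_crawler(queue, pages.get, context_db)
print(context_db)
```

The labels written to `context_db` here are what step S3 would later read back for a URL that has already been collected.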

S6: Recommend content / place advertisements. The server selects suitable content according to the category of the website and other information, and recommends and delivers it as media or advertisements.

Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiment can be carried out by hardware under the instruction of a program. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiment; the storage medium may be a ROM/RAM, a magnetic disk, an optical disc, a memory card, or the like. Accordingly, those skilled in the relevant art will understand that, corresponding to the method of the present invention, the present invention also includes a system for recommendation based on a current web page list. As shown in Fig. 2, and in correspondence with the method steps above, the system comprises:

Acquisition module, for obtaining the currently accessed URL.

When a user accesses a web page, the acquisition module causes the browser to report the page URL to the server. The URL of the current page can be obtained by means such as an installed plug-in or the access log; existing URL acquisition techniques can be used here.

Judgment module, for judging whether this URL has been collected. If so, the URL is processed by the query module; otherwise, the URL is passed to the queue module for processing.

After the server receives the URL obtained by module M1, the Bloom Filter algorithm is used to identify whether this web page has been collected before.

Query module, for querying the data related to the URL. The URL is sent to a context database, information such as the website category is obtained from the context database, and processing passes to the recommendation module.

Queue module, for adding the URL to the to-be-collected queue. The URL is reported to the uncrawled-URL queue to wait for collection by the crawler tool.

Collection module, for obtaining the information related to the URL. The crawler tool takes the URL from the uncollected-URL queue and crawls the content of the corresponding web page or website, submits it to a data analyzer for content analysis, and attaches the relevant labels to the analyzed web page or website, for example by classifying the website as a shopping website, a consulting website, a news website, or another website type.

Recommendation module, for recommending content / placing advertisements. The server selects suitable content according to the category of the website and other information, and recommends and delivers it as media or advertisements.

With the method and system for obtaining a web page list proposed by the present invention, popular web pages can be collected effectively and collection efficiency is improved.

Although the invention has been described above in conjunction with preferred embodiments, those skilled in the art should understand that the method and system of the present invention are not limited to the embodiments described in the detailed description, and that various modifications, additions and substitutions can be made to the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for recommendation based on a current web page list, characterized in that the method comprises the steps of:

S1: obtaining the currently accessed URL;

S2: using a Bloom Filter algorithm to judge whether the web page corresponding to the URL has been collected; if so, going to step S3; otherwise, going to step S4;

S3: querying the related data of the URL and then going to step S6;

S4: adding the URL to the to-be-collected queue, that is, reporting it to the uncrawled-URL queue to wait for collection by a crawler tool;

S5: using the crawler tool to obtain the related data of the URL;

S6: recommending or delivering content according to the related data of the URL.

2. The method of claim 1, characterized in that, in step S2, using the Bloom Filter algorithm to judge whether the web page corresponding to the URL has been collected specifically comprises:

after a URL is obtained, checking whether each of the k corresponding bits is 0 or 1, where k is the number of hash functions used by the algorithm; if all k corresponding positions are 1, considering that the URL already exists in the existing set, that is, the corresponding web page has been collected; if any corresponding position is not 1, considering that the URL is not in the existing set, that is, the corresponding web page has not been collected.

3. The method of claim 1, characterized in that, in step S3, the URL is sent to a context database and website category information is obtained from the context database.

4. The method of claim 1, characterized in that, in step S5, the crawler tool takes the URL from the uncollected-URL queue and crawls the content of the corresponding web page or website, submits it to a data analyzer for content analysis, and attaches the relevant labels to the analyzed web page or website.

5. The method of claim 1, characterized in that, in step S6, the server selects suitable content according to the website category and other information, and recommends and delivers it.

6. A system for recommendation based on a current web page list, characterized in that the system comprises:

an acquisition module, for obtaining the currently accessed URL;

a judgment module, for using a Bloom Filter algorithm to judge whether the web page corresponding to the URL has been collected; if so, the URL is processed by the query module; otherwise, the URL is passed to the queue module for processing;

a query module, for querying the related data of the URL and then passing it to the recommendation module for processing;

a queue module, for adding the URL to the to-be-collected queue, that is, reporting it to the uncrawled-URL queue to wait for collection by a crawler tool;

a collection module, for using the crawler tool to obtain the related data of the URL;

a recommendation module, for recommending or delivering content according to the related data of the URL.

7. The system of claim 6, characterized in that, in the judgment module, using the Bloom Filter algorithm to judge whether the web page corresponding to the URL has been collected specifically comprises:

8. The system of claim 6, characterized in that, in the query module, the URL is sent to a context database and website category information is obtained from the context database.

9. The system of claim 6, characterized in that, in the collection module, the crawler tool takes the URL from the uncollected-URL queue and crawls the content of the corresponding web page or website, submits it to a data analyzer for content analysis, and attaches the relevant labels to the analyzed web page or website.

10. The system of claim 6, characterized in that, in the recommendation module, the server selects suitable content according to the website category and other information, and recommends and delivers it.