CN101283357A - Search using changes in prevalence of content items on the web

Info

Publication number: CN101283357A
Application number: CNA2006800378127A
Authority: CN
Inventors: 史蒂芬·罗伯特·艾夫斯
Original assignee: Taptu Ltd
Current assignee: Taptu Ltd
Priority date: 2005-10-11
Filing date: 2006-10-05
Publication date: 2008-10-08
Also published as: EP1938214A1; WO2007042840A1

Abstract

A search engine has a query server (50) arranged to receive a search query from a user and return search results. The query server is arranged to identify one or more content items relevant to the query, to access a record of changes over time of occurrences of the identified content items, and to rank the search results according to the record of changes. This can help find those content items which are currently active, and to track or compare the popularity of content items. This is particularly useful for content items whose subjective value to the user depends on them being topical or fashionable. A content analyzer (100) creates a fingerprint for each content item, maintains a fingerprint database, compares the fingerprints to determine the number of occurrences of a given content item at a given time, and records the changes over time of the occurrences.

Description

Search using changes in prevalence of content items on the web

Technical field

The present invention relates to search engines, to content analysers for such search engines, to fingerprint databases of content items, to methods of using such search engines, to methods of creating such databases, and to corresponding programs.

Background technology

As is well known, the purpose of a search engine is to retrieve a list of addresses of documents on the World Wide Web that are relevant to one or more search keywords. A search engine is typically a remotely accessible software program that indexes Internet addresses (uniform resource locators ("URLs"), newsgroups, file transfer protocol ("FTP") sites, image locations and so on). The list of addresses is typically a list of "hyperlinks" or Internet addresses of information, drawn from the index in response to a query. A user query may consist of a keyword, a list of keywords, or a structured query expression such as a Boolean query.

A typical search engine "crawls" the web by searching the connected computers that store the information, and may make a copy of the information in a "web mirror". This copy has an index of the keywords in the documents. Since any one keyword in the index may be present in hundreds of documents, the index holds, for each keyword, a list of pointers to those documents, together with some way of ranking them by relevance. The documents are ranked by various measures, referred to as relevance, usefulness or value measures. A metasearch engine accepts a search query, sends it (possibly transformed) to one or more conventional search engines, and collects and processes the responses from those engines to present a list of documents to the user.

It is known to rank hypertext pages on the basis of intrinsic and extrinsic ranks of the pages, which are in turn based on content and connectivity analysis. Connectivity here means the hyperlinks to a given page from other pages, called "backlinks" or "inbound links". These links can be weighted by quantity and quality, for example by the popularity of the pages that carry them. PageRank(TM) is a static ranking of web pages used as the core of the Google(TM) search engine (http://www.google.com).

As acknowledged in US Patent 6751612 (Schuetze), because of the vast amount of distributed information currently being added to the web every day, it is difficult to keep an up-to-date index of information in a search engine. The most recent information is sometimes the most valuable, yet it is often not indexed by the search engine. In addition, search engines do not usually use a user's personal search information when updating the search engine index. Schuetze proposes selectively searching the web for relevant current information based on the user's personal search information (or filter profile), so that relevant information added recently is more likely to be found. The user supplies personal search information such as a query and how often the filter should perform a search. The filter invokes a web crawler to search selected or ranked servers on the web according to a search strategy or ranking selected by the user. The filter directs the web crawler to search a predetermined number of ranked servers on the basis of: (1) the likelihood that the server has content highly relevant to the user's query ("content ranking selection"); (2) the likelihood that the server has content that changes frequently ("frequency ranking selection"); or (3) a combination of these.

According to US Patent Application 2004044962 (Green), current search engine systems fail to return current content for two reasons. The first problem is the slow scan rate at which search engines currently look for new and changed content on the web. Even the best conventional crawlers visit most web pages only about once a month. Reaching a high web scan rate of about once a day would consume too much of the bandwidth flowing to a small number of locations on the network. The second problem is that current search engines do not incorporate new content well into their "rankings". Because new content inherently has few links pointing to it, it cannot rank highly under Google's PageRank(TM) scheme or similar schemes. Green proposes using a metacomputer to gather recently available information on the web; the metacomputer includes an information-gathering crawler that is instructed to filter out old, unchanged information. To assess the importance or relevance of the recent information, pages with new content are ranked partly on the basis of the authority of their neighbouring pages. For new content, the ranking decreases as time passes from when the content was found.

As described in US 6,658,423 (Pugh), duplicate or near-duplicate documents are a problem for search engines, and it is desirable to eliminate them in order to (i) reduce storage requirements (for example for the index and for data structures derived from the index), and (ii) reduce the resources needed to process indexes, queries and so on. Pugh proposes generating fingerprints for each document by (i) extracting parts (for example words) from the document, (ii) hashing each extracted part to determine which of a predetermined number of lists is to be populated with that part, and (iii) generating a fingerprint for each list. Duplicates can then be eliminated, or clusters of near-duplicate documents can be formed, with a transitive property applied within the cluster. Each document may have an identifier of its associated cluster. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and match the query equally well, only the one deemed more likely to be relevant (for example by virtue of a higher page rank, being more recent, and so on) is returned. During a crawl operation, to speed up the crawl and save bandwidth, near-duplicate web pages or sites, as determined from documents revealed by an earlier crawl, are detected and not crawled. After a crawl, if duplicates are found, only one of them is indexed. Duplicates can also be detected in response to a query and kept out of the search results. In addition, duplicates can be used to "fix" broken links to documents (for example web pages) that no longer exist (at a particular location or URL), by providing a link to the near-duplicate page.
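The three fingerprinting steps attributed to Pugh above can be illustrated with a minimal Python sketch. It is not taken from the cited patent: the choice of SHA-1, the use of whole words as parts, the number of lists and the function names `list_fingerprints` and `shared_fingerprints` are all assumptions made for illustration.

```python
import hashlib


def list_fingerprints(text, num_lists=4):
    """Return one fingerprint per list for a document (illustrative only)."""
    lists = [[] for _ in range(num_lists)]
    for word in text.lower().split():
        # Steps (i) and (ii): extract a part (here, a word) and hash it
        # to decide which of the lists it populates.
        digest = hashlib.sha1(word.encode("utf-8")).digest()
        lists[digest[0] % num_lists].append(word)
    # Step (iii): one fingerprint per list. Sorting and de-duplicating the
    # parts is an assumption of this sketch, so word order does not matter.
    return tuple(
        hashlib.sha1(" ".join(sorted(set(parts))).encode("utf-8")).hexdigest()
        for parts in lists
    )


def shared_fingerprints(a, b):
    """Count the list positions at which two fingerprint tuples agree."""
    return sum(x == y for x, y in zip(a, b))
```

Two documents that differ in a single word disagree in at most two of the lists, so a high count from `shared_fingerprints` suggests a near-duplicate.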

Summary of the invention

The present invention aims to provide improved apparatus and methods.

According to a first aspect, the invention provides a search engine for searching content items accessible online, the search engine having a query server arranged to receive a search query from a user and return search results relevant to the search query, the query server being arranged to identify one or more of the content items relevant to the query, to access a record of changes over time of occurrences of the identified content items, and to rank the search results, or otherwise derive the search results, according to the record of changes over time.

This can help a user find those content items that are currently active, and track or compare the popularity of content items. It is particularly useful for content items whose subjective value to the user depends on whether they are topical or fashionable. Compared with existing search engines that rank results solely by the quantity and quality of backlinks, this aspect of the invention can identify more quickly and efficiently which content is on a rising trend of popularity, which suggests it will become more popular or more interesting. It can also, for example, demote those content items that are on a downward trend. The search results can thereby be made more relevant to the user.

An additional feature of some embodiments is that the search engine has a content analyser arranged to create a fingerprint for each content item, maintain a fingerprint database of the fingerprints, compare the fingerprints to determine a number of occurrences of a given content item at a given time, and record the changes over time of the occurrences.

Such fingerprints allow comparison across a wide range of media types, including audio and video items. This is particularly valuable given the wide range of content types and the open, uncontrolled nature of the web.

An additional feature of some embodiments is that the occurrences comprise copies of the content item at different web page locations. This is useful for content that is easily copied by users, such as images and audio items. The feature rests on the recognition that multiple occurrences (copies), previously regarded as a problem for search engines, can in fact serve as a useful source of information for a variety of purposes.

An additional feature of some embodiments is that the occurrences also comprise references to a given content item, the references comprising one or more of: hyperlinks to the given content item, hyperlinks to a web page containing the given item, and other types of reference. This is useful for content that is too large to copy easily, such as video items, or for interactive items such as games.

An additional feature of some embodiments is that the search engine is arranged to determine a value representing the occurrences as a weighted combination of copies, hyperlinks and other types of reference. The weighting helps to produce a more realistic value.

An additional feature of some embodiments is that the search engine is arranged to weight the copies, hyperlinks and other types of reference according to one or more of: their type, their location (so as to favour occurrences at locations associated with more activity), and other parameters.
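As a rough illustration of such a weighted combination, the Python sketch below sums a per-type weight multiplied by a per-location activity factor. The weight values, the `site_activity` field and the names `Occurrence` and `weighted_occurrence_value` are hypothetical; the source does not specify how the weights are chosen.

```python
from dataclasses import dataclass

# Hypothetical weights per occurrence type; the source gives no values.
TYPE_WEIGHTS = {"copy": 1.0, "hyperlink": 0.5, "other_reference": 0.25}


@dataclass
class Occurrence:
    kind: str                   # "copy", "hyperlink" or "other_reference"
    site_activity: float = 1.0  # assumed multiplier for more active locations


def weighted_occurrence_value(occurrences):
    """Combine occurrences into one value, weighted by type and location."""
    return sum(TYPE_WEIGHTS[o.kind] * o.site_activity for o in occurrences)
```

For example, one copy, one hyperlink on a site with activity factor 2.0 and one other reference give 1.0 + 1.0 + 0.25 = 2.25.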

An additional feature of some embodiments is an index to a database of the content items, the query server being arranged to use the index to select a number of candidate content items and then rank the candidates according to the record of changes over time of their occurrences. This allows the computationally intensive ranking operation to be carried out on a more limited number of items.

An additional feature of some embodiments is a prevalence ranking server arranged to rank the candidate content items according to one or more of: the number of occurrences; the number of occurrences within a given date range; the rate of change over time of the occurrences (henceforth called the prevalence growth rate); the rate of change of the prevalence growth rate (henceforth called the prevalence acceleration); and a quality metric of the web sites associated with the occurrences. This helps to find more relevant results, or to provide richer information, for example about the prevalence of a given item.
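The growth rate and acceleration named above can be read as first and second differences of the occurrence count over time. The following Python sketch makes that reading concrete under the assumption that counts are sampled at equal intervals; the function name `prevalence_metrics` and the dictionary keys are illustrative and do not come from the source.

```python
def prevalence_metrics(counts):
    """Derive simple prevalence metrics from occurrence counts.

    `counts` holds occurrence counts sampled at equal intervals, oldest
    first (an assumption of this sketch). At least three samples are
    needed to form a second difference.
    """
    if len(counts) < 3:
        raise ValueError("need at least three samples")
    # First difference: change in occurrences per interval (growth rate).
    growth = [later - earlier for earlier, later in zip(counts, counts[1:])]
    # Second difference: change in growth rate per interval (acceleration).
    accel = [later - earlier for earlier, later in zip(growth, growth[1:])]
    return {
        "occurrences": counts[-1],
        "growth_rate": growth[-1],
        "acceleration": accel[-1],
    }
```

With counts of 10, 12, 17 and 27, the growth per interval is 2, 5 and 10, so the latest growth rate is 10 and the latest acceleration is 5.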

An additional feature of some embodiments is that the content analyser is arranged to create the fingerprint according to the media type of the content item, and to compare it with existing fingerprints of content items of the same media type. This makes the comparison more efficient and allows better searching of multimedia pages.

An additional feature of some embodiments is that the content analyser is arranged to create the fingerprint in any suitable manner. For example, for hypertext content items the fingerprint may comprise any combination of: file size, CRC (cyclic redundancy check), timestamp, keywords and title. For sound, image or video content items the fingerprint may comprise any combination of: image/frame size, duration, CRC (cyclic redundancy check) of some or all of the data, embedded metadata, header fields of the image or video, media type, MIME type, thumbnail and sound signature.
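A minimal Python sketch of one such fingerprint follows, using only three of the listed ingredients: media type, data size and a CRC of the data. The class and function names are hypothetical, and a fingerprint this simple matches only byte-identical copies; the thumbnails or sound signatures mentioned above would be needed to recognise re-encoded copies.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    media_type: str  # top-level MIME type, e.g. "image"
    size: int        # length of the data in bytes
    crc: int         # CRC-32 of all of the data


def make_fingerprint(data, mime_type):
    """Build a fingerprint from raw bytes and a MIME type string."""
    media_type = mime_type.split("/", 1)[0]
    return Fingerprint(media_type, len(data), zlib.crc32(data))


class FingerprintDatabase:
    """Keeps fingerprints grouped by media type (illustrative only)."""

    def __init__(self):
        self._by_type = {}  # media type -> {fingerprint: content item ID}
        self._next_id = 1

    def lookup_or_add(self, fp):
        """Return the content item ID for fp, comparing it only with
        fingerprints of the same media type and adding it if unseen."""
        known = self._by_type.setdefault(fp.media_type, {})
        if fp not in known:
            known[fp] = self._next_id
            self._next_id += 1
        return known[fp]
```

Grouping by media type means a new image fingerprint is never compared against audio or video fingerprints, which is the efficiency gain described above.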

An additional feature of some embodiments is a web collection server arranged to determine which web sites on the web are revisited and how often, so as to provide content items to the content analyser. The web collection server can be arranged to determine the collection of web sites according to one or more of: the media type of the content items, the subject category of the content, and the record of changes in occurrences of content items associated with the web sites. This helps to keep the prevalence measures up to date more efficiently.
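One way a revisit frequency could follow from the record of changes is sketched below in Python. The inverse relationship, the default of 24 hours, the one-hour floor and the name `revisit_interval_hours` are all assumptions for illustration; the source states only that the change record may inform which sites are revisited and how often.

```python
def revisit_interval_hours(changes_since_last_visit, base_hours=24.0, min_hours=1.0):
    """Suggest how long to wait before revisiting a web site.

    Sites whose content items showed more changes in occurrences since the
    last visit are revisited sooner, down to a floor of `min_hours`.
    """
    return max(min_hours, base_hours / (1 + changes_since_last_visit))
```

A site with no observed changes is revisited after 24 hours, one with three changes after 6 hours, and a very active site no more often than once an hour.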

The search results may comprise a list of content items, together with an indication of the ranking of the listed items according to the changes over time of their occurrences. This helps the search return more relevant results.

Another aspect of the invention provides a content analyser for a search engine, arranged to create a record of changes over time of occurrences of content items accessible online. The content analyser has a fingerprint generator arranged to create a fingerprint of each content item and to compare the fingerprints to determine multiple occurrences of the same content item. The content analyser is arranged to store the fingerprints in a fingerprint database and to maintain a record of changes over time of the occurrences of at least some of the content items, for use in responding to search queries.

An additional feature of some embodiments is that the content analyser is arranged to identify the media type of each content item, and the fingerprint generator is arranged to create and compare fingerprints according to the media type.

An additional feature of some embodiments is a reference server arranged to find, in a page, references to other content items, and to add a record of each reference to the record of occurrences of the referenced content item.

An additional feature of some embodiments is that the fingerprint generator is arranged to create the fingerprint such that, for hypertext content items, it comprises any combination of: file size, CRC (cyclic redundancy check), timestamp, keywords and title; and, for sound, image or video content items, it comprises any combination of: image/frame size, duration, CRC (cyclic redundancy check) of some or all of the data, embedded metadata, header fields of the image or video, media type, MIME type, thumbnail, sound signature, or any other kind of signature.

Another aspect provides a fingerprint database created by the content analyser and holding fingerprints of content items.

An additional feature of some embodiments is that the fingerprint database has a record of changes over time of occurrences of the content items.

Another aspect provides a method of using a search engine having a record of changes over time of occurrences of given content items accessible online, the method having the steps of sending a query to the search engine and receiving from the search engine search results relevant to the search query, the search results being ranked using the record of changes over time of occurrences of the content items relevant to the query.

These are the steps taken at the user's end, reflecting that the user can benefit from more relevant search results and richer information, for example information about changes in prevalence.

An additional feature of some embodiments is that the search results comprise a list of content items, together with an indication of the ranking of the listed items according to the changes over time of their occurrences.

The above aspects and embodiments of the invention are typically implemented in computer program code, which may for example be carried on a machine-readable medium or embodied in a computer system.

Accordingly, another aspect provides a program on a machine-readable medium, arranged to carry out a method of searching content items accessible online, the method having the steps of: receiving a search query, identifying one or more content items relevant to the query, accessing a record of changes over time of occurrences of the identified content items, and ranking the search results according to the record of changes.

An additional feature of some embodiments is that the program is arranged to use the search results for one or more of: measuring the prevalence of copyright works; measuring the prevalence of advertisements; focusing the web collection of web sites for the crawler, so that crawling is directed according to which web sites show more change in occurrences of content items; focusing the content analyser on updating the parts of the fingerprint database derived from web sites showing more change in occurrences of content items; extrapolating from the record of changes in occurrences of a given content item to estimate future popularity; pricing advertisements according to the rate of change of occurrences; and pricing downloads of content items according to the rate of change of occurrences.

Any of the additional features can be combined together and combined with any of the aspects. Other advantages, especially over the prior art, will be apparent to those skilled in the art. Numerous variations and modifications can be made without departing from the claims of the present invention. It should therefore be clearly understood that the form of the present invention is illustrative only and is not intended to limit its scope.

Description of drawings

How the present invention may be put into effect will now be described by way of example with reference to the accompanying drawings, in which:

Fig. 1 shows the topology of a search engine according to an embodiment,

Fig. 2 shows an overall process view according to an embodiment,

Fig. 3 shows a content analyser process according to an embodiment,

Fig. 4 shows a query server process according to an embodiment,

Fig. 5 shows a query server process according to another embodiment,

Fig. 6 shows a content analyser according to another embodiment,

Fig. 7 shows a web collection database according to another embodiment,

Fig. 8 shows a sample of a fingerprint database according to another embodiment,

Fig. 9 shows a sample of a keyword database, and

Fig. 10 shows a content analyser according to another embodiment.

Embodiment

Definition

A content item can include, for example, a web page, a text extract, a news item, an image, a sound or video clip, an interactive game, or many other types of content. Content that is "accessible online" is defined to include at least items on pages of web sites on the World Wide Web, items in the deep web (for example databases of items accessible through queries via a web page), items available internally on a company intranet, and any online database, including online merchants and marketplaces.

In the context of references to content items, the term "references" is defined to include at least hyperlinks, thumbnails, summaries, comments, extracts, samples, translations and derivatives.

Changes in occurrences can mean changes in the number of occurrences and/or changes in the quality or character of the occurrences, for example a move of location to a more popular or active site.

A "keyword" can include a word or phrase of text, or any pattern including a sound or image signature.

Hyperlinks are intended to include hypertext, buttons, softkeys, menus, navigation bars, or any displayed indication or audible prompt that can be selected by a user to present different content.

The term "comprising" is used as an open-ended term; it does not exclude items other than those listed.

Fig. 1, overall topological structure

Fig. 1 depicts the overall topology of a first embodiment of the invention. Fig. 2 shows a summary of some of the main processes. In Fig. 1, a query server 50 and a web crawler 80 are connected to the Internet 30 (and are implemented as web servers; for the purposes of this diagram the web servers are an integral part of the query server and the web crawler). The web crawler spiders the World Wide Web to access web pages 110 and builds up a web mirror database 90 of locally cached web pages. The crawler is directed by a web collection server 730, which controls which web sites are revisited and how often, so that the content analyser can detect changes in occurrences of content items. An index server 105 builds a web page index 60 from the web mirror. A content analyser 100 processes the web pages and associated multimedia files accumulated in the web mirror, and derives fingerprint information from each of these multimedia files. This fingerprint information is stored in a fingerprint database 65. Fig. 1 also shows a prevalence ranking server 107, which can calculate rankings and other prevalence-related metrics from the fingerprint database. The system can be made up of many servers and databases distributed across a network, or in principle they can be consolidated at a single location or machine. The term search engine can refer to the front end, which in this example is the query server, together with some, all or none of the back-end parts used by the query server.

A plurality of users 5 connected to the Internet via desktop computers 11 or mobile devices 10 can make searches through the query server. Users making searches from mobile devices ("mobile users") are connected to a wireless network 20 managed by a network operator, which is in turn connected to the Internet via a WAP gateway, IP router or other similar device (not explicitly shown).

Many variations can be envisaged; for example, the content items can be located elsewhere than on the web, the content analyser can obtain the content from the content sources rather than from the web mirror, and so on.

Device description

The user can access the search engine from any kind of computing device, including desktop, laptop and handheld computers. Mobile users can use mobile devices such as telephone-like handsets communicating over a wireless network, or any kind of wirelessly connected mobile device, including PDAs, notebooks, point-of-sale terminals, laptop computers and so on. Each device typically comprises one or more CPUs, memory, I/O devices such as a keypad, keyboard, microphone, touchscreen and display, and a wireless network radio interface.

These devices typically run a web browser or microbrowser application, for example Openwave(TM), Access(TM) or Opera(TM), which can access web pages across the Internet. The pages may be normal HTML web pages, or pages formatted specifically for mobile devices using various subsets and variants of HTML, including cHTML, DHTML, XHTML, XHTML Basic and XHTML Mobile Profile.

Server description

In one embodiment of the search engine according to the invention, shown in Fig. 1, four main types of server are envisaged, as described below. Although illustrated as separate servers, the same functions can be arranged or divided in different ways, so as to run on different numbers of servers, as different numbers of processes, or to be run by different organisations.

A) A query server, which handles search queries from desktop PCs and mobile devices, passes them on to the other servers, and formats the response data into web pages customised to the different types of device as appropriate. Optionally the query server can operate behind a front end to a search engine of another organisation at a remote location. Optionally the query server can rank the search results according to prevalence growth metrics, or this can be carried out by a separate prevalence ranking server.

B) A web collection server, which directs one or more web crawlers to traverse the web, loading the pages they encounter into the web mirror database, which is used for later indexing and analysis. The web collection server controls which web sites are revisited and how often, so that changes in occurrences can be detected. The server maintains the web collection as a list of URLs of pages or web sites to be crawled. The crawlers are well-known devices or software and so need not be described in more detail here.

C) An index server, which builds a searchable index of all the web pages in the web mirror and stores it in the index; the index includes relevance ranking information, so that search result lists can be delivered to users ranked by relevance. The index is typically indexed by content ID and by the keywords contained in the content.

D) A content analyser server, which reviews the multimedia files collected in the web mirror, sorts them by category, and for each category derives a characteristic fingerprint (see below for more detail on this process) that serves as the fingerprint of the file. The fingerprints are stored in a database, held alongside the index written by the index server. This server can also act as a reference processor, arranged to find in a page references to other content items and to add a record of each reference to the record of occurrences of the content item concerned.

Web server programs are an integral part of the query server and the web crawler. They can be implemented by running Apache(TM) or a similar program, and they handle multiple simultaneous HTTP and FTP communication protocol sessions with users connected over the Internet. The query server is connected to a database that stores detailed device profile information on mobile and desktop devices, including information on the device screen size, device capabilities and in particular the capabilities of the browser or microbrowser running on the device. The database can also store individual user profile information, so that personalisation can be carried out to suit the needs of individual users. This information may or may not include usage history information.

The search engine system comprises the web crawler, the content analyser, the index server and the query server. It takes as its input a search query request from a user and returns as its output a prioritised list of search results. The relevance rankings for these search results are calculated by the search engine using a number of alternative techniques, described in more detail below.

The prevalence growth rate and prevalence acceleration measures are the main inputs used to calculate relevance. Changes in prevalence can indicate whether content is currently particularly popular or particularly unpopular, which can help the search engine improve relevance or efficiency. Some content, such as web pages, can be ranked by existing techniques known in the art, while multimedia content such as images and audio can be ranked by changes in prevalence. The type of ranking can be selected by the user. For example, the user can be offered a choice of searching by conventional citation-based measures such as Google(TM)'s PageRank(TM), or by other prevalence-related measures.

Method is described, Fig. 2,3,4

Fig. 2 has shown the general survey of various processing with the process flow diagram form.In step 200, webpage will be climbed and be got, and these webpages will be scanned or resolve, so that detect content item and create the fingerprint of each content item.These fingerprints will be kept in the fingerprint database, and are indexed by content item ID.In step 210, wherein will scan and create fingerprint to next webpage, in step 220, this fingerprint will be compared with the existing fingerprint of same media type, so that the appearance incident that identification repeats.In step 230, will write down the T/A (popularization tolerance) of repetition.In step 240, wherein will periodically visit the website WWW of regulation again and collect, and the page is rescaned, so that upgrade fingerprint database, and upgrade popularization thus.In step 250, wherein will calculate the popularization tolerance that event change rate and so on occurs.In step 260, wherein will calculate the content item ordering according to the popularization measure of variation.This processing will repeat for next webpage, and perhaps in step 270, at any time, querying server will make index of reference and/or tolerance and/or ordering that the data library inquiry is made response.

Figs. 3 and 4 show overviews of the steps performed by the content analyzer and by the query server process, respectively. In step 300, the content analyzer scans content items, typically from the web mirror. At 310, fingerprints are created. At 320, the fingerprints are compared to find duplicate occurrences. At 330, the server records the time of each occurrence and maintains a record of changes in the occurrences of a given content item. Fig. 4 shows the basic steps of the query server process. In step 400, a query is received. At 410, the index is used to find content items relevant to the query. At 420, the record of changes in occurrences of the given items is accessed. At 430, the process determines a response to the query according to those changes and, optionally, according to other parameters.

Query server, Fig. 5

Fig. 5 shows another embodiment of the operation of the query server. In this example, in step 500, it receives a keyword or keywords from the user. In step 510, the query server uses the index to find the top n thousand relevant content item IDs, in the form of documents or multimedia files (hits), according to the precomputed keyword ranking. In step 520, the fingerprint metrics server calculates prevalence growth, prevalence growth rate, and prevalence growth acceleration, and uses these metrics and the fingerprint database to calculate a ranking of the hits. Optionally, this calculation can also be weighted by measures based on the history or popularity of the website. In step 530, the query server uses the prevalence metrics, the prevalence ranking, and the keyword ranking to determine a combined ranking. In step 540, the query server returns the ranked results to the user, optionally tailored to the user's device, preferences, and so on. Alternatively, in step 550, the query server processes the results further, for example: to determine payments by returning the prevalence of a copyright work or advertisement; to provide feedback that focuses the web collection of websites used for updating the database, or that focuses the content analyzer; to provide graphical comparisons of metrics or trends; to provide extrapolations that estimate future prevalence; or to determine prices for advertisements or downloads according to the prevalence metrics. Other uses of the prevalence metrics can also be envisaged.
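
The combination in step 530 can be sketched as follows. The text does not say how the keyword ranking and the prevalence metrics are combined, so the weighted sum, the default weights, and the names `Hit` and `combined_rank` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    item_id: str
    keyword_score: float  # precomputed keyword relevance taken from the index
    velocity: float       # prevalence growth rate (occurrences per day)
    acceleration: float   # change in growth rate (occurrences per day squared)


def combined_rank(hits, w_keyword=1.0, w_velocity=10.0, w_accel=100.0):
    """Return hits sorted best-first by a weighted sum (an assumed rule)."""
    def score(hit):
        return (w_keyword * hit.keyword_score
                + w_velocity * hit.velocity
                + w_accel * hit.acceleration)
    return sorted(hits, key=score, reverse=True)
```

With these illustrative weights, an item with a moderate keyword score but fast-growing prevalence can outrank an item with a higher keyword score and flat prevalence.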

The query server can be arranged to enable searches more advanced than keyword searches, narrowing the search by date, geographic location, media type, and so on. The query server can also present the results graphically, showing a prevalence growth curve for one or more content items. The query server can also be arranged to extrapolate from the results, for example to predict the peak prevalence of a given content item. Another option is to show an indication of the confidence in the results, for example how frequently the relevant websites are revisited and how long it has been since the last occurrence was found, or other statistical parameters.

Content analyzer, Fig. 6

Fig. 6 shows another embodiment of the operation of the content analyzer. In this example, in step 600, a web page is scanned from the web mirror. In step 610, the media types of the files in the page are identified. In step 620, an analysis algorithm chosen according to each file's media type is applied to the file to derive its fingerprint. In step 630, the fingerprint is compared with the other fingerprints in the fingerprint database to look for a match. If a match is found, in step 640 the process increments the occurrence count in the database record and records a timestamp. Optionally, it also adds the new URL to the record, so that new occurrences can be weighted by location, or so that a backup URL is available. In step 650, if there is no match, the process creates a new record in the database with a timestamp. In step 660, any URLs in the page are analyzed and compared with the fingerprinted URLs in the fingerprint database or elsewhere. If a match is found, the process increments the back link count of the fingerprint to which the URL points. The same can be done for other types of reference, such as textual references to an author or title. In step 670, the process repeats for the next page. After a set period, the pages in the defined web collection are rescanned to determine their changes and to keep the prevalence change measures up to date, at least for this web collection. The web collection is selected to be representative.
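
The back link counting of step 660 can be sketched as below. The dictionary layout and the function name `count_backlinks` are hypothetical, and URL matching is reduced to exact string equality, which a real system would probably relax (for example by normalizing URLs first).

```python
def count_backlinks(page_urls, url_to_fingerprint, backlink_counts):
    """Increment the back link count of every fingerprint a page links to.

    page_urls:          URLs found in the scanned page
    url_to_fingerprint: known content URL -> fingerprint ID
    backlink_counts:    fingerprint ID -> running back link count (updated in place)
    """
    for url in page_urls:
        fingerprint_id = url_to_fingerprint.get(url)
        if fingerprint_id is not None:
            backlink_counts[fingerprint_id] = backlink_counts.get(fingerprint_id, 0) + 1
    return backlink_counts
```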

The different process steps are now discussed in more detail. Embodiments can have any combination of the features described, as suits the application.

Step 1: Determine a web collection of websites to monitor. The web collection should be large enough to provide a representative sample of websites containing the category of content to be monitored, and small enough to be revisited regularly and frequently (e.g. daily) by a set of web crawlers.

Step 2: Set web crawlers running against these websites, and create a mirror containing all the web pages within these websites.

Step 3: In each time period, scan the files in the web mirror and, for each given web page, identify the categories of the files referenced within the page (e.g. audio MIDI, audio MP3, image JPG, image PNG).

Step 4: For each category, apply the appropriate analyzer algorithm, which reads the file and looks for unique fingerprint information. This can be done with any type of fingerprinting (see some examples below).

Step 5: In each time period, for each page and each file found within it, compare the identifier information with the existing fingerprint database. Determine whether the fingerprint matches an existing fingerprint (either an exact match, or a match within a statistical probability bound, such as 99%, for determining that the content items are the same).

Step 6a: If the fingerprint does not match any fingerprint in the database, create a new fingerprint instance and link it, with a timestamp, to the URL of the web page where it was found, as a new database record. The information recorded in the database includes:

Multimedia content category: (e.g. audio)

Multimedia file type: (e.g. MP3)

File fingerprint: (typically a computed binary or ASCII sequence)

Web mirror URL:

Web page source URL:

Time the web page was stored in the mirror:

Time the file was identified (fingerprinted):

Step 6b: If the fingerprint does match an existing fingerprint in the database, increment the count for that identifier by 1, and record in the database the new URL and time information associated with the file (the time the web page was stored in the mirror and the time the file was identified).
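
Steps 6a and 6b amount to an insert-or-update on the fingerprint database. The sketch below uses an in-memory dictionary keyed by fingerprint. The record layout follows the fields listed in step 6a in simplified form, while the names `FingerprintRecord` and `record_occurrence`, and the choice to start a new record's count at 1, are assumptions. Only exact matches are handled; the probabilistic matching of step 5 is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class FingerprintRecord:
    media_category: str  # e.g. "audio"
    file_type: str       # e.g. "MP3"
    fingerprint: str     # binary or ASCII sequence, held here as a string
    count: int = 0
    # One (page_url, mirrored_at, identified_at) tuple per occurrence.
    occurrences: list = field(default_factory=list)


def record_occurrence(db, fingerprint, media_category, file_type,
                      page_url, mirrored_at, identified_at):
    """Step 6a: create a record if unseen. Step 6b: otherwise bump its count."""
    record = db.get(fingerprint)
    if record is None:
        record = FingerprintRecord(media_category, file_type, fingerprint)
        db[fingerprint] = record
    record.count += 1
    record.occurrences.append((page_url, mirrored_at, identified_at))
    return record
```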

Step 7: Over time, build a complete list of the number of occurrences of each fingerprint across the pages of the defined web collection of websites that is periodically searched. The occurrence values can be weighted, for example to favor occurrences on highly active websites. This can be determined from the back link count or from other measures, including which websites originate fast-growing content items, in which case the prevalence ranking server can feed back information to adjust the weighting.

The occurrence value can also take account of information other than duplication. The occurrence value (O) can be calculated as a weighted sum of duplicates, back links, and references, where:

Duplicates (= D) are duplicate copies of a content item at different web page locations, assessed by matching their respective fingerprints, including approximate matches.

Back links (= B) can include hypertext links to the content item, or to the web page that references or contains the particular content item, from other web pages.

References (= R) can include one or more of: extracts, summaries, reviews, translations, thumbnails, adaptations of the content item, or any other type of reference (provided the reference contains enough information from, or associated with, the original item that the relationship to the original can be inferred).

O = D + x(expB × C1) + y(expR × C2)

where x, y, C1 and C2 are constants, and expB and expR are exponential functions of B and R.

This algorithm is only an example, and many others can be envisaged. In practice, the algorithm can be changed regularly to counter commercial users who try to influence their rankings artificially.
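
A direct transcription of the example formula is sketched below. The grouping of the exponential terms is not fully clear from the text, so this sketch assumes the literal reading x * (exp(B) * C1) and y * (exp(R) * C2). The function name and the default constants are illustrative only.

```python
import math


def occurrence_value(duplicates, backlinks, references,
                     x=1.0, y=1.0, c1=1.0, c2=1.0):
    """O = D + x*(exp(B)*C1) + y*(exp(R)*C2), under the assumed grouping."""
    return (duplicates
            + x * (math.exp(backlinks) * c1)
            + y * (math.exp(references) * c2))
```

Under this reading the value grows very quickly with B and R, and `math.exp` raises `OverflowError` for arguments above roughly 709, so a practical variant would presumably scale or cap the counts before exponentiating.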

Step 8: Compare the total for each fingerprint with the total from the previous time period. From these time periods, calculate appropriate measures of the change in occurrences (e.g. velocity, acceleration), and write these values to the index for the corresponding fingerprint. These values are used to calculate a relevance ranking, which can also be written to the index.
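
The velocity and acceleration of step 8 can be estimated by finite differences over successive occurrence totals. The sketch below is one assumed way of doing so: velocity is the change in count divided by the elapsed days, and acceleration is the change in velocity divided by the time between the midpoints of the two intervals. The function names are hypothetical.

```python
def velocity(count_start, count_end, days):
    """Average change in occurrences per day over one time period."""
    return (count_end - count_start) / days


def acceleration(count1, count2, count3, days_12, days_23):
    """Change in velocity per day across two consecutive time periods."""
    v12 = velocity(count1, count2, days_12)
    v23 = velocity(count2, count3, days_23)
    # Time between the midpoints of the two periods.
    midpoint_gap = (days_12 + days_23) / 2
    return (v23 - v12) / midpoint_gap
```

For instance, counts of 10, 43, and 142 taken 33 days apart give velocities of 1 and 3 occurrences per day, and an acceleration of 2/33 occurrences per day squared.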

Step 9: When a search query is received with a keyword or combination of keywords and an associated content category (e.g. audio), the keyword or keywords are used as search terms against the index. The index then returns a list of web pages containing matching multimedia content files, ranked by the selected measure of change (e.g. velocity, acceleration) in the occurrences of the multimedia files they contain.

Step 10: The user selects a result page from the results list (or alternatively selects the extracted object), and can view or play the highly ranked multimedia objects referenced within that page.

The fingerprint can be of any type. Examples include any combination of the following aspects of the content item (typically, but not only, metadata):

- size

- image/frame size

- duration

- CRC (cyclic redundancy check) of some or all of the data

- embedded metadata, for example header fields of images, videos, and so on

- media type or MIME type
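
A minimal fingerprint built from three of the aspects listed above (MIME type, size, and a CRC of part of the data) might look like the sketch below. The string format, the function name `simple_fingerprint`, and the 64 KiB prefix are assumptions. This is an illustration of combining cheap attributes, not a robust content signature.

```python
import zlib


def simple_fingerprint(data: bytes, mime_type: str, prefix_bytes: int = 65536) -> str:
    """Combine MIME type, total size, and a CRC-32 of the leading bytes."""
    crc = zlib.crc32(data[:prefix_bytes]) & 0xFFFFFFFF
    return f"{mime_type}:{len(data)}:{crc:08x}"
```

Because it relies on exact bytes, this fingerprint only detects identical copies. Re-encoded or trimmed files would need the approximate matching mentioned in step 5.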

At present, analyzing the entire contents of all types of multimedia file on a large scale would be computationally very expensive. However, techniques exist to reduce this burden. For music files, one technique is to analyze the content near the start of the file and process it to extract a unique signature or identifier as the fingerprint. MIDI files are easier to handle, because they are small and inherently contain digital rather than analog information. There are also systems that can identify music files with high accuracy (Shazam™, Snocap™). Corresponding signatures can be envisaged for video files and other file types.

Web collections, Fig. 7

Fig. 7 shows an example of a database of web collections. Three web collections are shown, but there can be more. Web collection 700 is for video content and has lists of pages or URLs, or preferably lists of websites or URLs, organized by subject, in other words by content category such as sport, pop music, shopping, and so on. Web collection 710 is for audio content and likewise has URL lists for different subjects. Web collection 720 is for image content and also has URL lists for different subjects. The web collections are used when there are so many content items that updating the prevalence metrics by revisiting all of them is impractical. A web collection is therefore a representative selection of popular or active websites that are revisited more frequently, but the selection must be large enough that prevalence, or at least relative changes in prevalence, can be monitored accurately.

A web collection server 730 is provided to maintain the web collections, keeping them representative and controlling the timing of revisits. Different media types or subject categories may have different requirements for update frequency or web collection size. The revisit frequency can be adapted according to the prevalence growth rate and prevalence acceleration measurements generated by the prevalence ranking server. For example, the revisit frequency can be adjusted upward automatically for websites associated with relatively high prevalence growth rates and prevalence acceleration values, and downward automatically for websites with relatively low values. The adaptation can also be based on which websites rank higher, where the ranking is by keyword or by back links. Updates can also be made manually. To control revisits, the web collection server can feed a stream of URLs to the web crawlers, and can alert the content analyzer to which pages in the mirror have been updated and should be rescanned for changes in content items. The content analyzer can be arranged to carry out an initial check, to establish whether a web page has changed since the last scan, before it carries out full fingerprinting of all the files in the page.
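
The automatic adjustment of revisit frequency can be sketched as a simple rule that shortens the revisit interval for fast-changing websites and lengthens it for quiet ones. The thresholds, the halving and doubling rule, the clamping limits, and the name `next_revisit_interval` are all assumptions. The text only says that the frequency is adjusted upward or downward.

```python
def next_revisit_interval(current_days, velocity, acceleration,
                          high=1.0, low=0.1,
                          min_days=1.0, max_days=30.0):
    """Return the next revisit interval in days for one website.

    Sites with high prevalence activity are revisited twice as often;
    sites with low activity half as often. The result is clamped to
    the range [min_days, max_days].
    """
    activity = abs(velocity) + abs(acceleration)
    if activity >= high:
        current_days = current_days / 2
    elif activity <= low:
        current_days = current_days * 2
    return max(min_days, min(max_days, current_days))
```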

Databases, Figs. 8, 9

Fig. 8 shows an example extract of the fingerprint database, with one record per row. Three rows are shown, but in practice there may be millions. Each fingerprint has a record containing the fingerprint value, followed by the original or originating URL, a keyword list (e.g. SINGER, BEATLES, PENNY LANE), and the media type (e.g. RINGTONE), followed by a series of occurrence values (Count1, Count2, ...) at different dates (T1, T2, ...). The occurrence values can be simple counts or, as described above, more complex values formed by combining weighted counts with weighted numbers of references to the content item. The record can also include other calculated metrics, such as the prevalence velocity V12 over a given period T1 to T2 (e.g. (Count2 - Count1)/33 days), and the prevalence acceleration A123 over a given period (T1 to T3). Many other metrics can be envisaged, depending on the application. A fingerprint reference can include associated metadata, such as its media type, URL, address in the fingerprint database, and so on.

Fig. 9 shows an example of an index with scores, showing multiple rows for a series of content items (identified in this example by URLs pointing to the originating content or to its copy in the web mirror). For a given column, all content items containing the given keyword are recorded. Each record in this example has four parts (more could be used), set out in four columns. The first column shows the URL of the page containing the content item. The next column has the fingerprint ID, in the form of a pointer to the fingerprint record in the fingerprint database. The third column of each record has the keyword score for that keyword in the given document. The fourth column shows the keyword rank of that score relative to other scores for the same keyword. Eight rows are shown, giving the top two content items for each keyword, but in practice the number of content items may be in the millions. The purpose of the index is to let the query server readily obtain the highest-scoring content items for a given keyword and produce a list of candidate content items, which the ranking server can then rank according to the prevalence metrics.

The index server creates the index, and can keep adding to it as new content items are crawled and fingerprinted, using information from the content analyzer or the fingerprint database. Each row has multiple columns for different keywords. The keyword score (e.g. 654) represents a composite relevance score, based for example on the number of hits in the content item and an indication of the positions of the keyword within the content item. More weight can be given, for example, to hits in the URL, title, author text, or meta tags than to hits in the body of the content item. Non-text items such as audio and image files can be included by looking for hits in their metadata, or by looking for key patterns such as audio signatures or images. In some embodiments the prevalence metrics can be used as an input to this score, as an alternative or in addition to the subsequent step of ranking the candidate content items by the prevalence metrics. In the example shown, a keyword score for the document (e.g. 041) is recorded.

Next to the score is the keyword rank, e.g. 12, meaning that there are currently 11 other items with greater relevance to this keyword. The query server can therefore use the index to obtain a list of the candidate items (in practice their fingerprint IDs) most relevant to a given keyword. The ranking server can then rank the selected candidates.
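
The index lookup described for Fig. 9 can be sketched as a per-keyword list of scored entries from which the top candidates are taken. The class name `KeywordIndex`, its methods, and the in-memory layout are hypothetical. The keyword rank is derived here at query time by sorting, whereas the text describes it as stored in the index.

```python
from collections import defaultdict


class KeywordIndex:
    def __init__(self):
        # keyword -> list of (score, page_url, fingerprint_id)
        self._columns = defaultdict(list)

    def add(self, keyword, page_url, fingerprint_id, score):
        self._columns[keyword].append((score, page_url, fingerprint_id))

    def top_candidates(self, keyword, n):
        """Return up to n (rank, fingerprint_id, score) tuples; rank 1 is best."""
        entries = sorted(self._columns.get(keyword, []), reverse=True)
        return [(rank, fingerprint_id, score)
                for rank, (score, _url, fingerprint_id)
                in enumerate(entries[:n], start=1)]
```

The returned fingerprint IDs would then be passed to the ranking server, which re-ranks the candidates by their prevalence metrics.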

For a large uncontrolled set of content items such as the web, indexing typically also involves a parsing operation before indexing, to handle the many inconsistencies and errors in the data items. A lexicon of all possible keywords can be maintained and shared between multiple index servers working in parallel. The lexicon can itself be a large entity of millions of words. Indexing also typically includes sorting the results and generating rank values. The indexer can also parse out all the hyperlinks in each web page and store information about them in an anchors file. This can be used to determine where each link points from and to, and the text of the link.

Content analyzer, Fig. 10

Fig. 10 shows a schematic of an example of a content analyzer with fingerprint generators for the various media types. Pages containing content items are scanned, and items of different media types are found and passed to the fingerprint generators 800. Each of these processes or servers can create and compare fingerprints as described above, and build one or more fingerprint databases as described above. The database can have embedded or separate storage holding an index that points to the fingerprint IDs in the fingerprint database, and records of rankings and metrics. Fig. 10 shows how the query server 50 can access these records and the index. The query server is also arranged to access device information 830 and user history 840.

Other features

In an alternative embodiment, the search is not of the whole web but of a limited part of the web or of a given database.

In another alternative embodiment, the query server also acts as a metasearch engine, commissioning other search engines (e.g. Google™, Yahoo™, MSN™) to provide results, and merging results from more than one source.

In an alternative embodiment, the web mirror is used to derive content summaries of the content items. These can be used to form the search results, to give more useful information than a URL or keyword list. This is particularly useful for large content items such as video files. The summaries can be stored with the fingerprints but, because they serve a different purpose from the keywords, in many cases they are not the same. A content summary can comprise an aspect of a web page (e.g. from the web, an intranet, or another online information database) that can be derived, extracted, or parsed from the web page as a discrete unit of useful information. It is called a summary because it is an abridged version of the original that the user can understand.

Example types of content summary include (but are not limited to) the following:

● web page text, where the content summary is a contiguous stretch of the important information-bearing text from the web page, with all graphics and navigation elements removed.

● news stories, including web pages and news feeds such as RSS, where the content summary is a text extract from the original news item, plus the headline, date, and news source.

● images, where the content summary is a small thumbnail representation of the original image, plus metadata such as the file name, creation date, and the website where the image was found.

● ringtones, where the content summary is an initial fragment of the ringtone audio file, plus metadata such as the ringtone name, format type, price, creation date, and the vendor website where the ringtone was found.

● video clips, where the content summary is a very small set (e.g. 4) of still images extracted from the video file and arranged as an animated sequence, plus metadata.

The web server can be a PC-type computer or another general-purpose computer capable of running widely available HTTP (Hypertext Transfer Protocol) compatible server software. The web server is connected to the Internet 30. These systems can be implemented on many hardware and software platforms.

The query server, and the servers for indexing, computing metrics, and crawling or metacrawling, can be implemented using standard hardware. In general, the hardware components of any server comprise: a central processing unit (CPU), an input/output (I/O) controller, a system power supply and clock source; a display driver; RAM; ROM; and a hard disk drive. A network interface provides a connection to a computer network such as Ethernet, TCP/IP, or other popular protocol network interfaces. The functionality can be implemented in software residing on a computer-readable medium (such as a hard disk drive, RAM, or ROM). A typical software hierarchy for the system can include a BIOS (Basic Input/Output System), a set of low-level computer hardware instructions, usually stored in ROM, for communication between an operating system, one or more device drivers, and the hardware. Device drivers are hardware-specific code used for communication between the operating system and hardware peripherals. Applications are software applications, typically written in C/C++, Java, assembler, or an equivalent language, that implement the desired functionality. They run on top of the operating system and depend on it to interact with other software code and with the hardware. The operating system loads after the BIOS initializes, and controls and runs the hardware. Examples of operating systems include Linux™, Solaris™, Unix™, OSX™, Windows XP™, and equivalents.

Claims

1. A search engine for searching content items accessible online, the search engine having a query server arranged to receive a search query from a user and return search results relevant to the search query, the query server being arranged to identify one or more of the content items relevant to the query, to access a record of changes over time of occurrences of the identified content items, and to derive the search results according to the record of changes.

2. The search engine of claim 1, arranged to rank the search results according to the record of changes.

3. The search engine of claim 1 or 2, having a content analyzer arranged to create a fingerprint for each content item, maintain a fingerprint database of the fingerprints, compare the fingerprints to determine a number of occurrences of a given content item at a given time, and create the record of changes over time of the occurrences.

4. The search engine of claim 1, 2 or 3, wherein the occurrences comprise duplicates of the content item at different web page locations.

5. The search engine of claim 4, wherein the occurrences additionally comprise references to a given content item, the references comprising one or more of: hyperlinks to the given content item, hyperlinks to a web page containing the given content item, and other types of references.

6. The search engine of claim 5, arranged to determine a value representing the occurrences from a weighted combination of the duplicates, hyperlinks, and other types of references.

7. The search engine of claim 6, arranged to weight the duplicates, hyperlinks, and other types of references according to one or more of: their type, their location, so as to favor occurrences at locations associated with more activity, and other parameters.

8. The search engine of any preceding claim when dependent on claim 2, comprising an index to a database of the content items, the query server being arranged to use the index to select a number of candidate content items, and then to rank the candidate content items according to the record of changes over time of the occurrences of the candidate content items.

9. The search engine of claim 8, having a prevalence ranking server arranged to rank the candidate content items according to one or more of: a number of occurrences, a number of occurrences within a given range of dates, a rate of change of the occurrences, a rate of change of the rate of change of the occurrences, and a quality metric of the websites associated with the occurrences.

10. The search engine of any preceding claim when dependent on claim 3, the content analyzer being arranged to create the fingerprint according to the media type of the content item, and to compare it with existing fingerprints of content items of the same media type.

11. The search engine of any preceding claim when dependent on claim 3, the content analyzer being arranged to create fingerprints which comprise, for hypertext content items, any combination of: file size, CRC (cyclic redundancy check), timestamp, keywords, and title; and which comprise, for sound, image, or video content items, any combination of: image/frame size, duration, CRC (cyclic redundancy check) of some or all of the data, embedded metadata, header fields of the image or video, media type, MIME type, thumbnail, and sound signature.

12. The search engine of any preceding claim when dependent on claim 2, having a web collection server arranged to determine which websites on the web to revisit, and how frequently, to provide content items to the content analyzer.

13. The search engine of claim 12, the web collection server being arranged to determine the revisits according to one or more of: the media type of the content items, the subject category of the content items, and the record of changes in occurrences of the content items associated with the websites.

14. The search engine of any preceding claim when dependent on claim 2, the search results comprising a list of content items and an indication of the ranking of the listed content items according to the changes over time of their occurrences.

15. A content analyzer for a search engine, the search engine being arranged to create a record of changes over time of occurrences of content items accessible online, the content analyzer having a fingerprint generator arranged to create a fingerprint for each content item and to compare the fingerprints to determine a number of occurrences of the same content item, the content analyzer being arranged to store the fingerprints in a fingerprint database and to maintain the record of changes over time of the occurrences of at least some of the content items, for use in responding to search queries.

16. The content analyzer of claim 15, arranged to identify the media type of each content item, the fingerprint generator being arranged to carry out the fingerprint creation and comparison according to the media type.

17. The content analyzer of claim 15 or 16, having a reference processor arranged to find in a page references to other content items, and to add a record of the reference to the record of occurrences of the referenced content item.

18. The content analyzer of claim 15, 16 or 17, the fingerprint generator being arranged to create fingerprints which comprise, for hypertext content items, any combination of: file size, CRC (cyclic redundancy check), timestamp, keywords, and title; and which comprise, for sound, image, or video content items, any combination of: image/frame size, duration, CRC (cyclic redundancy check) of some or all of the data, embedded metadata, header fields of the image or video, media type, MIME type, thumbnail, and sound signature.

19. A fingerprint database created by the content analyzer of any of claims 15 to 18, the fingerprint database storing fingerprints of content items.

20. The fingerprint database of claim 19, having a record of changes over time of occurrences of the content items.

21. A method of using a search engine having a record of changes over time of occurrences of given content items accessible online, the method having the steps of: sending a query to the search engine, and receiving from the search engine search results relevant to the search query, the search results being ranked using the record of changes over time of occurrences of the content items relevant to the query.

22. The method of claim 21, the search results comprising a list of content items and an indication of the ranking of the listed content items according to the changes over time of their occurrences.

23. A program on a machine-readable medium, arranged to carry out a method of searching content items accessible online, the method having the steps of: receiving a search query, identifying one or more of the content items relevant to the query, accessing a record of changes over time of occurrences of the identified content items, and returning search results according to the record of changes.

24. The program of claim 23, arranged to use the search results for one or more of: measuring the prevalence of a copyright work; measuring the prevalence of an advertisement; focusing a web collection of websites for a crawler to crawl, according to which websites have more changes in occurrences of content items; focusing a content analyzer to update parts of a fingerprint database from websites having more changes in occurrences of content items; extrapolating from the record of changes in occurrences of a given content item to estimate future prevalence; pricing advertising according to a rate of change of occurrences; and pricing downloads of content items according to a rate of change of occurrences.