CN101361301A

CN101361301A - Detecting repeating content in broadcast media

Info

Publication number: CN101361301A
Application number: CNA2006800515590A
Authority: CN
Inventors: 舒梅特·巴卢哈; 米歇尔·科维尔; 迈克尔·芬克
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2005-11-29
Filing date: 2006-11-27
Publication date: 2009-02-04
Also published as: CN101517550A; CN101517550B

Abstract

Systems, methods, devices, and computer program products provide social and interactive applications for detecting repeating content in broadcast media. In some implementations, a method includes: generating a database of audio statistics from content; generating a query from the database of audio statistics; running the query against the database of audio statistics to determine a non-identity match; if a non-identity match exists, identifying the content corresponding to the matched query as repeating content.

Description

Detecting repeating content in broadcast media

Related application

This application claims the benefit of priority of U.S. Provisional Patent Application No. 60/740,760, entitled "Environment-Based Referrals," filed November 29, 2005, which is incorporated herein by reference in its entirety.

This application claims the benefit of priority of U.S. Provisional Patent Application No. 60/823,881, entitled "Audio Identification Based on Signatures," filed August 29, 2006, which is incorporated herein by reference in its entirety.

This application is related to U.S. Patent Application No. ________________, entitled "Determining Popularity Ratings Using Social and Interactive Applications for Mass Media," filed November 27, 2006, Attorney Docket No. GP-672-00-US/16113-0630001, and to U.S. Patent Application No. ________________, entitled "Social and Interactive Application For Mass Media," filed November 27, 2006, Attorney Docket No. GP-636-00-US/16113-060001. Each of these patent applications is incorporated herein by reference in its entirety.

Technical field

The disclosed implementations relate to social and interactive applications for mass media.

Background technology

Conventional television and interactive television systems lack the ability to detect replays of advertisements embedded in television programs. Conventional recording devices allow users to store television programs (including advertisements) for replay at a later date or time. A common complaint of broadcasters is that they cannot profit from these replays; from the broadcaster's perspective, a replay amounts to "free" advertising for the advertiser who bought airtime in the original broadcast.

Summary of the invention

The deficiencies described above are addressed by the disclosed systems, methods, apparatuses, user interfaces, and computer program products for detecting repeating content in broadcast media.

In some implementations, a method includes: generating a query from a database of audio statistics; running the query against the database of audio statistics to determine a non-identity match; and, if a non-identity match exists, identifying the content corresponding to the matched query as repeating content.

In some implementations, a system includes a processor and a computer-readable medium operatively coupled to the processor. The computer-readable medium includes instructions which, when executed by the processor, cause the processor to perform the following operations: generating a query from a database of audio statistics, where the audio statistics are generated from content; running the query against the database of audio statistics to determine a non-identity match; and, if a non-identity match is found, identifying the content corresponding to the matched query as repeating content.
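By way of illustration only, the query-against-own-database idea can be sketched as follows. The sketch assumes the database of audio statistics is a list of per-frame integer descriptors and that a "non-identity match" means a match at a position that does not overlap the query itself. The function names, the window length, and the bit-error budget are hypothetical, and the brute-force scan stands in for whatever indexing a real system would use.

```python
def hamming(a, b):
    """Number of differing bits between two integer descriptors."""
    return bin(a ^ b).count("1")


def find_repeats(stats, window, max_bit_errors, min_gap=None):
    """Return (query_start, match_start) pairs of repeated windows.

    stats: list of per-frame integer descriptors (the assumed database).
    window: number of consecutive frames that make up one query.
    max_bit_errors: total bit differences tolerated across the window.
    min_gap: minimum distance between query and match; defaults to the
             window length so that a query never matches itself.
    """
    min_gap = window if min_gap is None else min_gap
    last_start = len(stats) - window
    repeats = []
    for q in range(last_start + 1):
        query = stats[q:q + window]  # query generated from the database itself
        for m in range(q + min_gap, last_start + 1):
            candidate = stats[m:m + window]
            errors = sum(hamming(a, b) for a, b in zip(query, candidate))
            if errors <= max_bit_errors:
                repeats.append((q, m))  # non-identity match: repeating content
    return repeats


print(find_repeats([1, 2, 3, 4, 9, 9, 1, 2, 3, 4], window=4, max_bit_errors=0))
```

The scan is quadratic in the number of frames, so it is suitable only as a demonstration of the logic.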

Other implementations are directed to systems, methods, apparatuses, user interfaces, and computer program products.

Description of drawings

Fig. 1 is a block diagram of one embodiment of a mass personalization system.

Fig. 2 shows one embodiment of an ambient audio identification system, including the client-side interface shown in Fig. 1.

Fig. 3 is a flow chart of one embodiment of a process for providing mass personalization applications.

Fig. 4 is a flow chart of one embodiment of an audio fingerprinting process.

Fig. 5 is a flow chart of one embodiment of a user interface for interacting with mass personalization applications.

Fig. 6 is a block diagram of one embodiment of a hardware architecture for a client system that implements the client-side interface shown in Fig. 1.

Fig. 7 is a flow chart of one embodiment of a repetition detection process.

Detailed description

Mass personalization applications

Mass personalization applications provide personalized and interactive information related to mass media broadcasts (e.g., television, radio, movies, Internet broadcasts, etc.). Such applications include, but are not limited to: personalized information layers, ad hoc social peer communities, real-time popularity ratings, and video (or audio) bookmarks. Although some of the mass media examples disclosed here are in the context of television broadcasts, the disclosed implementations are equally applicable to radio and/or music broadcasts.

Personalized information layers provide complementary information to the mass media channel. Examples of personalized information layers include, but are not limited to: fashion, politics, business, health, travel, etc. For example, while a viewer watches a news segment about a celebrity, a fashion layer is presented on a television screen or a computer display device, providing information and/or images related to the clothes and accessories the celebrity is wearing in the news segment. Personalized layers can also include advertisements promoting products or services related to the news segment, such as a link to a clothing store that sells the clothes the celebrity is wearing.

Ad hoc social peer communities provide a venue for commentary between users who are watching the same television program or listening to the same radio station. For example, users watching the latest CNN headlines can be provided with a commenting medium (e.g., a chat room, message board, wiki page, video link, etc.) that allows them to chat, comment on, or read other viewers' responses to the ongoing mass media broadcast.

Real-time popularity ratings provide content providers and users with ratings information (similar to Nielsen ratings). For example, a user can be provided instantaneously with real-time popularity ratings of the television channels or radio stations being watched or listened to by the user's social network and/or by people with similar demographics.

Video or audio bookmarks provide users with a low-effort way to create personalized libraries of their favorite broadcast content. For example, a user can simply press a button on a computer or remote control device to record, process, and save a segment of the ambient audio and/or video of the broadcast content. The segment can be used as a bookmark pointing to the program, or to a portion of the program, for later viewing. The bookmark can be shared with friends or saved for future personal reference.

Mass personalization network

Fig. 1 is a block diagram of a mass personalization system 100 for providing mass personalization applications. System 100 includes one or more client-side interfaces 102, an audio database server 104, and a social application server 106, all of which communicate over a network 108 (e.g., the Internet, an intranet, a LAN, a wireless network, etc.).

Client-side interface 102 can be any device that allows a user to enter and receive information and that can present a user interface on a display device, including but not limited to: a desktop or portable computer; an electronic device; a telephone; a mobile phone; a display system; a television; a computer monitor; a navigation system; a portable media player/recorder; a personal digital assistant (PDA); a game console; a handheld electronic device; and an embedded electronic device or appliance. Client-side interface 102 is described more fully with respect to Fig. 2.

In some implementations, client-side interface 102 includes an ambient audio detector (e.g., a microphone) for monitoring and recording the ambient audio of a mass media broadcast in a broadcast environment (e.g., a user's living room). One or more ambient audio segments, or "snippets," are converted into distinctive and robust statistical summaries, referred to as "audio fingerprints" or "descriptors." In some implementations, the descriptors are compressed files containing one or more audio signature components that can be compared against a database of previously generated reference descriptors or statistics associated with mass media broadcasts.

A technique for generating audio fingerprints for music identification is described in Ke, Y., Hoiem, D., Sukthankar, R. (2005), Computer Vision for Music Identification, Proc. Computer Vision and Pattern Recognition, which is incorporated herein by reference in its entirety. In some implementations, the music identification approach proposed by Ke et al. is adapted to generate descriptors for television audio data and queries, as described with respect to Fig. 4.

A technique for generating audio descriptors using wavelets is described in U.S. Provisional Patent Application No. 60/823,881, entitled "Audio Identification Based on Signatures." That application describes a technique that combines computer-vision techniques with large-scale data stream processing algorithms to create compact descriptors/fingerprints of audio snippets that can be matched efficiently. The technique uses wavelets, a well-known mathematical tool for hierarchically decomposing functions.

In "Audio Identification Based on Signatures," an implementation of the retrieval process includes the following steps: 1) given the audio spectra of an audio snippet, extract spectral images of, for example, 11.6*w ms duration, at random intervals averaging d ms apart. For each spectral image: 2) compute wavelets on the spectral image; 3) extract the top-t wavelets; 4) create a binary representation of the top-t wavelets; 5) use min-hash to create a sub-fingerprint of the top-t wavelets; 6) use LSH with b bins and l hash tables to find sub-fingerprint segments that are close matches; 7) discard sub-fingerprints with fewer than v matches; 8) compute the Hamming distance from the remaining candidate sub-fingerprints to the query sub-fingerprint; and 9) use dynamic programming to combine the matches across time.
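Steps 3) through 5) above can be illustrated with the following minimal sketch. It assumes the wavelet coefficients of one spectral image are already available as a flat list, and it assumes a two-bit sign encoding (10 for a retained positive coefficient, 01 for a retained negative coefficient, 00 otherwise) that is not specified in the text above. The function names, the number of permutations, and the seed are hypothetical.

```python
import random


def top_t_binary(coeffs, t):
    """Keep the t largest-magnitude coefficients and encode each position
    with two bits: 10 = kept and positive, 01 = kept and negative, 00 = other."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:t])
    bits = []
    for i, c in enumerate(coeffs):
        if i in keep and c > 0:
            bits += [1, 0]
        elif i in keep and c < 0:
            bits += [0, 1]
        else:
            bits += [0, 0]
    return bits


def min_hash(bits, num_perms, seed=0):
    """For each pseudo-random permutation, record the rank of the first set bit.
    Similar bit vectors tend to agree on many of these ranks."""
    rng = random.Random(seed)  # fixed seed so query and reference share permutations
    n = len(bits)
    signature = []
    for _ in range(num_perms):
        perm = list(range(n))
        rng.shuffle(perm)
        rank = next((r for r, idx in enumerate(perm) if bits[idx]), n)
        signature.append(rank)
    return signature


bits = top_t_binary([0.5, -3.0, 2.0, 0.1], t=2)
sub_fingerprint = min_hash(bits, num_perms=8, seed=1)
```

Because the permutations are derived from a shared seed, the same spectral image always yields the same sub-fingerprint, which is what allows the sub-fingerprints to be compared or hashed in the later steps.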

In some implementations, the descriptors, together with an associated user identifier ("user id") that identifies client-side interface 102, are sent over network 108 to audio database server 104. Audio database server 104 compares the descriptors with a plurality of reference descriptors that were previously determined and stored in an audio database 110 coupled to audio database server 104. In some implementations, audio database server 104 continually updates the reference descriptors stored in audio database 110 from recent mass media broadcasts.

Audio database server 104 determines the best matches between the received descriptors and the reference descriptors and sends the best-match information to social application server 106. The matching process is described more fully with respect to Fig. 4.

In some implementations, social application server 106 accepts Web browser connections associated with client-side interface 102. Using the best-match information, social application server 106 aggregates personalized information for the user and sends it to client-side interface 102. The personalized information can include, but is not limited to: advertisements, personalized information layers, popularity ratings, and information associated with a commenting medium (e.g., ad hoc social peer communities, forums, discussion groups, video conferences, etc.).

In some implementations, the personalized information can be used to create chat rooms for viewers without knowing in real time which program the viewers are watching. The chat rooms can be created by directly comparing the descriptors in the data streams transmitted by client systems to determine matches. That is, chat rooms can be created around viewers who have matching descriptors. In such an implementation, there is no need to compare the descriptors received from viewers against reference descriptors.
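A minimal sketch of grouping viewers by direct descriptor comparison follows. It assumes each viewer contributes a single, time-aligned 32-bit descriptor and uses a greedy assignment with a hypothetical Hamming-distance tolerance; a practical system would compare sequences of descriptors and handle clock offsets between clients. All names are illustrative.

```python
def hamming(a, b):
    """Number of differing bits between two integer descriptors."""
    return bin(a ^ b).count("1")


def group_viewers(latest, max_bits=2):
    """Greedily place each viewer in the first room whose seed descriptor
    is within max_bits of the viewer's descriptor; otherwise open a new room.

    latest: dict mapping viewer id -> most recent descriptor (assumed aligned).
    Returns a list of rooms, each a list of viewer ids.
    """
    rooms = []  # list of (seed_descriptor, member_ids)
    for viewer, desc in latest.items():
        for seed, members in rooms:
            if hamming(seed, desc) <= max_bits:
                members.append(viewer)
                break
        else:
            rooms.append((desc, [viewer]))
    return [members for _, members in rooms]


rooms = group_viewers({"ann": 0b1111, "bob": 0b1110, "cy": 0xFFFF0000})
```

No reference database is consulted: the rooms are formed purely from agreement among the viewers' own descriptors.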

In some implementations, social application server 106 serves a web page to client-side interface 102, where it is received and displayed by a Web browser (e.g., Microsoft Internet Explorer™) running on client-side interface 102.

Other implementations of system 100 are possible. For example, system 100 can include multiple audio databases 110, audio database servers 104, and/or social application servers 106. Alternatively, audio database server 104 and social application server 106 can be a single server or system, or part of a network resource and/or service. Also, network 108 can include multiple networks and links operatively coupled together in various topologies and arrangements using a variety of network devices (e.g., hubs, routers, etc.) and media (e.g., copper, optical fiber, radio frequencies, etc.). The client-server architecture is described here only as an example; other computer architectures are possible.

Ambient audio identification system

Fig. 2 shows an ambient audio identification system 200, including client-side interface 102 as shown in Fig. 1. System 200 includes a mass media system 202 (e.g., a television set, radio, computer, electronic device, mobile phone, game console, network appliance, etc.), an ambient audio detector 204, client-side interface 102 (e.g., a desktop or laptop computer, etc.), and a network access device 206. In some implementations, client-side interface 102 includes a display device 210 for presenting a user interface (UI) 208 that enables the user to interact with mass personalization applications, as described with respect to Fig. 5.

In operation, mass media system 202 generates the ambient audio of a mass media broadcast (e.g., television audio), which is detected by ambient audio detector 204. Ambient audio detector 204 can be any device that can detect ambient audio, including a freestanding microphone and a microphone integrated with client-side interface 102. Client-side interface 102 encodes the detected ambient audio to provide descriptors that identify the ambient audio. The descriptors are sent to audio database server 104 via network access device 206 and network 108.

In some implementations, client software running on client-side interface 102 continually monitors and records n-second audio files ("snippets") of ambient audio (e.g., 5 seconds). The snippets are then converted into m frames (e.g., 415 frames) of k-bit encoded descriptors (e.g., 32 bits), according to the process described with respect to Fig. 4. In some implementations, monitoring and recording are event-based. For example, monitoring and recording can start automatically on a specified date and at a specified time (e.g., Monday at 8:00 p.m.) and continue for a specified duration (e.g., between 8:00 and 9:00 p.m.). Alternatively, monitoring and recording can start in response to user input (e.g., a mouse click, a function key, or a key combination) from a control device (e.g., a remote control, etc.). In some implementations, the ambient audio is encoded using a streaming variation of the 32-bit/frame discriminative features described by Ke et al.

In some implementations, the client software runs as a "sidebar" or other user interface element. That way, when client-side interface 102 is booted up, ambient audio sampling can start immediately and run in the background, with results (optionally) displayed in the sidebar without invoking a full Web browser session.

In some implementations, ambient audio sampling can begin when client-side interface 102 is booted up, or when the viewer logs into a service or application (e.g., email, etc.).

The descriptors are sent to audio database server 104. In some implementations, the descriptors are compressed statistical summaries of the ambient audio, as described by Ke et al. Sending statistical summaries preserves the user's acoustic privacy because the summaries are not reversible; that is, the original audio cannot be recovered from the descriptors. Thus, any conversations by the user or other individuals that are monitored and recorded in the broadcast environment cannot be reproduced from the descriptors. In some implementations, the descriptors can be encrypted for extra privacy and security using one or more known encryption techniques (e.g., asymmetric or symmetric key encryption, elliptic curve encryption, etc.).

In some implementations, the descriptors are sent to audio database server 104 as a query submission (also referred to as a query descriptor) in response to a trigger event detected by a monitoring process at client-side interface 102. For example, a trigger event could be the opening theme of a television program (e.g., the opening tune of "Seinfeld") or dialogue spoken by the actors. In some implementations, the query descriptors can be sent to audio database server 104 as part of a continuous streaming process. In some implementations, the query descriptors can be sent to audio database server 104 in response to user input (e.g., via a remote control, a mouse click, etc.).

Mass personalization process

Fig. 3 is a flow chart of a mass personalization process 300. The steps of process 300 do not have to be completed in any particular order, and at least some steps can be performed at the same time in a multi-threading or parallel processing environment.

Process 300 begins when a client-side interface (e.g., client-side interface 102) monitors and records snippets of the ambient audio of a mass media broadcast in a broadcast environment (302). The recorded ambient audio snippets are encoded into descriptors (e.g., compressed statistical summaries), which can be sent to an audio database server as queries (304). The audio database server compares the queries against a database of reference descriptors computed from mass media broadcast statistics to determine the candidate descriptors that best match the query (308). The candidate descriptors are sent to a social application server or other network resource, which uses them to aggregate personalized information for the user (310). For example, if the user is watching the television show "Seinfeld," the query descriptors generated from the show's ambient audio will match reference descriptors derived from previous "Seinfeld" broadcasts. The best-matching candidate descriptors are then used to aggregate personalized information related to "Seinfeld" (e.g., news stories, discussion groups, links to ad hoc social peer communities or chat rooms, advertisements, etc.). In some implementations, the matching procedure is performed efficiently using hashing techniques (e.g., direct hashing or locality-sensitive hashing (LSH)) to obtain a short list of candidate descriptors, as described with respect to Fig. 4. The candidate descriptors are then processed in a validation procedure, as described by Ke et al.

In some implementations, query descriptors from different viewers are matched directly, rather than matching each query against a database of reference descriptors. Such an embodiment enables the creation of ad hoc social peer communities around subject matter for which a database of reference descriptors is not available. Such an embodiment can match, in real time, viewers who are in the same public venue (e.g., a stadium, a bar, etc.) and who are using portable electronic devices (e.g., mobile phones, PDAs, etc.).

Popularity ratings

In some implementations, real-time and aggregate statistics are inferred from the list of viewers currently watching a broadcast (e.g., a show, an advertisement, etc.). These statistics can be gathered in the background while viewers are using other applications. The statistics can include, but are not limited to: 1) the average number of viewers watching the broadcast; 2) the average number of times viewers watched the broadcast; 3) other shows the viewers watched; 4) the minimum and peak number of viewers; 5) what viewers most often switched to when they left a broadcast; 6) how long viewers watch a broadcast; 7) how many times viewers flip channels; 8) which advertisements viewers watched; and 9) what viewers most often switched from when they entered a broadcast. From these statistics, one or more popularity ratings can be determined.

The statistics used to generate popularity ratings can be produced using a counter for each broadcast channel being monitored. In some implementations, the counters can be intersected with demographic group data or geographic group data. Viewers can use popularity ratings to "see what's hot" while a broadcast is ongoing (e.g., by noticing an increasing rating during the 2004 Super Bowl halftime performance). Advertisers and content providers can also use popularity ratings to dynamically adjust the material shown in response to ratings. This is especially true for advertisements, since the short unit length and the numerous versions of advertisements produced by advertising campaigns are easily exchanged to suit viewer rating levels. Other examples of statistics include, but are not limited to: the popularity of television broadcasts versus radio broadcasts by demographics or time; the popularity of times of day (i.e., peak watching/listening times); the number of households in a given area; the amount of channel surfing during particular shows (genre of show, particular times of day); the volume of the broadcast; etc.
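The per-channel counters and their intersection with demographic groups can be sketched as follows. The class name, the channel identifiers, and the group labels are hypothetical, and the "share" measure is only one simple example of a rating that could be derived from the counts.

```python
from collections import Counter


class RatingCounters:
    """Per-channel viewer counts, also broken down by a demographic
    or geographic group label (both labels are illustrative)."""

    def __init__(self):
        self.by_channel = Counter()
        self.by_channel_group = Counter()

    def record(self, channel, group):
        """Count one matched viewer on a channel within a group."""
        self.by_channel[channel] += 1
        self.by_channel_group[(channel, group)] += 1

    def share(self, channel):
        """Fraction of all counted viewers who are on the given channel."""
        total = sum(self.by_channel.values())
        return self.by_channel[channel] / total if total else 0.0


counters = RatingCounters()
counters.record("channel-4", "age 18-29")
counters.record("channel-4", "age 30-44")
counters.record("channel-7", "age 18-29")
```

Keeping the plain counter and the crossed counter side by side lets the same stream of matches answer both "what is popular overall" and "what is popular within a group."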

The personalized information is sent to the client-side interface (312). The popularity ratings can also be stored in a database for use by other processes (318), such as the dynamic adjustment of advertisements described above. The personalized information is received at the client-side interface (314), where it is formatted and presented in a user interface (316). The personalized information can be associated with a commenting medium (e.g., text messages in a chat room) that is presented to the user in the user interface. In some implementations, a chat room can include one or more subgroups. For example, a discussion group for "Seinfeld" might include a subgroup called "Seinfeld Experts," or a subgroup can be associated with a particular demographic, such as women between the ages of 20 and 30 who watch "Seinfeld."

In some implementations, the raw information (e.g., counter values) used to generate statistics for popularity ratings is collected and stored at the client-side interface rather than at the social application server. The raw information can be transferred to the broadcaster whenever the user is online and/or invokes a mass personalization application.

In some implementations, a broadcast measurement box (BMB) is installed at the client-side interface. The BMB is a simple hardware device that is similar to a set-top box but does not connect to the broadcast device. Unlike the Nielsen rating system, which requires hardware to be installed in the television, the BMB can be installed near the mass media system or within range of the television signal. In some implementations, the BMB automatically records audio snippets and generates descriptors, which are stored in memory (e.g., flash media). In some implementations, the BMB can optionally include one or more hardware buttons that the user can press to indicate which broadcast they are watching (similar to Nielsen ratings). The BMB device can be picked up by the ratings provider from time to time to collect the stored descriptors, or the BMB can from time to time broadcast the stored descriptors to one or more interested parties over a network connection (e.g., telephone, Internet, or wireless radio, such as a short message service (SMS)).

In some implementations, advertisements can be monitored to determine their effectiveness, which can be reported back to advertisers: for example, which advertisements were watched, which were skipped, and the volume level of the advertisements.

In some implementations, an image capture device (e.g., a digital camera, video recorder, etc.) can be used to measure how many viewers are watching or listening to a broadcast. For example, various known pattern-matching algorithms can be applied to an image or a sequence of images to determine the number of viewers present in a broadcast environment during a particular broadcast. The images and/or data derived from the images can be used in combination with audio descriptors to gather personalized information for a user, to compute popularity ratings, or for other purposes.

Audio fingerprinting process

Fig. 4 is a flow chart of an audio fingerprinting process 400. The steps of process 400 do not have to be completed in any particular order, and at least some steps can be performed at the same time in a multi-threading or parallel processing environment. Process 400 matches query descriptors generated at a client-side interface (e.g., client-side interface 102) against reference descriptors stored in one or more databases, in real time and with low latency. Process 400 adapts the technique proposed by Ke et al. to handle ambient audio data (e.g., from a television broadcast) and queries.

Process 400 begins at the client-side interface by decomposing ambient audio snippets (e.g., 5-6 seconds of audio) of a mass media broadcast, captured by an ambient audio detector (e.g., a microphone), into overlapping frames (402). In some implementations, the frames are spaced several milliseconds apart (e.g., 12 ms apart). Each frame is converted into a descriptor (e.g., a 32-bit descriptor) that is trained to overcome audio noise and distortion (404), as described by Ke et al. In some implementations, each descriptor represents an identifying statistical summary of the audio snippet.
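The decomposition into overlapping frames (402) can be sketched as follows. Only the 12 ms spacing and the 5-second snippet length come from the description; the 8 kHz sample rate and the 32 ms frame length are assumptions chosen for illustration, and the function name is hypothetical.

```python
def frame_spans(num_samples, sample_rate, frame_ms, hop_ms):
    """Return (start, end) sample indices of overlapping frames.

    Frames are frame_ms long and start every hop_ms, so consecutive
    frames overlap whenever frame_ms is larger than hop_ms.
    """
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return [(s, s + frame_len)
            for s in range(0, num_samples - frame_len + 1, hop)]


# 5-second snippet at an assumed 8 kHz, assumed 32 ms frames, 12 ms spacing.
spans = frame_spans(num_samples=5 * 8000, sample_rate=8000, frame_ms=32, hop_ms=12)
print(len(spans))  # 415 under these assumed values
```

Under these assumed values a 5-second snippet yields 415 frames, which agrees with the example frame count given for a 5-second snippet; other sample rates or frame lengths would give slightly different counts.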

In some implementations, the descriptors can be sent as query snippets (also referred to as query descriptors) to an audio database server, where they are matched against a database of reference descriptors that identify statistical summaries of previously recorded audio snippets of mass media broadcasts (406). A list of candidate descriptors having the best matches can be determined (408). The candidate descriptors can be scored so that candidate descriptors that are temporally consistent with the query descriptor score higher than candidate descriptors that are less temporally consistent with the query descriptor (410). The candidate descriptors with the highest scores (e.g., scores exceeding a sufficiently high threshold) are transmitted or otherwise provided to a social application server (412), where they can be used to aggregate personalized information related to the media broadcast. Using a threshold ensures that the descriptors match sufficiently well before they are transmitted or otherwise provided to the social application server (412).

In some implementations, the database of reference descriptors can be generated from broadcasts provided by various media companies, which can be indexed and used to generate the descriptors. In other implementations, reference descriptors can also be generated using television guides or other metadata and/or information embedded in the broadcast signal.

In some implementations, speech recognition technology can be used to help identify which program is being watched. Such technology can help users discuss news events rather than just television shows. For example, a user could be watching a space shuttle launch on a different channel than another viewer and could therefore receive a different audio signal (e.g., because of a different newscaster). Speech recognition technology can be used to recognize keywords (e.g., shuttle, launch, etc.), and these keywords can be used to link the user with a commenting medium.

Hashing descriptors

Ke et al. use computer-vision techniques to find highly discriminative, compact statistics for audio. Their procedure is trained on labeled pairs of positive examples (where x and x' are noisy versions of the same audio) and negative examples (where x and x' are from different audio). During this training phase, machine-learning techniques based on boosting use the labeled pairs to select a combination of 32 filters and thresholds that jointly create a highly discriminative statistic. The filters localize changes in the spectrogram magnitude using first- and second-order differences across time and frequency. One benefit of using these simple difference filters is that they can be computed efficiently using the integral image technique described in Viola, P. and Jones, M. (2002), Robust Real-Time Object Detection, International Journal of Computer Vision, which is incorporated herein by reference in its entirety.
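The efficiency benefit of the integral image technique can be seen in the following minimal sketch. It treats the spectrogram as a list of rows and assumes that columns run along the time axis; the function names and the particular filter shape are illustrative and are not the filters selected by the training procedure described above.

```python
def integral_image(img):
    """Build a summed-area table with one extra row and column of zeros,
    so that ii[r][c] is the sum of img[0:r][0:c]."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1][c0:c1] using four table lookups."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]


def time_difference_filter(ii, r0, c0, height, width):
    """First-order difference along the column (assumed time) axis:
    right half of the box minus the left half."""
    mid = c0 + width // 2
    left = box_sum(ii, r0, c0, r0 + height, mid)
    right = box_sum(ii, r0, mid, r0 + height, c0 + width)
    return right - left


spectrogram = [[1, 2, 3, 4],
               [5, 6, 7, 8]]
ii = integral_image(spectrogram)
response = time_difference_filter(ii, 0, 0, height=2, width=4)
```

Once the table is built, every box sum costs the same four lookups regardless of the box size, which is why difference filters of any extent are cheap to evaluate.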

In some implementations, the outputs of these 32 filters are thresholded, giving a single bit per filter at each audio frame. These 32 threshold results form the only transmitted descriptor of that audio frame. This sparsity in encoding protects the user's privacy against unauthorized eavesdropping. Further, these 32-bit descriptors are robust to the audio distortions in the training data, so that positive examples (e.g., matching frames) have small Hamming distances (i.e., a distance measuring the number of differing bits) and negative examples (e.g., mismatched frames) have large Hamming distances. It should be noted that more or fewer filters can be used, and that more than one bit per filter can be used at each audio frame (e.g., more bits using multiple threshold tests).
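Thresholding the filter outputs into a packed descriptor, and comparing two descriptors by Hamming distance, can be sketched as follows. The filter outputs and thresholds shown are made-up values, and the bit ordering (filter i maps to bit i) is an assumption.

```python
def pack_descriptor(filter_outputs, thresholds):
    """Set bit i when filter i's output exceeds its threshold."""
    if len(filter_outputs) != len(thresholds):
        raise ValueError("one threshold is required per filter")
    descriptor = 0
    for i, (out, thr) in enumerate(zip(filter_outputs, thresholds)):
        if out > thr:
            descriptor |= 1 << i
    return descriptor


def hamming_distance(a, b):
    """Number of bit positions in which two descriptors differ."""
    return bin(a ^ b).count("1")


# Three filters are used here for brevity; the description uses 32.
clean = pack_descriptor([0.5, -1.0, 2.0], [0.0, 0.0, 0.0])
noisy = pack_descriptor([-0.1, -1.0, 2.0], [0.0, 0.0, 0.0])
distance = hamming_distance(clean, noisy)
```

A matching frame that has been slightly distorted flips only a few bits, so its distance to the clean descriptor stays small.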

In some implementations, the 32-bit descriptor itself is used as the hash key for direct hashing. The descriptor is a well-balanced hash function. Retrieval rates can be further improved by querying not only the query descriptor but also a small set of similar descriptors (up to a Hamming distance of 2 from the original query descriptor).
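A minimal sketch of direct hashing with a Hamming-distance-2 neighborhood follows. The dictionary-based `index` and the function names are assumptions made for illustration; a production hash table would differ.

```python
from itertools import combinations


def neighbors_within(descriptor, max_distance=2, n_bits=32):
    """Yield every n_bits-wide key within max_distance bit flips of descriptor.

    For 32 bits and distance 2 this is 1 + 32 + 496 = 529 keys.
    """
    for k in range(max_distance + 1):
        for positions in combinations(range(n_bits), k):
            flipped = descriptor
            for p in positions:
                flipped ^= 1 << p
            yield flipped


def lookup(index, descriptor, max_distance=2):
    """Collect database frame numbers stored under any nearby key.

    index: dict mapping descriptor -> list of database frame numbers
    (a hypothetical stand-in for the audio database).
    """
    hits = []
    for key in neighbors_within(descriptor, max_distance):
        hits.extend(index.get(key, []))
    return hits
```

Probing 529 keys per query frame trades extra lookups for tolerance to a few flipped bits.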

Within-query temporal consistency

Once the query descriptors have been matched against the audio database using the hashing procedure described above, the matches are validated to determine which of the database hits are accurate matches. Otherwise, a candidate descriptor might have many frames that match the query descriptors but with the wrong temporal structure.

In some implementations, validation is achieved by viewing each database hit as support for a match at a specific query-database offset. For example, if the eighth descriptor (q_8) in a 5-second, 415-frame-long "Seinfeld" query snippet q hits the 1008th database descriptor (x_1008), this supports a candidate match between the 5-second query and frames 1001 through 1415 in the audio database. Other matches between q_n and x_{1000+n} (1 ≤ n ≤ 415) would support this same candidate match.
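The offset-support idea can be sketched as a simple vote count. The function names are assumptions, and frame numbers are 1-based to follow the example above.

```python
from collections import Counter


def vote_offsets(hits):
    """Count how many hits support each query-database offset.

    hits: iterable of (query_frame, database_frame) pairs.
    Returns (offset, votes) pairs, most-supported offset first.
    """
    votes = Counter(db_frame - query_frame for query_frame, db_frame in hits)
    return votes.most_common()


def candidate_span(offset, query_length):
    """Database frames covered by a query of query_length frames at offset."""
    return offset + 1, offset + query_length
```

With the hit (8, 1008) the supported offset is 1000, and `candidate_span(1000, 415)` gives frames 1001 through 1415.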

In addition to temporal consistency, we need to account for frames in which conversation temporarily drowns out the ambient audio. This can be modeled as an exclusive switch between ambient audio and interfering sound. For each query frame i, there is a hidden variable y_i: if y_i = 0, the i-th frame of the query is modeled as interference only; if y_i = 1, the i-th frame is modeled as coming from clean ambient audio. Taking this extreme view (pure ambient or pure interference) is justified by the extremely low precision with which each audio frame is represented, and it is softened by providing additional bit-flip probabilities for each of the 32 positions of the frame vector under each of the two hypotheses (y_i = 0 and y_i = 1). Finally, we model the between-frame transitions between the ambient-only and interference-only states as a hidden first-order Markov process, with transition probabilities derived from training data. For example, we can reuse the 66-parameter probability model given by Ke et al., CVPR 2005.

The final model of the match probability between a query vector q and an ambient-database vector x_N at an offset of N frames is:

where ⟨q_n, x_m⟩ denotes the bit differences between the 32-bit frame vectors q_n and x_m. This model incorporates both the temporal consistency constraint and the hidden ambient/interference Markov model.
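One form of the match-probability model that is consistent with the description above is given here as an assumption, since only the definitions of its terms survive in this text. It takes a product over the 415 frames of a 5-second query, with each frame contributing an observation term on the bit differences and a first-order transition term on the hidden state:

```latex
P(q \mid x_N) \;=\; \prod_{n=1}^{415}
  P\bigl(\langle q_n, x_{N+n} \rangle \mid y_n\bigr)\,
  P\bigl(y_n \mid y_{n-1}\bigr)
```

Here the first factor carries the per-bit flip probabilities under the hypothesis y_n, and the second factor carries the ambient/interference transition probabilities.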

Post-match consistency filtering

People often talk with others while watching television, resulting in sporadic but strong acoustic interference, especially when a laptop-based microphone is used to sample the ambient audio. Given that most conversational utterances last two or three seconds, a simple exchange between viewers can render a 5-second query unrecognizable.

In some implementations, post-match filtering is used to handle these intermittent low-confidence mismatches. For example, we can use a continuous-time hidden Markov model of channel switching with an expected dwell time (i.e., the time between channel changes) of L seconds. The social application server 106 records the highest-confidence match within the recent past (along with its "discounted" confidence) as part of the state information associated with each client session. Using this information, the server 106 selects either the content-index match from the recent past or the current index match, whichever has the higher confidence.

We use M_h and C_h to refer to the best match for the previous time step (5 seconds ago) and its log-likelihood confidence score. If we simply apply the Markov model to this previous best match, without taking another observation, then our expectation is that the best match for the current time is the same program sequence, just 5 seconds further along, and our confidence in this expectation is C_h - l/L, where l = 5 seconds is the query time step. The discount of l/L in the log-likelihood corresponds to the Markov model probability e^(-l/L) of not switching channels during a time step of length l.

An alternative hypothesis is generated by the audio match for the current query. We use Mo to refer to the best match for the current audio snippet, that is, the match generated by the audio fingerprinting process 400. Co is the log-likelihood confidence score given by the audio fingerprinting process 400.

If these two matches (the updated historical expectation and the current snippet observation) differ, we select the hypothesis with the higher confidence score:

(M_0, C_0) = (M_h, C_h - l/L) if C_h - l/L > Co; otherwise (M_0, C_0) = (Mo, Co)

where M_0 is the match used by the social application server 106 to select related content, and M_0 and C_0 are carried forward to the next time step as M_h and C_h.
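The selection rule can be sketched as a small function. This is an illustration only: the function name, the default dwell time of 60 seconds, and the choice to prefer the current observation on a tie are assumptions, and confidences are taken to be log-likelihood scores.

```python
def select_match(prev_match, prev_conf, cur_match, cur_conf,
                 step=5.0, dwell=60.0):
    """Choose between the discounted historical match and the current match.

    prev_match, prev_conf: best match and confidence carried from the
        previous time step (prev_match may be None on the first step).
    cur_match, cur_conf: best match and confidence for the current snippet.
    step: query time step l in seconds; dwell: expected dwell time L
        in seconds (60.0 is a placeholder, not a value from the text).
    Returns the (match, confidence) pair to carry to the next step.
    """
    discounted = prev_conf - step / dwell
    if prev_match is not None and discounted > cur_conf:
        return prev_match, discounted
    return cur_match, cur_conf
```

Feeding each returned pair back in as `prev_match` and `prev_conf` lets a strong earlier match survive a few noisy queries while its confidence decays by l/L per step.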

User interface

Fig. 5 is a flow diagram of one embodiment of a user interface 208 for interacting with mass personalization applications. The user interface 208 includes a personalized layer display area 502, a comment medium display area 504, a sponsored links display area 506 and a content display area 508. The personalized layer display area 502 provides complementary information and/or images related to the video content shown in the content display area 508. The personalized layers can be navigated using a navigation bar 510 and an input device (e.g., a mouse or remote control). Each layer has an associated label in the navigation bar 510. For example, if the user selects the "Fashion" label, the fashion layer, which includes fashion-related content associated with "Seinfeld," is presented in the display area 502.

In some implementations, the client-side interface 102 includes a display device 210 capable of presenting the user interface 208. In some implementations, the user interface 208 is an interactive web page served by the social application server 106 and presented in a browser window on the screen of the display device 210. In some implementations, the user interface 208 is persistent and remains available for interaction after the broadcast audio used in the content match process has shifted in time. In some implementations, the user interface 208 is dynamically updated over time or in response to a trigger event (e.g., a new person enters the chat room, a commercial begins, etc.). For example, each time a commercial is broadcast, the sponsored links display area 506 can be updated with fresh links 518 related to the subject matter of the commercial.

In some implementations, the personalized information and sponsored links can be emailed to the viewer or shown on a sidebar at a later time.

In some implementations, the client-side interface 102 receives personalized information from the social application server 106. This information can include web pages, email, a message board, links, instant messages, chat rooms, or an invitation to join an ongoing discussion group, eRoom, video conference or net meeting, voice call (e.g., Skype), etc. In some implementations, the user interface 208 provides access to comments, and/or links to comments, from previously seen broadcasts or movies. For example, if a user is currently watching a DVD of "Shrek," he may want to see what people have said about the movie in the past.

In some implementations, the display area 502 includes a ratings region 512, which is used to display popularity ratings related to a broadcast. For example, the region 512 can show how many viewers are currently watching "Seinfeld" compared with another television show that is broadcast at the same time.

In some implementations, the comment medium display area 504 presents a chat-room-style environment in which multiple users can comment on broadcasts. In some implementations, the display area 504 includes a text box 514 for entering comments, which are sent to the chat room using an input mechanism 516 (e.g., a button).

The sponsored links display area 506 includes information, images and/or links related to advertising that is associated with the broadcast. For example, one of the links 518 may take the user to a website that sells "Seinfeld" merchandise.

The content display area 508 is where the broadcast content is displayed. For example, a scene from the current broadcast can be displayed along with other relevant information (e.g., episode number, title, timestamp, etc.). In some implementations, the display area 508 includes controls 520 (e.g., scroll buttons) for navigating through the displayed content.

Video bookmarks

In some implementations, the content display area includes a button 522 that can be used to bookmark video. For example, by clicking the button 522, the "Seinfeld" episode shown in the display area 508 is added to the user's favorites video library, and it can then be viewed on demand through a web-based streaming application or other access method. According to the policy set by the content owner, this streaming service can provide free single-viewing playback, collect payments as an agent for the content owner, or insert advertisements that provide payment to the content owner.

The hardware architecture of client-side interface

Fig. 6 is a block diagram of the hardware architecture 600 of the client-side interface 102 shown in Fig. 1. Although the hardware architecture 600 is typical of a computing device (e.g., a personal computer), the disclosed implementations can be realized in any device capable of presenting a user interface on a display device, including but not limited to: desktop or portable computers; electronic devices; telephones; mobile phones; display systems; televisions; monitors; navigation systems; portable media players/recorders; personal digital assistants; game systems; handheld electronic devices; and embedded electronic devices or appliances.

In some implementations, the system 600 includes one or more processors 602 (e.g., CPU), optionally one or more display devices 604 (e.g., CRT, LCD, etc.), a microphone interface 606, one or more network interfaces 608 (e.g., USB, Ethernet, FireWire ports, etc.), optionally one or more input devices 610 (e.g., mouse, keyboard, etc.) and one or more computer-readable media 612. Each of these components is operatively coupled to one or more buses 614 (e.g., EISA, PCI, USB, FireWire, NuBus, PDS, etc.).

In some implementations, there are no display devices or input devices, and the system 600 simply performs sampling and encoding (e.g., generating descriptors, etc.) in the background without user input.

The term "computer-readable medium" refers to any medium that participates in providing instructions to the processor 602 for execution, including without limitation: non-volatile media (e.g., optical or magnetic disks), volatile media (e.g., memory) and transmission media. Transmission media include, without limitation, coaxial cables, copper wire and fiber optics. Transmission media can also take the form of acoustic, light or radio frequency waves. The computer-readable medium 612 further includes an operating system 616 (e.g., Mac OS, Windows, Unix, Linux, etc.), a network communication module 618, client software 620 and one or more applications 622. The operating system 616 can be multi-user, multiprocessing, multitasking, multithreading, real-time and the like. The operating system 616 performs basic tasks, including but not limited to: recognizing input from the input devices 610; sending output to the display devices 604; keeping track of files and directories on the storage devices 612; controlling peripheral devices (e.g., disk drives, printers, image capture devices, etc.); and managing traffic on the one or more buses 614.

The network communication module 618 includes various components for establishing and maintaining network connections (e.g., software for implementing communication protocols, such as TCP/IP, HTTP, Ethernet, USB, FireWire, etc.).

The client software 620 provides various software components for implementing the client side of the mass personalization applications and for performing the various client-side functions described with respect to Figs. 1-5 (e.g., ambient audio identification). In some implementations, some or all of the processes performed by the client software 620 can be integrated into the operating system 616. In some implementations, the processes can be at least partially implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in any combination of them.

Other applications 624 can include any other software applications, including but not limited to: word processors, browsers, email, instant messaging, media players, telephony software, etc.

Detecting advertisements and rebroadcasts

Repetition detection

When preparing a database for search, it helps to be able to flag repeated material in advance using the descriptors described previously. Repeated material can include, but is not limited to, repeated shows, advertisements, sub-segments (e.g., stock footage in news shows), etc. Using these flags, repeated material can be presented in a way that does not push all other material beyond the attention span of the user conducting the search (e.g., beyond the first 10-20 hits). The process 700 described below provides a way to detect these duplicates before any search queries are run on the database.

Video advertisement removal

One of the complaints that broadcasters have about allowing material to be searched and played back is the rebroadcast of embedded advertising. From the broadcasters' point of view, this rebroadcast is counterproductive: because it gives the advertiser free advertising, it lowers the value of the broadcasts that the advertiser pays for directly. Unless old advertisements are removed and new advertisements are put in their place, in a way that returns some revenue to the original broadcasters, the broadcasters do not profit from the replay of their previously broadcast material. The process 700 described below provides a way of detecting embedded advertisements by looking for repetitions, possibly in combination with other criteria (e.g., duration, volume, visual activity, bracketing blank frames, etc.).

Video summary

If a "summary" (i.e., a shorter version) of non-repeated program material is needed, one way to obtain it is to remove the advertisements (as detected by repeated material) and to take segments from the material just before and just after each advertisement location. On broadcast television, these positions in the program typically contain "teasers" (before the advertisements) and "recaps" (after the advertisements). If a summary is to be made of a news program that includes a mix of non-repeated material and repeated non-advertisement material, the repeated material typically corresponds to a sound bite. These segments generally give less information than the anchorperson's narration of the news story and are therefore good candidates for removal. If a summary is to be made of a narrative program (e.g., a movie or a serial installment), the repeated audio tracks typically correspond to theme sounds, mood music, or silence. Again, these are typically good segments to remove from a summary video. The process 700 described below provides a way of detecting these repeated audio tracks so that they can be removed from the summary video.

Repetition detection process

Fig. 7 is a flow diagram of one embodiment of a repetition detection process 700. The steps of the process 700 do not have to be completed in any particular order, and at least some steps can be performed at the same time in a multithreading or parallel processing environment.

The process 700 begins by creating a database of audio statistics from a set of content such as television feeds, video uploads, etc. (702). For example, the database can contain 32-bit/frame descriptors, as described by Ke et al. Queries are taken from the database and run against the database to see where repetitions occur (704). In some implementations, a short segment of audio statistics is taken as a query and run against the database using hashing techniques (e.g., direct hashing or locality sensitive hashing (LSH)) to check for non-identity matches (matches that are not identical to the query itself), yielding a short list of possible auditory matches. These candidate matches are then processed in a validation procedure, for example, as described by Ke et al. Content corresponding to a validated candidate match can be identified as repeating content (706).
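Steps 702 through 706 can be sketched as a self-match over a descriptor sequence. This is a deliberately simplified illustration: it uses exact-key direct hashing on single frames, skips the validation procedure, and the function name and the `min_gap` parameter are assumptions.

```python
from collections import defaultdict


def find_repeats(descriptors, min_gap=1):
    """Return sorted (i, j) frame pairs, i < j, with equal descriptors.

    descriptors: sequence of per-frame integer descriptors (the database).
    min_gap: minimum frame separation, so that a frame never matches
        itself (the non-identity condition) or its immediate vicinity.
    """
    index = defaultdict(list)          # descriptor -> frame positions
    for position, descriptor in enumerate(descriptors):
        index[descriptor].append(position)

    pairs = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                if positions[b] - positions[a] >= min_gap:
                    pairs.append((positions[a], positions[b]))
    return sorted(pairs)
```

In practice each returned pair would only be a candidate; the offset-consistency validation decides which candidates are treated as repeating content.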

The strongest non-identity matches are "grown" forwards and backwards in time to find the beginning and ending points of the repeated material (708). In some implementations, this can be done using known dynamic programming techniques (e.g., Viterbi decoding). In extending a match forward in time, the last time slice in the strong "seed" match is set as "matching," and the last time slice of the first below-believable-strength match at the same database offset between the query and the match is set as "not matching." In some implementations, the match scores for the individual frames between these two fixed points are used as observations, and a first-order Markov model is used that allows within-state transitions plus a single transition from the "matching" state to the "not matching" state. The transition probability from matching to not matching can be set, somewhat arbitrarily, to 1/L, where L is the number of frames between the two fixed points, corresponding to the least knowledge of the transition location within the allowed range. Another possibility for selecting the transition probabilities would be to use the match strength profile to bias this estimate toward an earlier or later transition. However, this would increase the complexity of the dynamic programming model and is unlikely to improve the results, since the match strengths are already used as observations within this period. The same process is used to grow the segment matches backwards in time (e.g., simply swap past and future and run the same algorithm).
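For a two-state chain with a single allowed transition, Viterbi decoding reduces to scanning over the position of that transition. The sketch below takes that shortcut. It assumes that per-frame log-likelihoods under each state are already available (the text does not specify how match scores map to likelihoods), and the function name is hypothetical.

```python
import math


def find_match_end(log_match, log_nonmatch):
    """Locate the matching -> not-matching transition between two fixed points.

    log_match[i], log_nonmatch[i]: log-likelihood of the i-th frame's
        match score under the "matching" / "not matching" state (assumed
        inputs). The frame before index 0 is the matching seed; the frame
        after the last index is fixed as not matching.
    Returns t, meaning frames 0..t-1 are matching and frames t.. are not.
    """
    L = len(log_match)
    if L == 0 or L != len(log_nonmatch):
        raise ValueError("need two equal-length, non-empty score lists")

    log_switch = math.log(1.0 / L)      # transition probability 1/L
    log_stay = math.log1p(-1.0 / L) if L > 1 else float("-inf")

    head = 0.0                          # sum of log_match[:t]
    tail = sum(log_nonmatch)            # sum of log_nonmatch[t:]
    best_t, best_score = 0, log_switch + tail
    for t in range(1, L + 1):
        head += log_match[t - 1]
        tail -= log_nonmatch[t - 1]
        score = t * log_stay + log_switch + head + tail
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

Growing the match backwards can reuse the same function on the reversed score lists.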

In some implementations, audio cues are combined with non-auditory information (e.g., visual cues) to obtain higher matching accuracy. For example, the matches found by audio matching can then be verified (or checked a second time) using simple visual similarity metrics (710). These metrics can include, but are not limited to: color histograms (e.g., the frequencies of similar colors in two images), statistics on the number and distribution of edges, etc. These need not be computed only over the entire image; they can also be computed for sub-regions of the image and compared with the corresponding sub-regions in the target image.
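A color-histogram comparison of the kind mentioned above can be sketched with histogram intersection. The specific metric, the coarse 4-level quantization, and the function names are assumptions chosen for brevity; frames are represented as plain lists of RGB tuples.

```python
from collections import Counter


def color_histogram(pixels, levels=4):
    """Normalized histogram of coarsely quantized colors.

    pixels: iterable of (r, g, b) tuples with channel values 0-255.
    levels: quantization levels per channel (4 gives 64 color bins).
    """
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {color_bin: n / total for color_bin, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())
```

Applying the same two functions to cropped pixel lists gives the sub-region comparison described above.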

For applications that look for advertisements (as opposed to all types of repeated material), the results of repeated-material detection can be combined with metrics aimed at distinguishing advertisements from non-advertisements (712). These distinguishing characteristics can rely on advertising conventions, such as duration (e.g., 10/15/30-second spots are common); on volume (e.g., advertisements tend to be louder than the surrounding program material, so if the repeated material is louder than the material on either side, it is more likely to be an advertisement); on visual activity (e.g., advertisements tend to have more rapid transitions between shots and more within-shot motion, so if the repeated material has larger frame differences than the material on either side, it is more likely to be an advertisement); and on bracketing blank frames (locally inserted advertisements typically do not completely fill the slot left for them by the national feed, resulting in black frames and silence at a spacing that is a multiple of 30 seconds).
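The four cues can be combined in many ways; the sketch below simply counts how many of them are present for a repeated segment. Everything about it is an assumption made for illustration: the function name, the inputs, the half-second duration tolerance, and the idea of an unweighted count.

```python
def ad_cue_count(duration_s, loudness, neighbor_loudness,
                 frame_diff, neighbor_frame_diff,
                 blank_before, blank_after, tolerance_s=0.5):
    """Count advertisement cues (0-4) for one repeated segment.

    duration_s: segment length in seconds.
    loudness / frame_diff: average volume and average frame difference
        of the segment; neighbor_* hold the same measures for the
        material on either side, as (before, after).
    blank_before / blank_after: True if blank frames bracket the segment.
    tolerance_s: hypothetical slack when comparing to 10/15/30 seconds.
    """
    cues = 0
    if any(abs(duration_s - d) <= tolerance_s for d in (10, 15, 30)):
        cues += 1                       # conventional spot length
    if all(loudness > n for n in neighbor_loudness):
        cues += 1                       # louder than both sides
    if all(frame_diff > n for n in neighbor_frame_diff):
        cues += 1                       # more visual activity than both sides
    if blank_before and blank_after:
        cues += 1                       # bracketed by blank frames
    return cues
```

A caller might treat a segment with most cues present as a likely advertisement, though where to draw that line is a tuning decision the text leaves open.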

Once advertisements have been identified, the material surrounding the advertisements can be analyzed and statistics can be generated. For example, statistics can be generated about how many times a particular product has been advertised using a particular creative (e.g., images, text), or about how many times a particular segment has aired. In some implementations, one or more old advertisements can be removed or replaced with new advertisements. Additional techniques for advertisement detection and replacement are described in Covell, M., Baluja, S., Fink, M., Advertisement Detection and Replacement Using Acoustic and Visual Repetition, IEEE Signal Processing Society, MMSP 2006 International Workshop on Multimedia Signal Processing, October 3-6, 2006, BC, Canada, which is incorporated by reference herein in its entirety.

In some implementations, information from content owners about the detailed structure of the content (e.g., where advertising material was inserted, where programs were repeated, etc.) can be used to augment the process 700 and increase matching accuracy. In some implementations, video statistics can be used instead of audio to determine repetition. In other implementations, a combination of video and audio statistics can be used.

Audio snippet auctions

In some implementations, advertisers can participate in auctions related to the presence of ambient audio that is relevant to the product or service the advertiser wants to sell. For example, multiple advertisers could bid in an auction for the right to associate their products or services with an audio snippet or descriptor associated with "Seinfeld" ambient audio. The winner of the auction could then place some related information (e.g., sponsored links) in front of the viewer whenever the subject ambient audio is present. In some implementations, advertisers can bid on ambient audio snippets that have a meta-level description. For example, advertisers could bid on audio that is associated with a television advertisement (e.g., the audio associated with a Ford Explorer television advertisement), on closed captioning (e.g., the captioning says "Yankees baseball"), on program segment location (e.g., the audio occurs 15 minutes into "Seinfeld," 3 minutes after the previous commercial break and 1 minute before the next commercial break), or on low-level acoustic or visual properties (e.g., "background music," "conversational voice," "explosive-like," etc.).

In some implementations, one or more mass personalization applications can run in the background while the user performs other tasks, such as browsing another website (e.g., a sponsored link). Material related to a media broadcast (e.g., television content) can participate in the same sponsored link auctions as material related to another content source (e.g., website content). For example, television-related advertisements can be mixed with advertisements that correspond to the content of the current web page.

Various modifications can be made to the disclosed implementations and still be within the scope of the following claims.

Claims

1. A method comprising:

generating a query from a database of audio statistics;

running the query against the database of audio statistics to determine a non-identity match; and

if a non-identity match exists,

identifying the content corresponding to the matched query as repeating content.

2. The method of claim 1, further comprising:

verifying the non-identity match using non-auditory information.

3. The method of claim 1, further comprising:

determining endpoints of the repeating content.

4. The method of claim 3, wherein the endpoints are determined using a dynamic programming technique.

5. The method of claim 1, further comprising:

applying a metric to the repeating content to determine whether the repeating content is an advertisement.

6. The method of claim 5, wherein the metric is from a group of metrics consisting of duration, volume, visual activity and bracketing blank frames.

7. The method of claim 1, wherein the audio statistics are generated from ambient audio snippets of a media broadcast.

8. The method of claim 1, wherein the audio statistics are frame descriptors.

9. The method of claim 1, wherein video statistics are used together with the audio statistics to determine the non-identity match.

10. The method of claim 1, wherein the non-identity match is determined using a hashing technique.

11. A system comprising:

a processor;

a computer-readable medium operatively coupled to the processor and having instructions stored thereon which, when executed by the processor, cause the processor to perform the operations of:

generating a query from a database of audio statistics;

running the query against the database of audio statistics to determine a non-identity match, wherein the audio statistics are generated from content; and

if a non-identity match is found,

12. The system of claim 11, wherein the processor further performs the operation of:

verifying the non-identity match using non-auditory information.

13. The system of claim 11, wherein the processor further performs the operation of:

determining endpoints of the repeating content.

14. The system of claim 13, wherein the endpoints are determined using a dynamic programming technique.

15. The system of claim 11, wherein the processor further performs the operation of:

16. The system of claim 15, wherein the metric is from a group of metrics consisting of duration, volume, visual activity and bracketing blank frames.

17. The system of claim 11, wherein the audio statistics are generated from ambient audio snippets of a media broadcast.

18. The system of claim 11, wherein the audio statistics are frame descriptors.

19. The system of claim 11, wherein video statistics are used together with the audio statistics to determine the non-identity match.

20. The system of claim 11, wherein the non-identity match is determined using a hashing technique.

21. A system comprising:

means for generating a database of audio statistics from content;

means for generating a query from the database of audio statistics;

means for running the query against the database of audio statistics to determine a non-identity match; and

if a non-identity match exists,

means for identifying the content corresponding to the matched query as repeating content.

22. A computer-readable medium having instructions stored thereon which, when executed by a processor, cause the processor to perform the operations of:

generating a query from a database of audio statistics;

running the query against the database of audio statistics to determine a non-identity match; and

if a non-identity match exists,

23. A method comprising:

generating a database of ambient audio statistics associated with a media broadcast;

generating a query from the database;

running the query against the database of audio statistics to determine a non-identity match;

identifying repeating content based on a positive match between the query and the database of audio statistics;

determining endpoints of the repeating content;

identifying content before or after the endpoints of the repeating content;

generating statistics based on the identified content.

24. The method of claim 23, further comprising:

applying at least one metric to the repeating content to determine whether the repeating content is an advertisement.

25. The method of claim 24, wherein the metric is associated with a length of the media broadcast.

26. The method of claim 24, wherein the metric is associated with a volume of the media broadcast.