WO2010064263A1 - Multimedia identifier - Google Patents

Multimedia identifier

Info

Publication number
WO2010064263A1
Authority
WO
WIPO (PCT)
Prior art keywords
descriptors
data
descriptor
scenes
database
Prior art date
Application number
PCT/IS2009/000014
Other languages
English (en)
Inventor
Fridrik Heidar Asmundsson
Herwig Lejsek
Bjorn Thor Jonsson
Kristleifur Dadason
Laurent Amsaleg
Original Assignee
Haskolinn I Reykjavik
Application filed by Haskolinn I Reykjavik
Priority to US13/132,597 (patent US9047373B2)
Priority to DK09801289.1T (patent DK2370918T5)
Priority to EP09801289.1A (patent EP2370918B1)
Publication of WO2010064263A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content

Definitions

  • This invention relates to efficiently performing a close-duplicate search within large collections of data streams, preferably in the context of Multimedia (audio and video files or streams).
  • a fully automatic content-based video identification system builds a reference database of low-level features (called descriptors) extracted from the videos to protect. Then, video streams to check are fed into the system to detect near- or close-duplicate content. Those streams can originate from the web (via a robot), from a TV-broadcast or from a camera installed in front of a multimedia device; a service such as YouTube could even submit uploaded videos to the system.
  • US 2006/083429 and EP 1650683 disclose the DPS 2 CBCR video retrieval system, which uses a local descriptor type that is 20-dimensional, at the lower end of the dimensionality scale.
  • the local descriptors used in DPS 2 are based on the Harris interest point detector and encode both spatial and temporal information around the selected interest point.
  • frames of interest are selected, similar to the detection of interest points in the descriptor extraction process. This is achieved by examining the rate of motion in the source video material at a pixel level, selecting frames where overall motion in the frame is either at its least or at its most, compared to the frames around it.
  • constant time refers to the time required to retrieve the same amount of data from a datastore.
  • Data stores in computers are hierarchically organized (Register - L1 cache - L2 cache - L3 cache - main memory - secondary storage), and one access to such a data store takes a specified constant amount of time and retrieves a constant amount of locally close data.
  • constant time access to a secondary data store may have slightly different access time depending on where the data is located - nonetheless, it is still regarded as being constant.
  • NV-tree refers to the nearest vector tree data structure used in the implementation of preferred embodiments of the present invention, where the nearest vector is the closest matching and most similar data descriptor.
  • WO 2007/141809 discloses the NV-tree data structure incorporated herein as a reference.
  • Collection size refers to the number of descriptors indexed in a database such as the NV-tree.
  • LSH Locality Sensitive Hashing
  • Data store refers to any primary (RAM) or secondary memory such as hard drives, tapes, optical disks, flash-memory drives and any other secondary storage media used in the computer industry.
  • interest point is one of many points in a given image at which the signal around the point fulfils certain criteria.
  • One very popular method to detect interest points in images and video is to find local pixel-level maxima in the amount of visual contrast.
  • local descriptors refers to a method to abstract and encode the signal around a specific interest point of complex data, such as images, video or audio in a numeric vector of high dimensionality. Local descriptors only cover the information around the selected interest point, as opposed to global descriptors, which encode certain features on a whole image or video frame or even just a single video file.
  • the Eff² local descriptor is an improved version of the SIFT local descriptor (US Patent 6,711,293 (March 23, 2004)), which has been shown to be significantly more distinctive than SIFT when searching in very large image collections, helping both retrieval efficiency and effectiveness.
  • the Eff² local descriptors are described in Herwig Lejsek, Friðrik Heiðar Ásmundsson, Björn Þór Jónsson, Laurent Amsaleg: Scalability of local image descriptors: a comparative study. ACM Multimedia 2006: 589-598.
  • All frames according to the present invention are preferably snapshots of the data stream and are preferably taken at fixed time intervals. From a single frame, preferably every single frame, preferably a maximum number of local descriptors (up to 1000) are extracted. Those descriptors are preferably just extracted from the information contained in this single frame and preferably not from the information of the neighbouring frames.
  • the DPS 2 system first examines the rate of motion in the source video material at a pixel level, selecting only frames where overall motion in the frame is either at its least (a minimum) or at its most (a maximum), compared to the frames around it. Points of interest are computed only inside those maximum or minimum frames. For the extraction of the descriptors at and around these interest points, the DPS 2 system also integrates the signal change compared to the neighbouring frames into the descriptor computation.
  • the descriptors of the DPS 2 system are therefore not solely based on the information of one selected frame, but also on the change of the signal within its neighbours.
  • the search time in the present invention is preferably constant, for example when using the "NV-tree" data structure ("Nearest Vector tree" data structure; see definition above), or approximately constant (in case the LSH (Locality Sensitive Hashing) data structure is used), and independent of the collection size.
  • the DPS 2 system builds upon the sorting of descriptors along a Hilbert space-filling curve.
  • the position of the query descriptor on the space-filling curve is determined.
  • as the Hilbert space-filling curve can preserve locality fairly well, the area around this position on the space-filling curve is evaluated. All descriptors residing in this area are compared to the query descriptor by computing the exact distance. If the distance is below a certain threshold, it is considered to be a descriptor match.
  • a method for feeding information of a data into a database.
  • the information of the data may be fed from a data file or a data stream into an existing data base or a new data base may be created.
  • the method according to the present invention preferably comprises at least one of the following steps.
  • the method preferably comprises the step of collecting frames from a data stream and extracting or calculating high dimensional descriptors representing a content within the frame.
  • the data stream is preferably a video stream or an audio stream, wherein a video stream may also comprise audio data.
  • a group of high dimensional descriptors is extracted or calculated from individual frames, preferably from each single frame.
  • a high dimensional descriptor preferably comprises 15 or more dimensions.
  • representative descriptors are selected from a set of consecutive frames, by filtering out redundant descriptors.
  • the selected descriptors are preferably put into sets, where each set of descriptors preferably represents a consecutive range of frames that overall contains the same or very similar content. These sets are preferably regarded as describing a scene.
  • the selected descriptors are preferably labelled in the set(s) with at least an identifier. For instance, only the representative descriptors in a set of consecutive frames are labelled with a corresponding identifier; the "non-representative" or "redundant" descriptors are preferably deleted or not labelled by an identifier.
  • a cluster is preferably formed or defined, on the basis of the sets of descriptors.
  • the cluster may be compared with other clusters created from the same data stream. This provides the preferred advantage that redundant descriptors may be further eliminated. For instance, two clusters which are "similar", e.g., which relate to a similar scene, are combined. As such, a cluster can preferably consist of a single set or a combination of sets of descriptors.
  • the identifiers of the descriptors retained from the previous step are stored on one or more storage areas and/or on one or more storage devices. It is further preferred that the identifiers are organized on the one or more storage areas and/or on the one or more storage devices such that identifiers of spatially closely located descriptors (within the high dimensional space) are preferably or with high likelihood stored within the same area or device or same device partition, which provides further advantages (see discussion below).
  • a method for tagging or identifying a data stream.
  • the method refers to querying an unknown data or data stream to a database of known data streams.
  • an unknown data stream is analysed by applying some of the method steps of the first aspect and a comparison with stored results in an existing database.
  • the method for querying an unknown data stream comprises at least one of the following method steps.
  • the method preferably comprises the step of collecting frames from the unknown data stream and extracting or calculating high dimensional descriptors representing a content within the frame. This is preferably done in a same or similar manner as described with regard to the first aspect.
  • representative descriptors may be selected from a set of consecutive frames, by filtering out redundant descriptors. Then, preferably, at least a set of selected descriptors may be created, where a set of descriptors representing a high percentage of the same consecutive frames, are defined as describing a scene. Again, this is preferably done in a same or similar manner as described with regard to the first aspect.
  • a cluster which comprises at least one descriptor, preferably a plurality of descriptors, is formed or defined.
  • cluster information is compared with the stored information in the database, for instance by querying every single descriptor of a query cluster, or at least a subset of descriptors of the query cluster, within the storage devices of the database, which comprises reference clusters of descriptor identifiers.
  • a next preferred step relates to retrieving all spatially closely related identifiers and preferably aggregating these identifiers depending on their scene origin.
  • in a next preferred step, when an identifier yields a high score, it is determined that the query descriptors are retrieved from a cluster containing similar data. Then the retrieved clusters are preferably evaluated by performing a regression test on the retrieved clusters, wherein a matching data result is declared when the regression signal exceeds a predefined threshold.
  • a computer program or suite of computer programs are provided so arranged, such that when executed on one or more processors the program or suite of programs cause(s) the one or more processors to perform the methods above.
  • Further preferred embodiments of the invention are exemplified by the following items:
  • a method for creating a database of data streams comprising the following steps;
  • identifiers of the descriptors retained from the previous step on one or more storage devices wherein the identifiers are organized on one or more storage devices such that identifiers of spatially closely located descriptors are with high likelihood stored within the same device partition.
  • Multimedia Identifier is a computer system designed to detect and identify close-duplicate (multimedia) data streams in an automated manner.
  • the system is very robust towards strong modifications of the content in the stream, such as compression, frame dropping, down-sampling (as well as brightness and colour-changes, cropping and affine transformations for image and video data).
  • the robustness of the Multimedia Identifier system results from the use of high-dimensional descriptors, which describe local interest points extracted from the frames of audio and/or video data. These descriptors are preferably found in regions with high contrast, so that they will presumably be found again in modified versions of the same stream.
  • the signal around those interest points is then encoded into a descriptor (numerical vector) of a fixed, high dimensionality. Similar regions (in respect to the content) in the data stream will be encoded in numerically similar descriptors; e.g., similar means that the Euclidean distance between them will be relatively small. Alternatively, other distance measures like the Manhattan or Hamming distance measures can be used.
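  • The three distance measures named above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function names are hypothetical, and descriptors are assumed to be plain sequences of numbers of equal length.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of dimensions in which the two descriptors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)
```

  • Whichever measure is chosen, two descriptors count as similar when the measured distance is relatively small.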
  • the descriptors are preferably extracted at a fixed frame rate, preferably of about 0.1-30 frames per second depending on the amount of information in the data stream. As neighbouring frames are likely to contain similar content, they also yield many highly similar descriptors.
  • the degree of spatial similarity also defines the degree of compression (scalability), as quickly changing data streams contain little duplicate information and therefore yield more descriptors. In contrast, consistent data streams whose information changes only slightly over time can be filtered very strongly for redundant data and therefore yield fewer or even very few descriptors.
  • a first filter program may calculate the distances between all pairs of descriptors, preferably inside a given time window, wherein each time window preferably consists of a fixed number of frames. Based on those distances, each descriptor builds up a list of neighbours. A descriptor is thereby defined as a neighbour when its distance is smaller than a pre-defined distance threshold. Furthermore, a score for each descriptor may be calculated based on the number of close neighbours and their distance to the candidate descriptor. All descriptors within the time window are preferably sorted based on this score.
  • the descriptor in the top position of this sorted list is selected into the candidate set of the most representative descriptors for the given time window. All its close neighbours may then be given a penalty to their score, the remaining set of descriptors may be re-sorted and the top descriptor is chosen again. This procedure may continue until the desired number of representatives is reached.
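  • The greedy selection described above can be sketched as follows. The scoring formula (1 / (1 + distance) per neighbour) and the multiplicative penalty are assumptions made for illustration only; the text does not specify them, and `select_representatives` is a hypothetical name.

```python
import math

def select_representatives(descriptors, threshold, count, penalty=0.5):
    """Return indices of up to `count` representative descriptors."""
    n = len(descriptors)
    neighbours = {i: [] for i in range(n)}
    score = [0.0] * n
    # Build neighbour lists from all pairwise distances inside the window.
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(descriptors[i], descriptors[j])
            if dist < threshold:
                neighbours[i].append(j)
                neighbours[j].append(i)
                gain = 1.0 / (1.0 + dist)  # assumed scoring rule
                score[i] += gain
                score[j] += gain
    remaining = set(range(n))
    chosen = []
    while remaining and len(chosen) < count:
        # Pick the top-scoring descriptor; ties go to the lower index.
        best = max(remaining, key=lambda i: (score[i], -i))
        chosen.append(best)
        remaining.discard(best)
        # Penalize close neighbours so the next pick covers other content.
        for j in neighbours[best]:
            score[j] *= penalty
    return chosen
```

  • Taking the maximum over the remaining descriptors in each round has the same effect as re-sorting the list after the penalties are applied.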
  • the selected representative descriptors are preferably grouped into scenes of 30-2000 descriptors and sent to the database, preferably to a distributed NV-tree database, either to be inserted or as part of a query.
  • An advantageous feature of the system according to the present invention is the distributed high-dimensional index system.
  • This database index system preferably comprises the following three main components: - a coordinator node which takes in query requests: a set of descriptors representing a small part of a data stream to be queried for, e.g. a single frame or a scene - typically 30-2000 descriptors.
  • Insertions are preferably performed in a similar way, just that the designated worker inserts or feeds the descriptors into their sub-index instead of returning query results.
  • This form of distribution is a general method and can also be applied to other approximate high-dimensional data-structures handling local descriptor queries, such as Locality Sensitive Hashing (LSH).
  • each worker machine providing the query processing can handle a single hash table, or part of one, from the whole set of hash tables, typically 6-100 hash tables in total.
  • the workers again send the partial results to the aggregator (aggregation node).
  • the results from several consecutive scenes are preferably aggregated.
  • the result lists are evaluated (preferably in sorted time order) and checked whether a correlation exists between the query scenes and the matching result scenes (which have been inserted into the database before); it is preferred to use linear regression for such an evaluation.
  • the decision unit of the Multimedia Identifier according to the present invention automatically decides that the given query data stream matches the result sequence contained in the database.
  • the present invention relates to a system and a method to be executed by one or more processor in a computer program, which comprises preferably a plurality of the following phases (some of the phases may be optional depending on the specific embodiment of the present invention): 1.
  • Those descriptors are vectors of high dimensionality (15 or more dimensions) that describe part of the content within the frame.
  • descriptors which are retained by the first filter step in 2 are then grouped according to the frames they represent. In case several descriptors represent a high percentage of the same consecutive frames they can be regarded as describing a scene.
  • a scene can be defined as a range of consecutive frames, where the information inside the stream is static or changes just very little. All descriptors in the same scene are preferably labelled with the same identifier.
  • the temporal locality of each scene (begin and end) is preferably stored in a separate lookup file on a storage device.
  • the descriptor set representing a scene is preferably compared to a (small) set of previous examined scenes. In case a strong correlation between two scenes can be detected, those scenes are preferably merged or linked together and redundant descriptors may be further eliminated. 5.
  • the phases 1 to 5 are preferably used in order to extract information of the data stream and feed or insert the extracted information into a database or to create a database. Moreover, the above method steps are also preferably used for the identification of an unknown data stream. Moreover, for the identification also phases 6 and/or 7 are preferably used.
  • a method of querying these storage devices with a set of descriptors preferably from the same scene/frame, retrieving all spatially closely related identifiers and aggregating these identifiers on a scene basis.
  • In case one scene identifier yields an above-expectancy number of hits, it can be concluded that the query descriptors were retrieved from a scene containing close-duplicate content.
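  • A minimal sketch of aggregating retrieved identifiers on a scene basis follows. The mapping `scene_of` (descriptor identifier to scene identifier) and the fixed `expected_hits` cut-off are hypothetical names and assumptions; the text only states that a scene with an above-expectancy number of hits indicates close-duplicate content.

```python
from collections import Counter

def aggregate_hits(retrieved_ids, scene_of, expected_hits):
    """Count hits per scene and keep scenes exceeding the expected count."""
    hits = Counter(scene_of[i] for i in retrieved_ids)
    return {scene: n for scene, n in hits.items() if n > expected_hits}
```

  • For example, if identifiers 1, 2 and 3 belong to scene "A" and identifier 4 to scene "B", retrieving all four with `expected_hits=2` reports only scene "A".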
  • the method for searching through data volumes preferably comprises a plurality of the following method steps (again, some of the method steps may be optional depending on the specific embodiment according to the present invention): a) receiving an external task request, and b) coordinating the distribution of tasks by means of a coordinator device, where the external task request can be either a data query request or a data insertion request. Then c) task requests from the coordinator device are received by one or more worker devices, followed by d) processing of the task requests by the one or more worker devices, where the task can be either an insertion of data descriptors into a high-dimensional data structure, a query of a single element within the data structure, or a query of a sequence of elements within the data structure.
  • then e) task requests from the one or more worker devices are received by an aggregation device, where the task contains query result information from each of the worker devices, followed by f) sorting the query results from each of the workers, g) aggregating the results and h) generating a detection report from the results.
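  • The coordinator, worker and aggregation roles in steps a) to h) can be sketched in a single process as follows. Everything here is illustrative: the class names, the linear scan inside each worker (standing in for an NV-tree or LSH sub-index) and the routing rule based on the first coordinate are assumptions, not the patent's design. A real routing rule would have to keep spatially close descriptors on the same worker more reliably than this one does.

```python
import math
from collections import Counter

class Worker:
    """Holds one sub-index; a linear scan stands in for the real index."""
    def __init__(self):
        self.items = []  # list of (descriptor, scene_id)

    def insert(self, descriptor, scene_id):
        self.items.append((descriptor, scene_id))

    def query(self, descriptor, radius):
        return [sid for vec, sid in self.items
                if math.dist(vec, descriptor) < radius]

class Coordinator:
    """Receives insert/query requests and routes them to workers."""
    def __init__(self, n_workers):
        self.workers = [Worker() for _ in range(n_workers)]

    def route(self, descriptor):
        # Assumed routing rule: bucket by the floor of the first coordinate.
        return self.workers[math.floor(descriptor[0]) % len(self.workers)]

    def insert(self, descriptor, scene_id):
        self.route(descriptor).insert(descriptor, scene_id)

    def query(self, descriptors, radius):
        partial = [self.route(d).query(d, radius) for d in descriptors]
        return aggregate(partial)

def aggregate(partial_results):
    """Merge per-descriptor results into a list sorted by vote count."""
    votes = Counter()
    for result in partial_results:
        votes.update(result)
    return votes.most_common()
```

  • The sorted list returned by `aggregate` plays the role of the detection report: scenes with the most matching descriptors come first.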
  • the embodiment is characterized in that the data is a file or data stream which can be described by a set of numerical vectors in high-dimensional space and the data structure residing on each worker device is a high-dimensional data structure.
  • the aggregation device aggregates the results first on a data descriptor basis and secondly on a data descriptor group basis.
  • a computer readable data storage medium for storing the computer program or at least one of the suites of computer programs of any of embodiments above.
  • Fig. 1 shows a flowchart of a system according to the present invention
  • Fig. 2 shows a flowchart of the methods used in the system of the present invention
  • Fig. 3 shows a flowchart of a first filtering method according to the present invention
  • Fig. 4 shows a flowchart of a second filtering method according to the present invention
  • Fig. 5 shows a flowchart of a further filtering method with a 2-phase-filtering process according to the present invention
  • Fig. 6 shows a flowchart of an aggregation method according to the present invention
  • Fig. 7 shows an example for calculating and evaluating the decision coefficient of a regression window.
  • FIG. 1 shows an embodiment of the present invention in form of a system for searching through data volumes.
  • the system comprises a coordinator device 2 for receiving external task requests and coordinating the distribution of tasks, where the external task request can be either a data query request 1a or a data insertion request 1b.
  • one or more worker devices 3a-n for receiving and processing task requests from the coordinator device 2, where the task can be either an insertion of data into a searchable data structure, a query of single element within the data structure, or a query of sequence of elements within the data structure.
  • a result aggregation device 4 for receiving task requests from the one or more worker devices 3a-n, where the task contains query result information from each of the worker devices.
  • the result aggregation device sorts the query results from each of the workers, aggregates the results, and generates a detection report 5 from the results.
  • the embodiment is characterized in that the data is a file or data stream, which can be described by a set of numerical vectors in high-dimensional space.
  • the data structure residing on each worker device is a high-dimensional data structure, where the aggregation device aggregates the results first on a data descriptor basis, and secondly on a data descriptor group basis.
  • the devices are computer processor enabled devices capable of executing program instructions.
  • Figure 2 outlines the methods used in a preferred system according to the present invention.
  • the method for extracting data descriptors for insertion/query of data preferably comprises the steps of: a) receiving data from a data file 6a or a data stream 6b, b) disassembling the data 7 into chunks of data 8, c) extracting data descriptors 9 from the chunks of data 8, d) selecting interest points 10 within the chunk of data 8, and e) encoding 11 the interest points as a descriptor (numerical vector).
  • the method for inserting encoded data descriptors into a high-dimensional data structure preferably comprises the following steps: a) receiving encoded data descriptors 11, b) determining which worker device 3a-3n the data descriptor belongs to and transmitting the data descriptors to that worker device, and c) inserting the data descriptor into the high-dimensional data structure residing at the worker device.
  • the method for querying a high-dimensional data structure using an encoded data descriptor preferably comprises the following steps: a) receiving encoded data descriptors 11, b) determining which worker device 3a-3n the data descriptor belongs to and transmitting the data descriptors to that worker device, and c) querying the high-dimensional data structure residing at the worker device with the data descriptor.
  • Figure 3 shows the first filtering step of the Multimedia Identifier, starting from a set of high-dimensional descriptors extracted from j consecutive frames 101.
  • the process starts by inserting all descriptors from 101 into a set of LSH hash tables 102.
  • the LSH hash tables group those descriptors together that are likely to be close in distance.
  • the system calculates the exact distance to all descriptors that landed in the same LSH-hash buckets 104. In case this exact distance falls below a fixed distance threshold r, the descriptors are linked together via their individual neighbour lists and both of their priorities are increased relative to the inverse of their distance 105.
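  • The bucket-then-verify step can be sketched as follows. The hash family (random projections quantized into cells of width `width`), the number of tables and the priority bonus of 1/distance are assumptions chosen for illustration; the text only states that LSH groups likely-close descriptors and that priorities grow with the inverse of the exact distance. All names are hypothetical.

```python
import math
import random
from collections import defaultdict

def make_hash(dim, n_proj, width, seed):
    """Build one LSH function from random projections (assumed family)."""
    rng = random.Random(seed)
    planes = [([rng.gauss(0.0, 1.0) for _ in range(dim)],
               rng.uniform(0.0, width)) for _ in range(n_proj)]

    def h(v):
        return tuple(math.floor((sum(a * x for a, x in zip(p, v)) + b) / width)
                     for p, b in planes)
    return h

def link_neighbours(descriptors, r, n_tables=4, n_proj=2, width=4.0, seed=0):
    """Link descriptors closer than r that share at least one LSH bucket."""
    dim = len(descriptors[0])
    tables = [make_hash(dim, n_proj, width, seed + t) for t in range(n_tables)]
    neighbours = defaultdict(set)
    priority = defaultdict(float)
    for h in tables:
        buckets = defaultdict(list)
        for idx, v in enumerate(descriptors):
            buckets[h(v)].append(idx)
        for members in buckets.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    i, j = members[a], members[b]
                    if j in neighbours[i]:
                        continue  # already linked via another table
                    dist = math.dist(descriptors[i], descriptors[j])
                    if dist < r:
                        neighbours[i].add(j)
                        neighbours[j].add(i)
                        bonus = 1.0 / max(dist, 1e-9)  # inverse distance
                        priority[i] += bonus
                        priority[j] += bonus
    return neighbours, priority
```

  • Only descriptors that collide in a bucket are compared exactly, which avoids computing all pairwise distances; the price is that some true neighbours may be missed.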
  • the descriptors of the k central frames are inserted into a priority queue p 106. As long as this queue is neither empty, nor the result set has exceeded a maximum size s 107, the descriptor d with the highest priority is removed from p and added to the result set 108.
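  • Draining the priority queue up to a maximum result size can be sketched with Python's `heapq` module. The function name and the dictionary input are hypothetical; priorities are negated because `heapq` implements a min-heap.

```python
import heapq

def drain_queue(priorities, max_size):
    """Pop descriptors in descending priority until max_size is reached."""
    heap = [(-p, d) for d, p in priorities.items()]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < max_size:
        _, d = heapq.heappop(heap)
        result.append(d)
    return result
```

  • For instance, `drain_queue({"a": 0.2, "b": 0.9, "c": 0.5}, 2)` keeps the two highest-priority descriptors, "b" and "c".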
  • Figure 4 shows the second filtering step of the Multimedia Identifier from a set of s high-dimensional descriptors (from the first filtering step) 201 sorted by priority within a queue q. From those s descriptors a maximum of r descriptors shall be retained after this second filtering step. Therefore a counter descsLeft is established and initialized with maxDescs 202. As long as this counter is larger than 0 and q is not empty 203, the descriptor d with the highest priority is removed from q 204. This descriptor d is checked against a second set of LSH hash tables (not the same as in 102/104) for potential neighbours in previous filtering rounds (see 309/310) and variable dF (distinct factor) is set to 1.0 205.
  • For each potential neighbour n_i found within the LSH hash table buckets 206, the distance between d and n_i is calculated. In case this distance is larger than a minimum threshold g 207, n_i is not regarded any further. Otherwise all direct neighbours e_j of d (identified in 105) are checked against n_i and an upper distance threshold between e_j and n_i is calculated (via the triangle inequality). In case this distance threshold is smaller than g, e_j is also added to the direct neighbour list of n_i 208. These added neighbours are later used to identify scene cuts in the media stream.
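  • The triangle inequality gives the upper bound dist(e, n) <= dist(e, d) + dist(d, n) without computing dist(e, n) directly. A minimal sketch, with hypothetical names and Euclidean distance assumed:

```python
import math

def propagate_neighbours(d, n, direct_neighbours, g):
    """Return those neighbours e of d whose bound to n stays below g."""
    d_to_n = math.dist(d, n)
    if d_to_n > g:
        return []  # n is too far from d to be considered at all
    # Upper bound on dist(e, n) via the triangle inequality.
    return [e for e in direct_neighbours
            if math.dist(d, e) + d_to_n < g]
```

  • The bound is conservative: a neighbour that passes is guaranteed to lie within g of n, while a neighbour that fails might still be close.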
  • the final distinct factor dF is evaluated.
  • descsLeft is decreased by 1 and descriptor d is added into the LSH hash tables and to the resultSet 216.
  • descsLeft is decreased by (1.0 - dF) 213 and - in case there was not a nearly identical descriptor of d already in the LSH hash table 214 - d is reinserted into priority queue q, however with priority decreased by factor dF 215.
  • Figure 5 shows a 2-phase filtering process which is usually repeated over a whole stream of consecutive frame windows, thereby shifting the evaluation window of j consecutive frames by k frames to the right and restarting the filter process.
  • Fig. 5 a showing the filtering process over a whole data stream 301 of frames and 302 the selection of j consecutive frames within this stream (basis of the first filter 101).
  • a second time-window of k frames 303 (as used in 103) is determined.
  • in a next step, the two windows in the data stream are moved by k frames 304 and all descriptors of the new central window 305 are also run against the first filter 308. As the result set 310 is no longer empty, the descriptors surpassing the first filtering step are also run against the second filter 309 and the resulting descriptors are added to the result set 310.
  • the frame windows are again shifted 306 and the descriptors within the center window 307 are run against filter 1 and 2 before being added to the result set 310.
  • This loop repeats until the data stream ends or the result set 310 exceeds a predefined number of descriptors.
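  • The shifting of the two windows can be sketched as a generator. Placing the central window of k frames in the middle of the j-frame evaluation window is an assumption; the function name is hypothetical.

```python
def frame_windows(n_frames, j, k):
    """Yield (evaluation_window, central_window) frame ranges, shifted by k."""
    offset = (j - k) // 2  # assumed: central window is centred
    for start in range(0, n_frames - j + 1, k):
        window = range(start, start + j)
        central = range(start + offset, start + offset + k)
        yield window, central
```

  • Because the shift equals the central window length, consecutive central windows tile the stream without gaps or overlap, while each evaluation window still sees context on both sides.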
  • the result set is split up into smaller sets of descriptors (311-314), each set containing between 30 and 2000 descriptors.
  • This splitting procedure is designed to be optimal in the sense that, in total, the sets span a minimum number of frames from which their descriptors are extracted or to which they have neighbours. Therefore the data stream is first cut into scenes 315. The scene borders are placed where the neighbourhood of descriptors between frames reaches a minimum (using the neighbour lists created in 105 and 208).
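  • One way to locate such borders is to look for local minima in the number of neighbour links crossing each frame boundary. The input `cross_links` (one count per boundary) and the local-minimum rule are assumptions for illustration; the text does not specify how the minimum is detected.

```python
def scene_borders(cross_links):
    """Return boundary indices where the neighbour-link count is a local minimum."""
    borders = []
    for i in range(1, len(cross_links) - 1):
        # Strictly below the left value so a flat run yields one border only.
        if cross_links[i] < cross_links[i - 1] and cross_links[i] <= cross_links[i + 1]:
            borders.append(i)
    return borders
```

  • With counts `[9, 8, 1, 7, 9, 2, 2, 8]`, the sketch reports borders at indices 2 and 5.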
  • the descriptors of the result set are assigned to one or more scenes.
  • in case a minimum threshold of descriptors is assigned to more than a single scene, those scenes are merged together (subject to the maximum threshold for descriptor buckets; otherwise they are just linked together). Most often this merging leads to larger continuous scenes, e.g. 311; sometimes, however, scenes are also split (see especially 313, which separates the otherwise continuous scenes 312 and 314).
  • Scenes such as 313 are recognized as representing highly similar content, such as refrains in songs or an interview within a TV show.
  • Each bucket of descriptors is finally handed together with the scene information to the database in order to query or insert the descriptors.
  • Figure 6 describes how to aggregate a stream of result lists, each list entry containing a scene identifier sceneID and an associated weight 401 and determine a matching signal.
  • the probability prob is calculated and the index variable i is set to 0 402.
  • the probability prob expresses that a certain descriptor does not yield a match when querying a random single descriptor in the database.
  • the first result list is withdrawn from the stream and the index variable i is increased by 1 404.
  • a binomial distribution bin is initialized with the number of query descriptors and the probability as parameter 404.
  • the result list is evaluated for potential signals.
  • the first scene identifier - weight pair is selected from the list (already sorted by weight) 405 and the inverse cumulative distribution function icdf of the weight within the binomial distribution bin is calculated 406.
  • a very small minimum threshold (e.g. 1/1000000) is selected and the icdf is compared to it 407. Only when icdf falls below this threshold is the pair considered for the regression process.
  • Each scene identifier - weight pair surpassing this filter must then be inserted into at least one regression window; therefore a flag yet_inserted is initialized with false 408. Then the pair is checked against all existing regression windows 409.
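  • The threshold test can be sketched as a binomial tail probability. This reading is an assumption: the "icdf of the weight" is interpreted here as the probability of observing at least `weight` chance hits among the query descriptors, with `p_chance` the per-descriptor chance-hit probability and `weight` an integer hit count. The function names are hypothetical.

```python
from math import comb

def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def is_signal(n_query, p_chance, weight, threshold=1e-6):
    """True when `weight` hits are too unlikely to be explained by chance."""
    return binomial_tail(n_query, p_chance, weight) < threshold
```

  • With 100 query descriptors and a chance-hit probability of 0.001, four hits on the same scene do not pass the 1/1000000 threshold under this sketch, whereas five hits do.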
  • Each regression window rw is assigned to a scene range begin.. end defined by its representative.
  • The result list identifier i is added to the scene identifier - weight pair and this triple <i, sceneID, weight> is inserted into the regression window rw 411. Furthermore, the begin-end boundaries of rw are updated according to the newly inserted triple, the decision coefficient dc (example in 606 and 607) is recalculated and the flag yet_inserted is set to true. In case the scene identifier - weight pair does not fit any existing regression window 412, a new regression window is created from this pair 413.
  • the regression windows are sorted based on their decision coefficient 414. If one of the decision coefficients is larger than 1.0 415 the scenes defined by this regression window (begin up to end) are declared as matching the query stream 416. In case no decision coefficient is larger than 1.0 the next result list is drawn from the stream, or - in case the maximum threshold maxResult has been reached - the process is stopped and the query stream is declared to have no match in the database 417.
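The filtering and window-assignment steps of Figure 6 can be sketched roughly as follows. This is only an illustrative sketch under several assumptions: the "icdf" is read here as the upper-tail probability of a binomial distribution, `p` is taken to be the chance that a single random descriptor votes for a given scene, and the rule for when a pair "fits" a window (a hypothetical `max_gap` around the window's scene range) is not specified in the text. All function and parameter names are hypothetical.

```python
from math import comb

def binomial_upper_tail(k, m, p):
    """P(X >= k) for X ~ Binomial(m, p); assumed reading of the 'icdf' value."""
    k = max(k, 0)
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(k, m + 1))

def significant_pairs(result_list, m, p, threshold=1e-6):
    """Keep (scene_id, weight) pairs whose weight is very unlikely by chance."""
    return [(s, w) for s, w in result_list
            if binomial_upper_tail(w, m, p) < threshold]

def insert_into_windows(windows, triple, max_gap=5):
    """Insert an (i, scene_id, weight) triple into every fitting window.

    A pair is assumed to fit when its scene lies within max_gap of the
    window's begin..end range; if nothing fits, a new window is created.
    """
    _, scene_id, _ = triple
    yet_inserted = False
    for rw in windows:
        if rw["begin"] - max_gap <= scene_id <= rw["end"] + max_gap:
            rw["triples"].append(triple)
            rw["begin"] = min(rw["begin"], scene_id)
            rw["end"] = max(rw["end"], scene_id)
            yet_inserted = True
    if not yet_inserted:
        windows.append({"begin": scene_id, "end": scene_id, "triples": [triple]})
    return windows
```

With 100 query descriptors and `p = 0.001`, a scene receiving 12 votes passes the filter while a scene receiving 2 votes does not.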
  • Figure 7 gives an example for calculating and evaluating the decision coefficient of a regression window.
  • the individual queries are drawn 601
  • the actual scenes in the database are specified 602 (comment: The scale of the horizontal axis is monotonic but not accurate).
  • 18 points are drawn displaying the scene identifier - weight pairs that have fallen below the probability threshold t 407. All those points have been inserted into regression windows as outlined in 409-413. This yielded two big regression windows (603 has 10 entries, 604 has 5 entries) and three small ones that consist of just a single <queryID, sceneID, weight> triple 605. For all those regression windows a decision coefficient dc is calculated.
  • In order to yield a meaningful correlation coefficient, a regression window needs to contain a minimum number of 4-10 triples and span a minimum interval of 3-5 units on the horizontal as well as on the vertical axis. If this is not the case, the decision coefficient is set to 0. All three regression windows 605 fail this requirement and can therefore be disregarded. Regression window 1 shown in 603 is, however, very strongly correlated. In 606 the expectation values and variances along the two axes are calculated. Together with the covariance they are needed for calculating the correlation coefficient p, which at 0.9667 is close to the maximum of 1.0. Multiplying the correlation coefficient by the cumulative weight n yields, after evaluating 8 queries (result lists), a decision coefficient higher than 1.0 for the first time.
  • the query can be stopped therefore and the query stream is declared matching the scenes 778-784 in the database.
  • Regression window 2 in 604 has only 5 triples, so it is suspected to fall below the threshold.
  • Calculating the correlation coefficient p 607 confirms this, as it yields just 0.37.
  • The decision coefficient dc could already be dismissed based on this very low p, but even if it is calculated it yields only 0.1032, which is far below the matching threshold of 1.0.
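The calculation in Figure 7 can be illustrated with a small sketch. It assumes an unweighted Pearson correlation between the query index and the scene identifier, weights expressed as fractions, and a "cumulative weight" equal to the sum of the weights in the window; the text does not spell out these details, so they are labelled assumptions, as are the function name and the default minimums.

```python
from math import sqrt

def decision_coefficient(triples, min_triples=4, min_span=3):
    """Return rho * cumulative weight for (query_id, scene_id, weight) triples.

    Windows that are too small or too narrow get a coefficient of 0,
    mirroring the minimum-size rule described above.
    """
    if len(triples) < min_triples:
        return 0.0
    xs = [float(q) for q, _, _ in triples]
    ys = [float(s) for _, s, _ in triples]
    if max(xs) - min(xs) < min_span or max(ys) - min(ys) < min_span:
        return 0.0
    n = len(triples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    rho = cov / sqrt(var_x * var_y)
    # Assumption: the cumulative weight is the plain sum of the weights.
    return rho * sum(w for _, _, w in triples)
```

A window whose scene identifiers advance in step with the query index has a correlation of 1.0, so its decision coefficient equals its cumulative weight; a single-triple window always yields 0.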
  • the descriptor extraction process takes in a video file or, alternatively, a video stream and splits it into a sequence of still images (so-called frames). For each second of video material a fixed number of those frames is chosen and sent towards the descriptor extraction process.
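As a rough illustration of choosing a fixed number of frames per second, the sketch below picks evenly spaced frame indices. Even spacing is an assumption (the text only says a fixed number is chosen), and the function and parameter names are hypothetical.

```python
def sample_frame_indices(total_frames, fps, per_second):
    """Pick per_second evenly spaced frame indices from every second of video."""
    step = fps / per_second  # distance between chosen frames, in frames
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    # A set removes duplicates when per_second exceeds the frame rate.
    return sorted(set(indices))
```

For a two-second clip at 25 frames per second with 5 frames chosen per second, this selects every fifth frame.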
  • the descriptor extractor identifies many stable interest points in the frame and encodes the signal around those points in a numerical vector of high dimensionality (descriptor).
  • the descriptors are processed in a time dependent manner.
  • A time window of 0.25-5 seconds defines the actual position in the filtering step. All descriptors extracted from frames belonging to this window are inserted into a Locality Sensitive Hashing data structure (LSH) in order to identify clusters of close-duplicate descriptors.
  • a sub-window of size a is created around the window centre. All descriptors belonging to frames inside this sub-window are now queried for their neighbours within a radius r and each such neighbour adds to the total score of the descriptor, depending on its distance from the original descriptor.
  • All descriptors selected in the first round are also inserted in a second, separated set of LSH hash tables.
  • the descriptor is filtered out (removed), replaced with a link to its close-duplicate and the counter c is updated with a distance dependent value.
  • the linkage between the descriptors in the second filtering step is used to identify scenes of similar content.
  • The number of descriptors can be reduced significantly. By keeping track of the frames in which close-duplicate descriptors appear and vanish, we can identify stable scenes (scenes without much change in the video) and transitioning scenes. The descriptors are then assigned to one or more of such scenes based on this information. If several descriptors fall within the same group of scenes, those scenes can be regarded as visually similar, and based on this information links between those scenes can be created, even if those scenes are non-adjacent. Those links can again be used for further compression, preferably for videos that should be inserted into the database, as the time window for filtering might be significantly large, in some cases even covering the whole video.
  • The descriptors previously grouped into scenes are queried for or inserted into the high-dimensional index.
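The neighbour scoring used in the first filtering round can be sketched as below. This sketch replaces the LSH lookup with a brute-force scan for clarity, and the linear scoring rule (`1 - distance / r`) is only an assumed example of a distance-dependent contribution; the names are hypothetical.

```python
from math import dist

def neighbour_scores(window, sub_window, r):
    """Score each sub-window descriptor by its neighbours within radius r.

    window and sub_window are lists of equal-length coordinate tuples;
    sub_window entries are expected to be the same objects as in window.
    """
    scores = []
    for query in sub_window:
        score = 0.0
        for other in window:
            if other is query:
                continue  # a descriptor does not vote for itself
            distance = dist(query, other)
            if distance <= r:
                # Assumed rule: closer neighbours contribute more.
                score += 1.0 - distance / r
        scores.append(score)
    return scores
```

Descriptors with a high score sit in dense clusters of close duplicates, which is the property the filtering step relies on.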
  • a set of NV-trees (alternatively LSH-hash tables) is used for this purpose.
  • For each individual descriptor up to 1000 neighbouring descriptors are retrieved. Each retrieved neighbour yields a vote for the scene it belongs to.
  • The result sets of all descriptors from the same query scene are then aggregated on a votes-for-scene basis. This aggregated list is then sorted in descending order, so that the scene which got the most votes is at the top of the list. These ranked result lists are then checked for one or more strong signals.
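The votes-for-scene aggregation can be sketched in a few lines. The mapping `scene_of` from descriptor identifier to scene identifier and the function name are hypothetical; the index lookup that produces the neighbour lists is outside the scope of this sketch.

```python
from collections import Counter

def aggregate_votes(neighbour_lists, scene_of, max_neighbours=1000):
    """Aggregate per-descriptor neighbour lists into a ranked scene list.

    neighbour_lists holds one list of neighbouring descriptor ids per
    query descriptor; every retrieved neighbour votes for its scene.
    """
    votes = Counter()
    for neighbours in neighbour_lists:
        for descriptor_id in neighbours[:max_neighbours]:
            votes[scene_of[descriptor_id]] += 1
    # most_common() returns (scene, votes) pairs in descending vote order.
    return votes.most_common()
```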
  • $P(\min(X_1, \dots, X_n) > x) = (1 - \mathrm{Bin}_{m,p}(x))^n$, where m is the number of query descriptors in a group, n is the number of descriptor groups inside the NV-tree database, p is the number of nearest neighbours retrieved per descriptor divided by n, and x is the accepted error probability rate (rate of potential false positives).
  • The candidates that pass this signal detector can be compared in detail to the frame sequence that the query descriptor groups represent.
  • One very fast - but still effective - method is to compare the relative locations of the matching points between a representative query frame and the points represented in the NV-tree database.
  • the remaining signals are passed on to further aggregation.
  • the descriptor extraction workload can be distributed onto multiple CPUs - or specialized hardware such as a graphics processor - to reach and exceed real-time performance.
  • the parallelization becomes more complex, especially because the database index structure can undergo significant changes over time due to the insertion or deletion of entries.
  • Extremely large collections of high-dimensional data (> 1 billion descriptors) require a significantly large amount of RAM (> 16 GB) in order to provide high query and insertion throughput.
  • the coordinator node functions as a server program which waits for incoming queries or insertions. Typical requests consist of a set of 30 - 2000 local descriptors. Once such a query or insertion has been received, the server starts processing the individual descriptors of that specific operation while it continues listening for new requests.
  • The server takes the descriptors one by one and traverses the first part of the index structure. For an NV-tree, the upper hierarchies of the tree are traversed until a node is reached which contains the address of the assigned worker node; for LSH, the first i digits of the hash key are computed, which are then used to look up the responsible worker's address in a table.
  • the assigned worker unit is then sent a packet containing the descriptor information so that the processing can continue at this node.
  • the coordinator sends the descriptor information to one specific worker for each of those index structures. Once this is done, the next descriptor is taken from the query set until all descriptors have been processed and the thread terminates.
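For the LSH variant, the coordinator's routing step can be sketched as follows. The sketch assumes hash keys are already available as digit strings and that each index structure has its own prefix-to-worker table; the names, the table layout and the default prefix length are hypothetical, and network transport is omitted.

```python
from collections import defaultdict

def dispatch(descriptors, worker_tables, prefix_len=2):
    """Group descriptors into per-worker packets, one per index structure.

    descriptors: list of (descriptor_id, {index_name: hash_key}) pairs.
    worker_tables: {index_name: {key_prefix: worker_address}}.
    """
    packets = defaultdict(list)
    for descriptor_id, keys in descriptors:
        for index_name, hash_key in keys.items():
            # The first digits of the hash key select the responsible worker.
            worker = worker_tables[index_name][hash_key[:prefix_len]]
            packets[worker].append((index_name, descriptor_id))
    return dict(packets)
```

Each descriptor therefore produces one packet per index structure, which matches the description that a single descriptor's results can arrive from several worker machines.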
  • the Worker units wait for incoming requests from the coordinator. Once a packet is received, the descriptor is sent towards the lower part of the index structure (a sub NV-tree or a subset of all hash table buckets for LSH) residing on the worker machine and a ranked result list of close descriptor identifiers is retrieved from the final leaf node or hash bucket. The ranked identifier list is then transmitted to the result aggregation unit. The worker now resumes waiting.
  • the result aggregation unit waits for incoming result packets from the workers.
  • result sets from a single descriptor can come from several worker machines (dependent on the number of different index structures)
  • The result aggregation unit sorts all incoming packets and aggregates the results, first on a single descriptor basis and afterwards on a scene basis.
  • A list of scene identifiers and their weights is returned. This weighted list is sorted and run through a signal detector as specified in (3), and the results are sent to the scene match aggregator, which looks at the results of consecutive scenes (frames).
  • 5. Scene match aggregator:
  • The output of the result aggregation unit in (4) is transmitted to the scene match aggregator. First it sorts the scene results based on the time line, and then it starts to perform a correlation analysis. It checks whether there is a strong statistical correlation between the consecutive query scenes (frames) and the matches of those queries in the database. As the consecutive query scenes can be associated with certain time intervals in the query stream and the result scenes can also be associated with time intervals of videos inserted before, a regression profile can be easily calculated. As the results obtained from the result aggregation unit are weighted, those weights are also incorporated into the calculation of the regression profile.
  • a defined threshold of 3-30 scene queries needs to be evaluated.
  • the results on those queries are then separated into regression windows which must not exceed a defined time interval (proportional to the time window of the queries).
  • Once a regression window contains at least a minimum of 3-20 results, a reliable correlation coefficient p can be calculated.
  • This correlation coefficient is usually combined with the signal strength of the detected results (as a percentage of the query descriptors), as it gives a second level of confidence in the match.
  • A match is declared if dc > threshold (e.g. threshold = 1).
  • The higher this recall threshold is chosen, the lower the probability of yielding false-positive matches and the higher the probability of losing a correct match.
  • any further evaluation of the query clip can be stopped and declared as a no-match in case the correlation coefficient is very low and falls below a defined threshold (e.g. 0.01 for the above example). If the coefficient falls in between the two thresholds the evaluation of the clip continues until either the correlation reaches above or falls below one of the two thresholds or the query stream ends. All matches which surpass the upper threshold are stored in a standard relational database system and can be displayed in a detailed report on which previously indexed database content the queried material was matched to.
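The two-threshold stopping rule can be sketched as a small loop over the values obtained after each evaluated scene query. The sketch assumes that the upper threshold is applied to the decision coefficient and the lower one to the correlation coefficient, that a minimum number of queries must be evaluated first, and that a stream ending between the thresholds counts as a no-match; the names and defaults are hypothetical.

```python
def evaluate_stream(steps, upper=1.0, lower=0.01, min_queries=3):
    """Return (verdict, queries_evaluated) for a list of (dc, rho) values.

    steps[k] holds the decision coefficient and correlation coefficient
    available after evaluating k + 1 scene queries.
    """
    for count, (dc, rho) in enumerate(steps, start=1):
        if count < min_queries:
            continue  # too few queries for a reliable coefficient
        if dc > upper:
            return "match", count
        if rho < lower:
            return "no-match", count
    # Assumption: an undecided stream is reported as a no-match at its end.
    return "no-match", len(steps)
```

Raising `upper` makes false positives less likely at the cost of missing some correct matches, as described above.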

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention relates to efficiently performing a close-duplicate search within large collections of data streams, preferably in the context of multimedia (audio and video files or streams). In a first aspect, the invention relates to a method for feeding information of a data item from a data file or a data stream into a database. In a second aspect, the invention relates to a method for tagging or identifying a data stream by querying an unknown data item or data stream against a database of known data streams. In a third aspect, the invention relates to a computer program or a suite of computer programs for operating the methods of this invention. The robustness of the multimedia identifier system of the present invention results from the use of high-dimensional descriptors describing the local interest points extracted from the frames of audio and/or video data.
PCT/IS2009/000014 2008-12-02 2009-12-02 Multimedia identifier WO2010064263A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US13/132,597 US9047373B2 (en) 2008-12-02 2009-12-02 Multimedia identifier
DK09801289.1T DK2370918T5 (da) 2008-12-02 2009-12-02 Multimedia identifier
EP09801289.1A EP2370918B1 (fr) 2008-12-02 2009-12-02 Multimedia identifier

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IS8771 2008-12-02
IS8771 2008-12-02

Publications (1)

Publication Number Publication Date
WO2010064263A1 true WO2010064263A1 (fr) 2010-06-10

Family

ID=42027955

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IS2009/000014 WO2010064263A1 (fr) 2008-12-02 2009-12-02 Multimedia identifier

Country Status (4)

Country Link
US (1) US9047373B2 (fr)
EP (1) EP2370918B1 (fr)
DK (1) DK2370918T5 (fr)
WO (1) WO2010064263A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012001485A1 (fr) * 2010-06-30 2012-01-05 Alcatel-Lucent Usa Inc. Method and apparatus for managing video content
US9578394B2 (en) 2015-03-25 2017-02-21 Cisco Technology, Inc. Video signature creation and matching
WO2017213705A1 (fr) * 2016-06-10 2017-12-14 Google Llc Using audio and video matching to determine the age of content
US10015541B2 (en) 2015-03-25 2018-07-03 Cisco Technology, Inc. Storing and retrieval heuristics

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012006740A1 (fr) * 2010-07-14 2012-01-19 Research In Motion Limited Methods and apparatus for animation smoothing
US20140015855A1 (en) * 2012-07-16 2014-01-16 Canon Kabushiki Kaisha Systems and methods for creating a semantic-driven visual vocabulary
US9524184B2 (en) 2012-07-31 2016-12-20 Hewlett Packard Enterprise Development Lp Open station canonical operator for data stream processing
US9990758B2 (en) * 2014-03-31 2018-06-05 Intel Corporation Bounding volume hierarchy generation using a heterogeneous architecture
US11269951B2 (en) * 2016-05-12 2022-03-08 Dolby International Ab Indexing variable bit stream audio formats
CN109684518B (zh) * 2018-11-02 2021-09-17 宁波大学 Nearest-neighbour query method for high-dimensional data using variable-length hash coding

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007141809A1 (fr) * 2006-06-06 2007-12-13 Haskolinn I Reykjavik Data mining using an index tree created by recursive projection of data points on random lines

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002007432A (ja) * 2000-06-23 2002-01-11 Ntt Docomo Inc Information retrieval system
US20070033163A1 (en) * 2003-05-30 2007-02-08 Koninklijke Philips Electronics N.V. Search and storage of media fingerprints
WO2007103583A2 (fr) * 2006-03-09 2007-09-13 Gracenote, Inc. Procédé et système de navigation entre des média
US20080189299A1 (en) * 2007-02-02 2008-08-07 Ulrich Karl Heinkel Method and apparatus for managing descriptors in system specifications
US20080201379A1 (en) * 2007-02-19 2008-08-21 Victor Company Of Japan, Ltd. Method and apparatus for generating digest of captured images
US8719288B2 (en) * 2008-04-15 2014-05-06 Alexander Bronstein Universal lookup of video-related data

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007141809A1 (fr) * 2006-06-06 2007-12-13 Haskolinn I Reykjavik Data mining using an index tree created by recursive projection of data points on random lines

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ALEXIS JOLY ET AL: "Content-Based Copy Retrieval Using Distortion-Based Probabilistic Similarity Search", IEEE TRANSACTIONS ON MULTIMEDIA, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 9, no. 2, 1 February 2007 (2007-02-01), pages 293 - 306, XP011157474, ISSN: 1520-9210 *
GENGEMBRE N ET AL: "A Probabilistic Framework for Fusing Frame-based Searches within a Video Copy Detection System", PROC. OF THE ACM INTERNATIONAL CONFERENCE ON IMAGE AND VIDEO RETRIEVAL, NIAGARA FALLS, CANADA, 7 July 2008 (2008-07-07), pages 211 - 220, XP002491420, ISBN: 978-1-60558-070-8, [retrieved on 20080707] *
KRISTLEIFUR DADASON, H. L. ET AL: "Eff2 Videntifier: Identifying Pirated Videos in Real-Time", 23 September 2007 (2007-09-23), pages 1 - 2, XP002574945, Retrieved from the Internet <URL:http://hal.archives-ouvertes.fr/docs/00/17/58/74/PDF/de21e-dadason.pdf> [retrieved on 20100322] *
ROYER J ET AL: "Multimedia interactive services automation based on content indexing", BELL LABS TECHNICAL JOURNAL, WILEY, CA, US, vol. 13, no. 2, 21 June 2008 (2008-06-21), pages 147 - 154, XP001514357, ISSN: 1089-7089 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012001485A1 (fr) * 2010-06-30 2012-01-05 Alcatel-Lucent Usa Inc. Method and apparatus for managing video content
JP2013536491A (ja) * 2010-06-30 2013-09-19 アルカテル−ルーセント Method and apparatus for managing video content
US9578394B2 (en) 2015-03-25 2017-02-21 Cisco Technology, Inc. Video signature creation and matching
US10015541B2 (en) 2015-03-25 2018-07-03 Cisco Technology, Inc. Storing and retrieval heuristics
WO2017213705A1 (fr) * 2016-06-10 2017-12-14 Google Llc Using audio and video matching to determine the age of content

Also Published As

Publication number Publication date
EP2370918B1 (fr) 2019-05-22
EP2370918A1 (fr) 2011-10-05
US9047373B2 (en) 2015-06-02
US20110302207A1 (en) 2011-12-08
DK2370918T5 (da) 2019-09-02
DK2370918T3 (da) 2019-08-19

Similar Documents

Publication Publication Date Title
EP2370918B1 (fr) Multimedia identifier
US20200401615A1 (en) System and methods thereof for generation of searchable structures respective of multimedia data content
EP3709184B1 (fr) Sample set processing method and apparatus, and sample search method and apparatus
US10956484B1 (en) Method to differentiate and classify fingerprints using fingerprint neighborhood analysis
US7151852B2 (en) Method and system for segmentation, classification, and summarization of video images
US9672217B2 (en) System and methods for generation of a concept based database
US9575969B2 (en) Systems and methods for generation of searchable structures respective of multimedia data content
US8594392B2 (en) Media identification system for efficient matching of media items having common content
US10831814B2 (en) System and method for linking multimedia data elements to web pages
KR20080075091A (ko) Storage of video analysis data for real-time alerting and forensic analysis
Trad et al. Large scale visual-based event matching
Liu et al. Integration of global and local information in videos for key frame extraction
US10180942B2 (en) System and method for generation of concept structures based on sub-concepts
US10360253B2 (en) Systems and methods for generation of searchable structures respective of multimedia data content
Doulamis et al. Efficient video summarization based on a fuzzy video content representation
EP2608059A1 (fr) Procédé et appareil pour prioriser des métadonnées
Ren et al. Hierarchical modeling and adaptive clustering for real-time summarization of rush videos in trecvid'08
Lin et al. Mining high-level features from video using associations and correlations
Bhaumik et al. Content coverage and redundancy removal in video summarization
Bailer Efficient Approximate Medoids of Temporal Sequences
Dimitrovski et al. Video Content-Based Retrieval System
Dutta et al. Indexing Video Database for a CBVCD System
Fegade et al. Content-based video retrieval by genre recognition using tree pruning technique
Qi et al. The study on the feedback of large scale content-based video retrieval
Viitaniemi et al. Concept-based video search with the PicSOM multimedia retrieval system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09801289

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2009801289

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 13132597

Country of ref document: US