US7523081B1 - Method and apparatus for producing a signature for an object - Google Patents
Method and apparatus for producing a signature for an object Download PDFInfo
- Publication number
- US7523081B1 US7523081B1 US11/387,265 US38726506A US7523081B1 US 7523081 B1 US7523081 B1 US 7523081B1 US 38726506 A US38726506 A US 38726506A US 7523081 B1 US7523081 B1 US 7523081B1
- Authority
- US
- United States
- Prior art keywords
- random
- floating
- signature
- feature
- state vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Links
Images
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0891—Revocation or update of secret information, e.g. encryption key update or rekeying
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/06—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
- H04L9/0643—Hash functions, e.g. MD5, SHA, HMAC or f9 MAC
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/06—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
- H04L9/065—Encryption by serially and continuously modifying data stream elements, e.g. stream cipher systems, RC4, SEAL or A5/3
- H04L9/0656—Pseudorandom key sequence combined element-for-element with data sequence, e.g. one-time-pad [OTP] or Vernam's cipher
- H04L9/0662—Pseudorandom key sequence combined element-for-element with data sequence, e.g. one-time-pad [OTP] or Vernam's cipher with particular pseudorandom sequence generator
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/14—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols using a plurality of keys or algorithms
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/00086—Circuits for prevention of unauthorised reproduction or copying, e.g. piracy
- G11B20/00166—Circuits for prevention of unauthorised reproduction or copying, e.g. piracy involving measures which result in a restriction to authorised contents recorded on or reproduced from a record carrier, e.g. music or software
- G11B20/00181—Circuits for prevention of unauthorised reproduction or copying, e.g. piracy involving measures which result in a restriction to authorised contents recorded on or reproduced from a record carrier, e.g. music or software using a content identifier, e.g. an international standard recording code [ISRC] or a digital object identifier [DOI]
Definitions
- the present invention relates to techniques for estimating similarity between complex objects. More specifically, the present invention relates to a method and an apparatus that estimates similarity between complex objects by comparing object signatures for the complex objects.
- Charikar has applied the object signature technique to estimate the similarity between arbitrarily complex objects (see Moses S. Charikar, “Similarity Estimation Techniques from Rounding Algorithms,” Proceedings of the 34th Annual ACM Symposium on Theory of Computing, 2002).
- Charikar's model first computes an object signature for an object in a streaming manner, such that the elements of the object are fed one-by-one through the model, while maintaining an internal state for the object. More specifically, the model applies a hashing operation to each of the elements in the object, and the hashed value of the element is used to update the internal state for the object. When all elements of the object have been processed, the model uses the final internal state to compute a signature for the object. Note that the internal state for the object requires only a small amount of space, which in practice is independent of the size of the object.
- Charikar's model has a drawback. Specifically, while generating the object signature, Charikar's model tends to overemphasize the influence of multiple occurrences of an identical feature in an object. In other words, when the same feature occurs multiple times in the object, the influence of that feature on the resulting object signature increases dramatically, thereby degrading the utility of the object signature for many types of operations, such as comparisons.
- One embodiment of the present invention provides a system that produces an object signature for an object, wherein the object comprises a set of features.
- the system first initializes a k-dimensional state vector ⁇ s 1 , s 2 , . . . , s k ⁇ containing floating-point numbers.
- the system (1) computes a random-number seed from the feature; (2) generates k pseudo-random floating-point numbers ⁇ X 1 , X 2 , . . .
- each X i (i ⁇ [1, k]) is generated in accordance with an ⁇ -stable distribution, wherein 1 ⁇ 2; and (3) updates each floating-point number s i in the k-dimensional state vector using an associated pseudo-random floating-point number X i .
- the system then produces the object signature for the object by condensing the k-dimensional state vector. Note that using an ⁇ -stable distribution with 1 ⁇ 2 to generate the k pseudo-random floating-point numbers for each feature reduces the influence of multiple occurrences of a given feature on the object signature.
- the system initializes the k-dimensional state vector by setting each s i to zero.
- the system computes the random-number seed from the feature by hashing the feature to produce the random-number seed.
- the system generates the k pseudo-random floating-point numbers by: (1) seeding a pseudo-random number generator (PRNG) with the random-number seed; and (2) generating the k pseudo-random floating-point numbers from the PRNG.
- PRNG pseudo-random number generator
- the system updates each floating-point number s i in the k-dimensional state vector using an associated pseudo-random floating-point number X i by: (1) multiplying each X i with a predetermined feature weight w, wherein w is associated with the feature; and (2) adding the weighted X i to s i , such that s i ⁇ s i +wX i .
- the system produces the object signature for the object by condensing the k-dimensional state vector into a k-bit object signature.
- the system condenses the k-dimensional state vector into a k-bit object signature by converting each floating-point number s i into a single bit f si within the k-bit object signature such that: (1) if s i ⁇ 0, f si is set to 0; and (2) if s i ⁇ 0, f si is set to 1.
- the system compares the object signatures for a first object and a second object to estimate the similarity between the first object and the second object.
- FIG. 1 illustrates exemplary processes for producing object signatures for different objects using Charikar's model.
- FIG. 2 presents a flowchart illustrating the process of producing an object signature for an object in accordance with an embodiment of the present invention.
- FIG. 3A illustrates the similarity estimation results for S and S′, wherein S′ is obtained by replacing a fraction of the text document S with an identical term in accordance with an embodiment of the present invention.
- FIG. 3B illustrates the similarity estimation results for S and S′, wherein S′ is obtained by replacing a fraction of S with new, unique terms in accordance with an embodiment of the present invention.
- FIG. 3C illustrates the similarity estimation results for S and S′, wherein S′ is obtained by removing a fraction of the text document S in accordance with an embodiment of the present invention.
- a computer-readable storage medium which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs) and DVDs (digital versatile discs or digital video discs).
- Charikar's model implements a “proximity-aware” hash function when condensing arbitrary weighted objects into associated object signatures.
- the hash function used by Charikar possesses a property that for two objects that are correlated with each other, the corresponding object signatures are also correlated.
- the correlation between the two object signatures which are both bit-strings is measured by the fraction of bit positions that agree. Hence, if the two objects are strongly correlated, their corresponding bit-strings will have a large overlap.
- the degree of correlation is normalized by computing a ratio of the magnitude of the intersection between two unweighted objects to the magnitude of the union of the two unweighted objects. This ratio is a number between zero and one, wherein completely uncorrelated objects give rise to a value of zero, whereas two identical objects give rise to a value of one.
- Charikar's model maintains a k-dimensional state vector for each object, wherein each object comprises a set of features and wherein k is a predetermined constant.
- This state vector is generally initialized before starting to compute the object signature of a new object.
- the process first computes an ordinary 32-bit hash of each feature in the object. This hash value is used to seed a pseudo-random number generator (PRNG).
- PRNG pseudo-random number generator
- the PRNG is configured to generate pseudorandom numbers in accordance with a Gaussian distribution with mean of zero and variance of one (i.e., N(0, 1) distribution).
- N(0, 1) distribution mean of zero and variance of one
- the model draws k pseudo-random numbers from the PRNG, wherein each of the pseudo-random number is a N(0,1) random variable. These pseudo-random numbers are then used to update the state vector.
- the contents of the state vector are used to construct the object signature for the object, which is a string of k
- FIG. 1 illustrates exemplary processes for producing object signatures for different objects using Charikar's model.
- Charikar's model 100 is used to generate object signatures for three different types of objects: a document 102 , an image file 104 , and database records 106 .
- Charikar's model generates distinct object signatures 108 , 110 , and 112 for objects 102 , 104 , and 106 respectively, wherein all the object signatures are k-bit strings.
- obtaining an object signature for an object involves taking into account the influence from each features in the object as well as repetitive occurrences of each feature in the object.
- FIG. 2 presents a flowchart illustrating the process of producing an object signature for an object in accordance with an embodiment of the present invention.
- object discussed herein can include but is not limited to: documents, database records and images.
- the system starts by receiving an object, for example, a text document (step 200 ).
- a text document can contain features such as terms and bi-grams, while an image file can contain features such as pixel elements or tiles.
- a more complex object comprises a larger number of features than a less complex object.
- a hundred-page text document generally contain more terms than a ten-page document.
- the system initializes the k-dimensional state vector by setting each term s i to zero. Note that the system can also initialize each term s i of the k-dimensional state vector to an initial value other than zero.
- the system computes a random-number seed from a feature in the set of features comprising the object (step 204 ). Specifically, the system applies a hash function on the feature, which generates a hash value or seed number associated with the feature, wherein the hash function is configured to generate independent, renormalized and distinct hash values for different features.
- the hash function is a collision-free hash function.
- each unique feature of the object can appear in the object multiple times. For example, if we process a hockey game-related text document in terms of bi-grams, the phrase “power play” is likely to occur more than once in the document. The system will process each occurrence of the same feature equally. Also note that, for multiple occurrences of the same feature, the system generates the same random-number seed using the hash function.
- the system then generates k pseudo-random numbers ⁇ X 1 , X 2 , . . . , X k ⁇ using the random-number seed, wherein each number X i is a floating-point number (step 206 ).
- the system seeds a pseudo-random number generator (PRNG) with the random-number seed computed from the feature and draws k pseudo-random numbers from the PRNG.
- PRNG pseudo-random number generator
- the PRNG is configured to generate each of the pseudo-random numbers X i in accordance with an ⁇ -stable distribution, wherein 1 ⁇ 2.
- An ⁇ -stable distribution has the property that the sum of independent ⁇ -stable distributed random variables is still an ⁇ -stable distributed random variable.
- the Cauchy distribution is characterized with a median of 0, and a half-width at half-maximum (HWHM) of 1.
- HWHM half-width at half-maximum
- the mean and variance of the Cauchy distribution are undefined.
- the system updates each term s i in the k-dimensional state vector using an associated pseudo-random number X i , (step 208 ). Specifically, the system first multiplies each X i with a predetermined feature weight w, wherein weight w is a user-provided weight associated with the feature. Next, the system adds the weighted pseudo-random number X i to term s i such that: s i ⁇ s i +wX i .
- the system repeats steps 204 - 208 for each occurrence of each feature of the object.
- the system processes the object in a streaming manner, such that elements of the object are processed sequentially.
- the influence of each element in the object is aggregated into the value of s i .
- s i contains the influences from all the features contained in the object.
- this process spreads out the influence of each feature across all k terms of the k-dimensional state vector.
- the addition step 208 reinforces the influence from that feature.
- the system produces the object signature for the object by condensing the k-dimensional state vector (step 210 ).
- the system condenses the k-dimensional state vector to a k-bit object signature. Specifically, the system converts each floating-point number s i into a single bit f si such that if s i ⁇ 0, f si is set to 0; if s i >0, f si is set to 1.
- the representation of object signatures from the k-dimensional state vectors is not limited to the k-bit string. For example, one can choose to convert each floating-point number s i in the state vector to two or more bits, thereby achieving higher resolution in object signature.
- the k-bit object signature of the object can now replace the original object so that various operations on the original object can be performed more efficiently on the bit-string. For example, one can compute a correlation between object signatures for two objects to estimate the similarity between the two objects. Specifically, computing the similarity between two bit-strings involves computing the Hamming distance between the two. As another example, when we need to find objects that are similar to a given object, we first compute the object signatures for these objects and compare them with the object signature of the given object. As yet another example, we can classify a set of objects into classes of similar objects by classifying the corresponding object signatures of the objects.
- k is chosen to be 64 or 128.
- the entries in the state vector are sums of independent random variables with an identical distribution. Also note that a sum of two independent random variables having an ⁇ -stable distribution with index ⁇ is still ⁇ -stable with the same index ⁇ .
- a Gaussian distribution is a 2-stable distribution, which means that for two normally distributed random variables X and Y with mean of zero, and standard deviations ⁇ and ⁇ , the sum of the two X+Y is still a Gaussian distribution.
- the new Gaussian distribution has a standard deviation of ⁇ square root over ( ⁇ 2 + ⁇ 2 ) ⁇ , which does not scale linearly with each individual random variable.
- a Cauchy distribution is 1-stable distribution, which means that for two Cauchy distributed random variables X and Y with mean of zero, the sum X+Y is still Cauchy distributed as 2 ⁇ .
- a new Cauchy distribution resulted from an addition operation scales linearly with each addition of another independent Cauchy distribution.
- the mean distribution does not obey the law of large numbers, and does not have an expectation. Consequently, by replacing the Gaussian distribution in the original Charikar's model with the Cauchy distribution for the random-number generation (step 206 ), the influence of multiple occurrences of a same feature on the object signature can be reduced.
- using an ⁇ -stable distribution with ⁇ somewhere between one and two can also improve the object signature accuracy in comparison to using the Gaussian distribution.
- FIG. 3 illustrates the results of comparing the performances between using the Cauchy distribution and the Gaussian distribution for similarity estimation of the two text documents S and S′ in accordance with an embodiment of the present invention.
- FIG. 3A illustrates the similarity estimation results for S and S′, wherein S′ is obtained by replacing a fraction of the text document S with an identical term in accordance with an embodiment of the present invention.
- S′ is generated from S by replacing a fraction f of positions in the set S with a same term “4711”, which simulates the effect of multiple occurrences of a same feature.
- Subplot 302 illustrates the result from using Cauchy distribution; while subplot 304 illustrates the result from the Gaussian approach.
- Cauchy distribution the result in subplot 302 demonstrates a linear decrease in similarity in response to the increasing fraction of difference f from zero to one, which is as expected.
- Gaussian distribution the result in subplot 304 demonstrates a highly nonlinear decrease in similarity with a linear increase of the fraction f.
- using Cauchy distribution improves the performance in this case.
- FIG. 3B illustrates the similarity estimation results for S and S′, wherein S′ is obtained by replacing a fraction of S with new, unique terms in accordance with an embodiment of the present invention.
- S′ is generated from S by replacing a fraction f of positions in the set S with new, unique terms, therefore the two sets are different in the fraction f.
- Subplot 306 illustrates the result from using Cauchy distribution; while subplot 308 illustrates the result from the Gaussian approach.
- Cauchy distribution the result in subplot 306 again demonstrates a linear decrease in similarity in response to the increasing fraction of difference f from zero to one, which is as expected.
- Gaussian distribution the result in subplot 308 demonstrates a highly nonlinear decrease in similarity with a linear increase of the fraction f Hence, using Cauchy distribution also improves the performance in this case.
- FIG. 3C illustrates the similarity estimation results for S and S′, wherein S′ is obtained by removing a fraction of the text document S in accordance with an embodiment of the present invention.
- S′ is generated from S by simply removing a fraction f of positions in the set S.
- Subplot 310 illustrates the result from using Cauchy distribution; while subplot 312 illustrates the result from the Gaussian approach. Note that there is no apparent performance difference between the two plots.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Power Engineering (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (24)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/387,265 US7523081B1 (en) | 2006-03-22 | 2006-03-22 | Method and apparatus for producing a signature for an object |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/387,265 US7523081B1 (en) | 2006-03-22 | 2006-03-22 | Method and apparatus for producing a signature for an object |
Publications (1)
Publication Number | Publication Date |
---|---|
US7523081B1 true US7523081B1 (en) | 2009-04-21 |
Family
ID=40550467
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/387,265 Expired - Fee Related US7523081B1 (en) | 2006-03-22 | 2006-03-22 | Method and apparatus for producing a signature for an object |
Country Status (1)
Country | Link |
---|---|
US (1) | US7523081B1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120023112A1 (en) * | 2010-07-20 | 2012-01-26 | Barracuda Networks Inc. | Method for measuring similarity of diverse binary objects comprising bit patterns |
US9317753B2 (en) * | 2008-03-03 | 2016-04-19 | Avigilon Patent Holding 2 Corporation | Method of searching data to identify images of an object captured by a camera system |
US11755373B2 (en) | 2020-10-07 | 2023-09-12 | Oracle International Corporation | Computation and storage of object identity hash values |
-
2006
- 2006-03-22 US US11/387,265 patent/US7523081B1/en not_active Expired - Fee Related
Non-Patent Citations (3)
Title |
---|
Andrei Z. Broder, Identitying and Filtering Near-Duplicate Documents, 1998, Proc. FUN, 1-8. * |
Andrei Z. Broder, On the resmblance and containment of documents, 1997, IEEE, Computer Society, 21-29. * |
Moses S. Charikar, Similarity techniques from Rounding Algorithms, 2002, ACM, 1-58113-495-9/02, 380-388. * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9317753B2 (en) * | 2008-03-03 | 2016-04-19 | Avigilon Patent Holding 2 Corporation | Method of searching data to identify images of an object captured by a camera system |
US9830511B2 (en) | 2008-03-03 | 2017-11-28 | Avigilon Analytics Corporation | Method of searching data to identify images of an object captured by a camera system |
US10339379B2 (en) | 2008-03-03 | 2019-07-02 | Avigilon Analytics Corporation | Method of searching data to identify images of an object captured by a camera system |
US11176366B2 (en) | 2008-03-03 | 2021-11-16 | Avigilon Analytics Corporation | Method of searching data to identify images of an object captured by a camera system |
US11669979B2 (en) | 2008-03-03 | 2023-06-06 | Motorola Solutions, Inc. | Method of searching data to identify images of an object captured by a camera system |
US20120023112A1 (en) * | 2010-07-20 | 2012-01-26 | Barracuda Networks Inc. | Method for measuring similarity of diverse binary objects comprising bit patterns |
US8463797B2 (en) * | 2010-07-20 | 2013-06-11 | Barracuda Networks Inc. | Method for measuring similarity of diverse binary objects comprising bit patterns |
US11755373B2 (en) | 2020-10-07 | 2023-09-12 | Oracle International Corporation | Computation and storage of object identity hash values |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Shen et al. | Multiview discrete hashing for scalable multimedia search | |
Pourkamali-Anaraki et al. | Preconditioned data sparsification for big data with applications to PCA and K-means | |
Wang et al. | Learning hash codes with listwise supervision | |
US8249361B1 (en) | Interdependent learning of template map and similarity metric for object identification | |
Halko et al. | An algorithm for the principal component analysis of large data sets | |
EP2638701B1 (en) | Vector transformation for indexing, similarity search and classification | |
US8533195B2 (en) | Regularized latent semantic indexing for topic modeling | |
Ross et al. | Normalized online learning | |
US8543598B2 (en) | Semantic object characterization and search | |
US8428397B1 (en) | Systems and methods for large scale, high-dimensional searches | |
Lee et al. | Maximum likelihood postprocessing for differential privacy under consistency constraints | |
US8184953B1 (en) | Selection of hash lookup keys for efficient retrieval | |
Safadi et al. | Descriptor optimization for multimedia indexing and retrieval | |
US9311403B1 (en) | Hashing techniques for data set similarity determination | |
US11972329B2 (en) | Method and system for similarity-based multi-label learning | |
US20150039538A1 (en) | Method for processing a large-scale data set, and associated apparatus | |
Tono et al. | Efficient DC algorithm for constrained sparse optimization | |
US9069634B2 (en) | Signature representation of data with aliasing across synonyms | |
US11138515B2 (en) | Data analysis device, data analysis method, and recording medium | |
Vitaladevuni et al. | Efficient orthogonal matching pursuit using sparse random projections for scene and video classification | |
CN115795065A (en) | Multimedia data cross-modal retrieval method and system based on weighted hash code | |
US7523081B1 (en) | Method and apparatus for producing a signature for an object | |
US20090094198A1 (en) | Method and system for identification of repeat print jobs using object level hash tables | |
JP4143234B2 (en) | Document classification apparatus, document classification method, and storage medium | |
Hong et al. | Fried binary embedding: From high-dimensional visual features to high-dimensional binary codes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: GOOGLE, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:LARS ENGEBRETSEN;REEL/FRAME:017721/0943 Effective date: 20060316 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
FPAY | Fee payment |
Year of fee payment: 8 |
|
AS | Assignment |
Owner name: GOOGLE LLC, CALIFORNIA Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044101/0610 Effective date: 20170929 |
|
FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
LAPS | Lapse for failure to pay maintenance fees |
Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20210421 |