US7801868B1 - Surrogate hashing - Google Patents
Surrogate hashing
- Publication number
- US7801868B1 (application US11/784,012)
- Authority
- US
- United States
- Prior art keywords
- file
- hash value
- data
- data contents
- url
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/45—Network directories; Name-to-address mapping
- H04L61/457—Network directories; Name-to-address mapping containing identifiers of data entities on a computer, e.g. file names
Definitions
- the present invention relates generally to software architecture. More specifically, surrogate hashing is described.
- the Internet, World Wide Web, and other types of data networks may be used to find information. Specific information is typically sought using these sources by conducting a search. Searches are conducted for various reasons such as research, education, personal interest, rights management, and others. However, while a large amount of information is available from various sources and services on these networks, the approach used by search service providers and the amount of data (either raw or returned in searches) renders conventional search techniques problematic with regard to accuracy, efficiency, and latency.
- File may refer to a physical or logical grouping of data and as such, the file may or may not exist physically. Files may also refer to directory structures or data. A file can have text associated with it such as a reference on a web page (e.g., link, in-line image, and the like), metadata attached to the file, or another resource with text in proximity to or associated with the file reference. If a search is performed using keywords that correspond to the associated text of the file, then the file or file location is delivered as a search result.
- This conventional approach is used when searching for files (such as an image file) on the Internet.
- the service provider's search engine has no knowledge of the contents of the file searched for. Instead, numerous results are returned based on text associated with the file intending to return files that accurately match a search request. However, the file is neither analyzed nor checked to ensure that it matches a user's desired search.
- an intellectual property rights management organization (e.g., law firm, agency) may use a conventional search engine to search a network such as the Internet for the image in question.
- Conventional techniques typically associate the word “Madonna” with an image file.
- automatic search solutions then attempt to analyze the text to determine whether the text indicates the image is similar to the image being sought.
- the analysis of text associated with a file is neither accurate nor efficient. With each search result returned, a user must download the file in its entirety and manually evaluate the file. In the example cited, this approach forces the user to wade through thousands of pictures of other Madonnas such as the biblical Mary.
- the image files often require additional manual review to determine which image files match a protected image of the popular singer. If a match is determined, then the image is identified as a copy and rights may be enforced. However, there may be additional copies of the protected image online; if the indicated text is not found associated with the file, then a match cannot be determined and rights may not be enforced.
- a company may be trying to determine if its computer program is being distributed illegally on a network. Leveraging conventional solutions, the company would search based on text possibly associated with the computer program (e.g., “Get ABC's computer program here for free”). Once again, the files returned in the search are neither analyzed nor checked by the search engine to ensure that they match a user's desired search. There may be copies of the computer program that are never returned in the search results because the copies are not associated with text or because the associated text does not match the search request. For returned search results, manual review of a large amount of data is again required to determine if the files found in a search match those of the proprietary computer application.
- FIG. 1 illustrates an exemplary system for surrogate hashing, in accordance with an embodiment.
- FIG. 2 illustrates an exemplary application architecture for surrogate hashing, in accordance with an embodiment.
- FIG. 3 illustrates an exemplary overall process for surrogate hashing, in accordance with an embodiment.
- FIG. 4A illustrates an exemplary overall process for surrogate hashing, in accordance with an embodiment.
- FIG. 4B illustrates exemplary processing of a URL from a Local URL collection, in accordance with an embodiment.
- FIG. 4C illustrates an exemplary process for parsing a URL, in accordance with an embodiment.
- FIG. 4D illustrates an alternative exemplary overall process for surrogate hashing, in accordance with an embodiment.
- FIG. 5 illustrates an exemplary computer system suitable for surrogate hashing, in accordance with an embodiment.
- Surrogate hashing may be performed by evaluating a sampling or portion (“portion”) of a file's data contents.
- surrogate hashing may refer to the selection of a standardized portion of a file to determine whether, based on hash values, a selected file is similar to another file. Standardization may be performed systematically and repeatedly to ensure the same portion is taken the next time an identical file is encountered so that hashes are comparable.
- a portion may be selected from one or multiple parts of a file, including the beginning, middle, or end of a file, or a combination thereof.
- the data chosen to comprise a portion may be sequential or non-sequential.
- a portion may also include the whole file.
- surrogate hashing may refer to hashing a portion of a file to determine if another file has the same hash value or set of values. One or more hash values may be generated from a portion to determine whether a given file matches another file.
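The standardized selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the sampling scheme (first, middle, and last 16 bytes), the choice of SHA-256, and the names `select_portion` and `surrogate_hash` are all hypothetical.

```python
import hashlib

CHUNK = 16  # 16 bytes = 128 bits, echoing the example size mentioned in the text


def select_portion(data: bytes, chunk: int = CHUNK) -> bytes:
    """Return a standardized portion: the first, middle, and last `chunk` bytes.

    Files too small to sample three separate chunks are returned whole,
    which matches the statement that a portion may also include the whole file.
    """
    if len(data) <= 3 * chunk:
        return data
    mid = (len(data) - chunk) // 2
    return data[:chunk] + data[mid:mid + chunk] + data[-chunk:]


def surrogate_hash(data: bytes) -> str:
    """Hash only the standardized portion, not the entire file."""
    return hashlib.sha256(select_portion(data)).hexdigest()
```

Because the same offsets are used every time, identical files always yield the same value. The trade-off is that two files differing only outside the sampled regions also yield the same value, which is why the choice of portion matters.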
- a file may be a group of data for various types of computing systems, including binary, tertiary, quantum, textual, hexadecimal, octal, and others.
- the group of data may represent an image, photo, graphic, video, audio, computer program or application (“application”), text, or some other data structure.
- a file may refer to a physical or logical grouping of data and as such, the file may or may not exist physically.
- a portion of a file may be analyzed to generate multiple (e.g., two (2) or more) hash values to identify a given file without the risk of collision. And in still other examples, multiple hash values may be concatenated together.
- More than one hash may be used to minimize the risk of collisions (i.e., a different file having the same hash value) and to avoid mistakenly identifying a file.
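As a rough sketch of generating multiple hash values from one portion and concatenating them, the following assumes SHA-1 and SHA-256 as the two algorithms; the function name `combined_hash` is hypothetical. Running two algorithms makes an accidental collision less likely but does not strictly eliminate it.

```python
import hashlib


def combined_hash(portion: bytes) -> str:
    """Concatenate the SHA-1 and SHA-256 hex digests of a single portion."""
    sha1 = hashlib.sha1(portion).hexdigest()      # 40 hex characters
    sha256 = hashlib.sha256(portion).hexdigest()  # 64 hex characters
    return sha1 + sha256
```

Two different portions would have to collide under both algorithms at once to produce the same combined value.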
- file identification may be performed quickly and accurately.
- Functions such as image searching, rights management, and others, may be performed without delay or omission errors (i.e., failing to return a match when a match should be indicated), and with few or no matching errors (i.e., mistakenly matching two different images).
- Surrogate hashing may be performed in various environments and is not limited to the use of Hosts, Uniform Resource Locators (“URLs”), crawlers, or the other exemplary environments described herein.
- FIG. 1 illustrates an exemplary system for surrogate hashing, in accordance with an embodiment.
- system 100 includes crawlers 102 - 106 , network 108 , content servers 110 - 118 , and storage system 120 .
- the number, type, configuration, and implementation of system 100 and the elements shown may be varied and are not limited to the examples given.
- system 100 may be used to implement the described file identification techniques but may be varied in design, implementation, configuration, and other aspects and features.
- Crawlers 102 - 106 may be implemented on computers and processors, including networked computing devices, notebook computers (i.e., laptops), mobile computing devices such as personal digital assistants, smart phones, or other wired or wireless computing devices.
- Content servers 110 - 118 may be implemented as application, web, or other types of servers that, when connected to a network, provide information at various locations and addresses (e.g., uniform resource locators (URLs)) accessible from network 108 .
- Crawlers 102 - 106 may be configured to process domains or hosts (“hosts”), web pages, or other data files (collectively referred to as “files”) located on content servers 110 - 118 , which is described in greater detail below in connection with FIGS. 4A-4D .
- URLs may be addresses or indicators of a file location regardless of system, network, or application protocol. Links may be references to URLs and are not limited to the example used.
- crawlers 102 - 106 may be computer programs or applications (“applications”) that are designed to search for content by processing files located at a given address and, in some examples, traversing links to other files at the given address according to various types of data processing techniques and structures (e.g., processing pages and links using a tree-structure, and others).
- Network 108 may be implemented as the Internet, a LAN, WAN, MAN, WLAN, or other type of data network over which data may be exchanged, transferred, downloaded, sent, received, and the like.
- the techniques described herein are not limited to the type of data network from which files are retrieved or the protocols used to support those networks and may be varied without limitation to the example shown.
- Storage 120 may be implemented using one or more physical or logical data stores, databases, storage arrays (e.g., SAN), redundant arrays of independent disks (e.g., RAID), data warehouses, clustered storage systems, storage systems using volatile and/or non-volatile storage, storage networks, or other type of data storage formats or facilities and may be varied without limitation to the example shown.
- a database management system may be used.
- relational database structures and languages may be implemented to enable files, portions of files, hashes, hash values, and other data relating to file searching, indexing, and management to be stored on storage 120 .
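One way to picture hashes and locations held in a relational store is the sketch below, which uses Python's built-in `sqlite3` module with an in-memory database. The table name `file_hashes`, its columns, and the helper functions are assumptions made for illustration; the source does not specify a schema.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table that maps hash values to file locations."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_hashes ("
        " hash_value TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " PRIMARY KEY (hash_value, url))"
    )
    return conn


def store_hash(conn: sqlite3.Connection, hash_value: str, url: str) -> None:
    """Record that the file at `url` produced `hash_value` (duplicates are ignored)."""
    conn.execute(
        "INSERT OR IGNORE INTO file_hashes (hash_value, url) VALUES (?, ?)",
        (hash_value, url),
    )


def find_locations(conn: sqlite3.Connection, hash_value: str) -> list:
    """Return every stored location whose file produced `hash_value`."""
    rows = conn.execute(
        "SELECT url FROM file_hashes WHERE hash_value = ? ORDER BY url",
        (hash_value,),
    )
    return [row[0] for row in rows]
```

The composite primary key lets the same hash value appear at several locations, which is the case of interest when looking for copies of a file.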
- techniques described herein may be implemented as software, hardware, circuitry, or a combination thereof.
- software may be implemented using various programming, scripting, formatting, or other computer programming languages, including C, C++, Java, machine code, assembly, Fortran, XML, HTML, and others.
- FIG. 2 illustrates an exemplary application architecture for surrogate hashing, in accordance with an embodiment.
- application 200 may include logic module 202 , input module 204 , crawler interface (I/F) 206 , hash module 208 , and database system I/F 210 .
- application 200 may be implemented as software, hardware, circuitry, or a combination thereof.
- software may be implemented using various programming, scripting, formatting, or other computer programming languages, including C, C++, Java, machine code, assembly, Fortran, XML, HTML, and others.
- Application 200 is not limited to any particular language or format and its design, architecture, implementation, and operation may be varied apart from the given description.
- logic module 202 may guide the operation of application 200 , receiving user input via input module 204 , sending/receiving data over crawler I/F 206 from crawlers 102 - 106 processing files found on content servers 110 - 118 ( FIG. 1 ), running hashing algorithms to generate hash values for files identified, and storing/retrieving data from storage 120 ( FIG. 1 ) using database system (DBS) I/F 210 .
- Logic module 202 may also provide some, all or none of the applications, structure, or functionality of crawlers 102 - 106 . As an example, a search may be initiated by providing a copy of the file desired to be found via input module 204 .
- a portion of the file is hashed (i.e., hash algorithms are run against the data in the portion of the file) to generate one or more hash values.
- more than one hashing algorithm may be run in order to reduce collisions (i.e., different values having the same hash value or set of values).
- multiple hash values are concatenated together to produce a stronger hash value.
- the hash values are compared to those stored in storage 120 . If the hash values generated for the file being sought match hash values of a file stored in storage 120 , a location for the file associated with the hash values stored in memory is provided. Thus, other copies of a file (i.e., authorized, unauthorized, copyrighted, or otherwise protected or unprotected) may be found.
- hash values stored in storage 120 are generated from portions of files found by crawlers 102 - 106 .
- crawlers 102 - 106 are directed to a location (e.g., website, URL, or other type of file address) and begin processing and traversing directories, links, URLs, and files associated with the given location.
- crawlers 102 - 106 (via crawler I/F 206 ) may continuously or non-continuously process and traverse directories, links, URLs, and files at various locations to continue to store hash values associated with files and locations (e.g., addresses, URLs, and the like) on storage 120 .
- Files may be manually or automatically provided using various types of interfaces (e.g., graphical user interface (GUI), a system administration interface, command line interface (CLI), and others).
- Logic module 202 may be configured to run one or more hashes (i.e., hashing algorithms) to generate one or more hash values associated with the file.
- two, three, or more hashes may be run instead of a single hash in order to minimize collisions (i.e., to avoid generating the same hash value for different files).
- a new hash value may be generated using one or more hashing algorithms that individually identify the different files without conflict. Further, by generating individualized hash values associated with a given value, a file may be accurately matched to a copy of the file. For example, storage 120 may have 80 billion hashes and locations (e.g., URLs). If a file is sought, a hash value is generated for the file, which is then used for a search of storage 120 to determine whether the same hash is found. If a match of the hash value or set of values for the file is found, the location is returned, which identifies the location of the file associated with the hash values stored in storage 120 .
- FIG. 3 illustrates an exemplary process for surrogate hashing, in accordance with an embodiment.
- File identification may be performed using the below-described process, which may also be varied and is not limited to the description provided.
- a file is received for a search ( 302 ).
- a file may be submitted using a user interface (UI), command line interface, or other application for providing the file to application 200 ( FIG. 2 ).
- a portion of the file is selected for analysis ( 304 ).
- portions are “standardized,” which refers to identifying a consistent set, part, or sub-set of data that is selected from a file.
- Standardized portions may be identical in size and location (e.g., 128 bits of data selected from the first (i.e., “front end”) 128 bits of a file) or may be identical to other files.
- the use of standardized portions ensures that substantially similar portions or segments of data are selected for evaluation to help enhance finding a match.
- “standardized” may be different and is not limited to the example given above.
- a standardized portion of data may be selected based on size or location of a discrete set, sub-set, part, or other group of data chosen from a file. For example, the first 128 bits of data of a file may be identified and used as a standardized portion that is selected from each file against which a hashing algorithm (e.g., MD 2 , MD 4 , MD 5 , SHA 1 , SHA 2 , and others) may be run. As another example, an extremely small portion (e.g., less than 128 bits) of data may be used as a standardized portion. In some examples, an extremely small portion of data or dataset may refer to any group or size of data that may be used to generate a hash value.
- dataset may refer to a collection of data without regard to structure, function, logic, or any attribute or characteristic other than collecting a group of data together.
- an “extremely small” portion of data may, in some examples, refer to the smallest group of data that may be used to generate a unique hash value.
- using extremely small portions of data enables rapid processing of portions (i.e., hashing) of files and, subsequently, rapid processing of a large population of files.
- data of any size may be used and is not limited to extremely small portions of data as described above.
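The speed benefit of small portions comes from not having to read the rest of the file. The sketch below illustrates this for a file on disk by seeking directly to the sampled regions. The function name `hash_file_ends`, the 16-byte chunk size, and the choice of head and tail as the sampled regions are assumptions for illustration.

```python
import hashlib
import os
import tempfile


def hash_file_ends(path: str, chunk: int = 16) -> str:
    """Hash the first and last `chunk` bytes of a file without reading the middle."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(chunk)
        if size > 2 * chunk:
            f.seek(-chunk, os.SEEK_END)  # jump straight to the tail
            tail = f.read(chunk)
        else:
            tail = f.read()  # small file: the remainder, so the whole file is hashed
    return hashlib.sha256(head + tail).hexdigest()


# Usage: two 1000-byte files that differ only in an unsampled middle byte.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.bin")
    path_b = os.path.join(tmp, "b.bin")
    payload = bytes(1000)
    with open(path_a, "wb") as f:
        f.write(payload)
    with open(path_b, "wb") as f:
        f.write(payload[:500] + b"\x01" + payload[501:])
    hash_a = hash_file_ends(path_a)
    hash_b = hash_file_ends(path_b)
```

Only 32 bytes are read per file regardless of file size. In this usage the two files produce the same value, which shows both the speed of the approach and its dependence on which regions are sampled.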
- one or more hashing algorithms are run against the standardized portion to generate one or more hash values ( 306 ). If one hashing algorithm is run, a single hash value may be produced. However, if multiple hashing algorithms are run, then multiple hash values are produced, which may be used individually or in combination to identify a given file. In some examples, multiple hashing algorithms are run to minimize collisions. Here, minimizing collisions refers to the process of generating one or more hash values to individually identify a file without the risk of another, different file having the same set of hash values. After generating the one or more hash values, stored hash values are searched to determine whether a match exists ( 308 ).
- An example of developing hash values for storage and use in searches is described below in connection with FIGS. 4A-4F .
- different techniques for finding, generating, and storing hash values may be implemented apart from those described in connection with FIGS. 4A-4F .
- a search is performed to determine if the same hash value or set of hash values exist ( 310 ). If the same hash value or set of hash values are not found in storage 120 , then the process ends. If the same hash value or set of hash values are found in storage 120 , then the location for the file associated with the hash value or set of hash values is returned ( 312 ). In other examples, the above-described process may be varied and is not limited to the description given.
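The search flow of FIG. 3 (steps 302 through 312) can be sketched as follows. The in-memory dictionary standing in for storage 120, the 16-byte front-end portion, SHA-256, and the function names are assumptions made for illustration only.

```python
import hashlib

PORTION_BYTES = 16  # 128 bits taken from the front of the file


def hash_portion(data: bytes) -> str:
    """Select the standardized portion (304) and hash it (306)."""
    return hashlib.sha256(data[:PORTION_BYTES]).hexdigest()


def find_file(data: bytes, stored: dict) -> list:
    """Receive a file (302), hash its portion, and search stored values (308, 310).

    Returns the stored locations for a match (312), or an empty list
    when no matching hash value exists.
    """
    value = hash_portion(data)
    return stored.get(value, [])
```

A caller would populate `stored` with hash values produced by the same `hash_portion` function, since values are comparable only when the portion and algorithm are identical on both sides.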
- FIG. 4A illustrates an exemplary overall process for surrogate hashing, in accordance with an embodiment.
- a crawler instance (i.e., an instantiation of a web crawler, bot, or substantially similar application) is registered with a storage facility (e.g., database, data warehouse, or the like) ( 402 ).
- Local variables are initialized, including hosts, Local URLs (i.e., URLs that link to other internal files of a host), and Foreign URLs (i.e., URLs that link to files on other hosts) collections ( 404 ).
- initialization of local variables may include other variables and collections used to decide if a URL should be processed currently or stored (i.e., in storage 120 ) for later processing instead of processing Local URLs or Foreign URLs.
- initialization of local variables may include variables and collections which support URLs being processed currently or URLs being stored for later processing. Initialization may be performed to make collections of local variables (e.g., Local URLs, Foreign URLs, hosts) available to determine whether a URL is included in a collection. In other embodiments, initialization of local variables may be performed differently. After local variables are initialized, a host is retrieved, including associated local URLs (e.g., links that lead to other pages associated with the location, URL, or website), for processing ( 406 ). The retrieved URL is then processed ( 408 ). Processing a URL against a Local URLs collection is described in greater detail below in connection with FIG. 4B .
- a determination is made as to whether another URL exists to be processed ( 410 ). If another URL is available for processing, then it is processed from the Local URLs collection ( 408 ). However, if no further URLs are detected for processing, then the local URLs are stored (in storage 120 ( FIG. 1 )) along with the hashed values associated with each local URL ( 412 ). Foreign URLs are also stored for future processing in storage 120 ( 414 ). The process then repeats with initializing local variables prior to retrieving another Host to process ( 404 ). In some embodiments, the above-described process may be performed repeatedly on some, none, or all URLs found by registered crawlers as directed. In other embodiments, the above-described process may be varied in design, implementation, execution, and is not limited to the example provided.
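The host-by-host loop of FIG. 4A can be sketched against a tiny in-memory stand-in for the network. Everything here is hypothetical: the `WEB` dictionary, its URLs, the `crawl` function, and the use of a plain dictionary for storage 120. Step numbers in the comments refer to the description above.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory "web": URL -> (file bytes, URLs linked from that file).
WEB = {
    "http://a.example/": (b"<html>index</html>",
                          ["http://a.example/cat.png", "http://b.example/"]),
    "http://a.example/cat.png": (b"\x89PNG fake image bytes", []),
    "http://b.example/": (b"<html>other host</html>", []),
}


def crawl(start_url: str, storage: dict) -> list:
    """Process one host at a time and return the hosts in processing order."""
    pending = deque([start_url])  # URLs stored for later processing
    seen = {start_url}
    hosts_processed = []
    while pending:
        first = pending.popleft()
        host = urlsplit(first).netloc                      # retrieve a host (406)
        local_urls, foreign_urls, hashes = deque([first]), [], {}  # initialize (404)
        hosts_processed.append(host)
        while local_urls:                                  # another URL? (410)
            url = local_urls.popleft()
            data, links = WEB.get(url, (b"", []))          # process the URL (408)
            hashes[url] = hashlib.sha256(data[:16]).hexdigest()
            for link in links:
                if link in seen:
                    continue
                seen.add(link)
                if urlsplit(link).netloc == host:
                    local_urls.append(link)
                else:
                    foreign_urls.append(link)
        storage.update(hashes)                             # store local URLs and hashes (412)
        pending.extend(foreign_urls)                       # store foreign URLs (414)
    return hosts_processed
```

Deferring foreign URLs keeps each pass confined to a single host, so results for that host can be written to storage in one batch before the next host begins.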
- FIG. 4B illustrates exemplary processing of a URL from a Local URL collection, in accordance with an embodiment.
- a file found at a given URL may be retrieved and hashed.
- a determination is made as to whether a file indicates there are additional files that need to be downloaded ( 420 ). If no further files are available for download, then a determination is made to download a standardized (i.e., as described above) portion of a file to be hashed ( 422 ). However, if a file contains data indicating other additional files for download (i.e., HTML, directory listing, or other), then the remainder of the file is downloaded ( 424 ). URLs are parsed to capture additional file location data indicated in 420 , as described in greater detail below in connection with FIG. 4C ( 426 ).
- the file is hashed to calculate hash values ( 428 ).
- the calculated hash values are then stored locally with the given URL for later storage in storage 120 ( 430 ).
- the above-described process may be varied and is not limited to the description provided above.
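The per-URL decision of FIG. 4B can be sketched as below. The in-memory `FILES` dictionary, the `fetch` helper, and the approach of sniffing the first bytes to decide whether a file is HTML are all assumptions for illustration; the source does not say how the determination at step 420 is made.

```python
import hashlib
import re

PORTION_BYTES = 16

# Hypothetical stand-in for files reachable over the network.
FILES = {
    "http://a.example/": b'<html><a href="/cat.png">cat</a></html>',
    "http://a.example/cat.png": b"\x89PNG\r\n\x1a\n" + bytes(64),
}

local_results = {}  # URL -> hash value, held locally for later storage (430)


def fetch(url: str, limit=None) -> bytes:
    """Return the file at `url`, or only its first `limit` bytes."""
    data = FILES[url]
    return data if limit is None else data[:limit]


def process_url(url: str):
    """Hash the standardized portion and collect links from link-bearing files."""
    head = fetch(url, PORTION_BYTES)              # standardized portion (422)
    links = []
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):  # (420)
        body = fetch(url)                         # download the remainder (424)
        links = [m.decode("ascii") for m in re.findall(rb'href="([^"]+)"', body)]  # (426)
    hash_value = hashlib.sha256(head).hexdigest() # (428)
    local_results[url] = hash_value               # (430)
    return hash_value, links
```

Files that cannot reference other files are never downloaded beyond the standardized portion, which is where most of the bandwidth saving would come from.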
- FIG. 4C illustrates an exemplary process for parsing a URL, in accordance with an embodiment.
- a URL is parsed out to break up an address into constituent parts in order to standardize the URL into a standard address form that can be checked against a collection ( 440 ).
- the URL is standardized into a given format for an address that can be checked against a collection ( 442 ).
- a determination is made as to whether the URL is in an existing collection ( 444 ). If the URL is not found in an existing collection (e.g., Local URLs, Foreign URLs, and others), then a determination is made as to whether the URL is Local or Foreign ( 446 ).
- If the URL is a local URL ( 446 ), then it is added to a Local URLs collection ( 448 ). If the URL is a foreign URL, then it is added to a Foreign URLs collection ( 450 ).
- a further determination is made as to whether there is another URL in the file ( 452 ). If another URL is found, then the process is repeated. If another URL is not found, then the process ends.
- the decision to process a URL currently or at a later time may be based on information other than if the URL is Local or Foreign.
- URLs may be processed currently or stored for later processing. Other data or collections may be used to support this decision.
- the above-described process may be varied and is not limited to the example shown and described.
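The parsing and classification steps of FIG. 4C can be sketched with Python's standard `urllib.parse` module. The normalization rules chosen here (resolve relative references, lowercase the host, drop the fragment, default an empty path to `/`) and the function names are assumptions; the source only says the URL is standardized into a form that can be checked against a collection.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def standardize(url: str, base: str) -> str:
    """Break a URL into parts (440) and rebuild it in one standard form (442)."""
    parts = urlsplit(urljoin(base, url))
    path = parts.path or "/"
    # Simplification: lowercasing the whole netloc also lowercases any userinfo.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def classify(urls, base: str, local_urls: set, foreign_urls: set) -> None:
    """Add each new URL to the Local or Foreign collection."""
    host = urlsplit(base).netloc.lower()
    for raw in urls:                                   # another URL in the file? (452)
        url = standardize(raw, base)
        if url in local_urls or url in foreign_urls:   # already in a collection? (444)
            continue
        if urlsplit(url).netloc == host:               # Local or Foreign? (446)
            local_urls.add(url)                        # (448)
        else:
            foreign_urls.add(url)                      # (450)
```

Standardizing before the membership check is what prevents the same file from being queued twice under two spellings of its address.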
- FIG. 4D illustrates an alternative exemplary overall process for surrogate hashing, in accordance with an embodiment.
- a first portion of a first file is hashed to generate (i.e., calculate) a first hash value ( 460 ).
- the hash value is stored (e.g., in storage 120 ( FIG. 1 )) ( 462 ).
- a URL is received and processed ( 464 ), from which a second file is retrieved ( 466 ).
- a second portion of the second file is hashed to generate (i.e., calculate) a second hash value ( 468 ).
- the first hash value and the second hash value are compared to determine whether they are substantially similar ( 470 ).
- determining whether the first hash value and the second hash value are substantially similar may include determining whether the first hash value and the second hash value are the exact same value. In other embodiments, determining whether the first and the second hash value are substantially similar may include the first and second hash values being different, albeit slightly.
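The comparison at step 470 can be sketched as a bit-level difference count with a tolerance. The function names and the `max_bits` parameter are hypothetical. A tolerance of zero corresponds to the exact-match case; a nonzero tolerance only illustrates the "different, albeit slightly" wording. With cryptographic hashes such as SHA-256, a small change in input produces an unrelated digest, so a nonzero tolerance would be meaningful only with a similarity-preserving hash, which the source does not specify.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    if len(a) != len(b):
        raise ValueError("digests must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def substantially_similar(a: bytes, b: bytes, max_bits: int = 0) -> bool:
    """Return True when the digests differ in at most `max_bits` bits."""
    return differing_bits(a, b) <= max_bits
```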
- FIG. 5 illustrates an exemplary computer system suitable for surrogate hashing, in accordance with an embodiment.
- computer system 500 may be used to implement computer programs, applications, methods, processes, or other software to perform the above-described techniques.
- Computer system 500 includes a bus 502 or other communication mechanism for communicating information, which interconnects subsystems and devices, such as processor 504 , system memory 506 (e.g., RAM), storage device 508 (e.g., ROM), disk drive 510 (e.g., magnetic or optical), communication interface 512 (e.g., modem or Ethernet card), display 514 (e.g., CRT or LCD), input device 516 (e.g., keyboard), and cursor control 518 (e.g., mouse or trackball).
- computer system 500 performs specific operations by processor 504 executing one or more sequences of one or more instructions stored in system memory 506 . Such instructions may be read into system memory 506 from another computer readable medium, such as static storage device 508 or disk drive 510 . In some examples, hard-wired circuitry may be used in place of or in combination with software instructions for implementation.
- Non-volatile media includes, for example, optical or magnetic disks, such as disk drive 510 .
- Volatile media includes dynamic memory, such as system memory 506 .
- Transmission media includes coaxial cables, copper wire, and fiber optics, including wires that comprise bus 502 . Transmission media can also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
- Computer readable media includes, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, carrier wave, or any other medium from which a computer can read.
- execution of the sequences of instructions may be performed by a single computer system 500 .
- two or more computer systems 500 coupled by communication link 520 may perform the sequence of instructions in coordination with one another.
- Computer system 500 may transmit and receive messages, data, and instructions, including program (i.e., application code) through communication link 520 and communication interface 512 .
- Received program code may be executed by processor 504 as it is received, and/or stored in disk drive 510 , or other non-volatile storage for later execution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/784,012 US7801868B1 (en) | 2006-04-20 | 2007-04-05 | Surrogate hashing |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/408,199 US7840540B2 (en) | 2006-04-20 | 2006-04-20 | Surrogate hashing |
US11/784,012 US7801868B1 (en) | 2006-04-20 | 2007-04-05 | Surrogate hashing |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/408,199 Continuation-In-Part US7840540B2 (en) | 2006-04-20 | 2006-04-20 | Surrogate hashing |
Publications (1)
Publication Number | Publication Date |
---|---|
US7801868B1 (en) | 2010-09-21 |
Family
ID=42733994
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/784,012 Expired - Fee Related US7801868B1 (en) | 2006-04-20 | 2007-04-05 | Surrogate hashing |
Country Status (1)
Country | Link |
---|---|
US (1) | US7801868B1 (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100023535A1 (en) * | 2008-07-23 | 2010-01-28 | Institute For Information Industry | Apparatus, method, and computer program product thereof for storing a data and data storage system comprising the same |
US8156132B1 (en) | 2007-07-02 | 2012-04-10 | Pinehill Technology, Llc | Systems for comparing image fingerprints |
US8171004B1 (en) | 2006-04-20 | 2012-05-01 | Pinehill Technology, Llc | Use of hash values for identification and location of content |
US8463000B1 (en) | 2007-07-02 | 2013-06-11 | Pinehill Technology, Llc | Content identification based on a search of a fingerprint database |
US8549022B1 (en) | 2007-07-02 | 2013-10-01 | Datascout, Inc. | Fingerprint generation of multimedia content based on a trigger point with the multimedia content |
US9020927B1 (en) * | 2012-06-01 | 2015-04-28 | Google Inc. | Determining resource quality based on resource competition |
US9020964B1 (en) | 2006-04-20 | 2015-04-28 | Pinehill Technology, Llc | Generation of fingerprints for multimedia content based on vectors and histograms |
US9037545B2 (en) | 2006-05-05 | 2015-05-19 | Hybir Inc. | Group based complete and incremental computer file backup system, process and apparatus |
US9781140B2 (en) * | 2015-08-17 | 2017-10-03 | Paypal, Inc. | High-yielding detection of remote abusive content |
2007-04-05: US application US11/784,012, granted as US7801868B1; not active (Expired - Fee Related)
Patent Citations (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5918223A (en) | 1996-07-22 | 1999-06-29 | Muscle Fish | Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information |
US6021491A (en) | 1996-11-27 | 2000-02-01 | Sun Microsystems, Inc. | Digital signatures for data streams and data archives |
US6212525B1 (en) | 1997-03-07 | 2001-04-03 | Apple Computer, Inc. | Hash-based system and method with primary and secondary hash functions for rapidly identifying the existence and location of an item in a file |
US5973692A (en) * | 1997-03-10 | 1999-10-26 | Knowlton; Kenneth Charles | System for the capture and indexing of graphical representations of files, information sources and the like |
US6052486A (en) * | 1997-03-10 | 2000-04-18 | Quickbut, Inc. | Protection mechanism for visual link objects |
US6098054A (en) | 1997-11-13 | 2000-08-01 | Hewlett-Packard Company | Method of securing software configuration parameters with digital signatures |
US7073197B2 (en) | 1999-05-05 | 2006-07-04 | Shieldip, Inc. | Methods and apparatus for protecting information |
US7302574B2 (en) | 1999-05-19 | 2007-11-27 | Digimarc Corporation | Content identifiers triggering corresponding responses through collaborative processing |
US20010044719A1 (en) | 1999-07-02 | 2001-11-22 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for recognizing, indexing, and searching acoustic signals |
US6671407B1 (en) | 1999-10-19 | 2003-12-30 | Microsoft Corporation | System and method for hashing digital images |
US6594665B1 (en) | 2000-02-18 | 2003-07-15 | Intel Corporation | Storing hashed values of data in media to allow faster searches and comparison of data |
US6704730B2 (en) | 2000-02-18 | 2004-03-09 | Avamar Technologies, Inc. | Hash file system and method for use in a commonality factoring system |
US20040064737A1 (en) | 2000-06-19 | 2004-04-01 | Milliken Walter Clark | Hash-based systems and methods for detecting and preventing transmission of polymorphic network worms and viruses |
US6952730B1 (en) | 2000-06-30 | 2005-10-04 | Hewlett-Packard Development Company, L.P. | System and method for efficient filtering of data set addresses in a web crawler |
US20020083060A1 (en) | 2000-07-31 | 2002-06-27 | Wang Avery Li-Chun | System and methods for recognizing sound and music signals in high noise and distortion |
US6963975B1 (en) | 2000-08-11 | 2005-11-08 | Microsoft Corporation | System and method for audio fingerprinting |
US7080253B2 (en) | 2000-08-11 | 2006-07-18 | Microsoft Corporation | Audio fingerprinting |
US7240207B2 (en) | 2000-08-11 | 2007-07-03 | Microsoft Corporation | Fingerprinting media entities employing fingerprint algorithms and bit-to-bit comparisons |
US7139747B1 (en) | 2000-11-03 | 2006-11-21 | Hewlett-Packard Development Company, L.P. | System and method for distributed web crawling |
US7460994B2 (en) | 2001-07-10 | 2008-12-02 | M2Any Gmbh | Method and apparatus for producing a fingerprint, and method and apparatus for identifying an audio signal |
US20030086341A1 (en) | 2001-07-20 | 2003-05-08 | Gracenote, Inc. | Automatic identification of sound recordings |
US7328153B2 (en) | 2001-07-20 | 2008-02-05 | Gracenote, Inc. | Automatic identification of sound recordings |
US20030191764A1 (en) | 2002-08-06 | 2003-10-09 | Isaac Richards | System and method for acoustic fingerpringting |
US20050172312A1 (en) | 2003-03-07 | 2005-08-04 | Lienhart Rainer W. | Detecting known video entities utilizing fingerprints |
US20040240562A1 (en) | 2003-05-28 | 2004-12-02 | Microsoft Corporation | Process and system for identifying a position in video using content-based video timelines |
US20070050761A1 (en) * | 2005-08-30 | 2007-03-01 | Microsoft Corporation | Distributed caching of files in a network |
US20070092103A1 (en) | 2005-10-21 | 2007-04-26 | Microsoft Corporation | Video fingerprinting using watermarks |
US20080317278A1 (en) | 2006-01-16 | 2008-12-25 | Frederic Lefebvre | Method for Computing a Fingerprint of a Video Sequence |
Non-Patent Citations (35)
Title |
---|
Brown, Sheree N., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/842,924, filed Sep. 10, 2009, 16 pages. |
Colan, Giovanna B., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/732,834, filed Apr. 10, 2009, 24 pages. |
Colan, Giovanna B., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/732,835, filed Apr. 15, 2009, 21 pages. |
Colan, Giovanna B., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/732,842, filed May 28, 2009, 19 pages. |
Corrielus, Jean M., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/732,836, filed Apr. 14, 2009, 18 pages. |
Cottingham, John; Notification of Transmittal of the International Search Report and the Written Opinion of the International Searching Authority, or the Declaration; International Application No. PCT/US07/09816; Date of Mailing Jun. 18, 2008; Form PCT/ISA/220 (2 pages); Form PCT/ISA/210 (2 pages); Form PCT/ISA/237 (6 pages). |
Lynch, Nancy; Malkhi, Dahlia; Ratajczak, David, Atomic Data Access in Distributed Hash Tables, 2002, pp. 295-305, LNCS 2429, Springer-Verlag Berlin Heidelberg. |
Reyes, Mariela D., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/408,199, filed Aug. 21, 2009, 27 pages. |
Reyes, Mariela D., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/408,199, filed Dec. 17, 2008, 24 pages. |
Reyes, Mariela D., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/732,832, filed Sep. 21, 2009, 27 pages. |
Reyes, Mariela, D.; U.S. Office Action and Information Disclosure Statement; U.S. Appl. No. 11/408,199, filed Jun. 11, 2008; 23 pages. |
Sayers, Craig; Eshghi, Kave, The Case for Generating URIs by Hashing RDF Content, Aug. 22, 2002, HPL-2002-216, HP Laboratories Palo Alto. |
Thai, Hanh B., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/732,833, filed Apr. 2, 2009, 25 pages. |
Thai, Hanh B., U.S. Patent and Trademark Office Non-Final Office Action, U.S. Appl. No. 11/732,838, filed Apr. 2, 2009, 23 pages. |
U.S. Appl. No. 11/408,199, filed Apr. 20, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/732,832, filed Apr. 5, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/732,833, filed Apr. 5, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/732,834, filed Apr. 5, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/732,835, filed Apr. 5, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/732,836, filed Apr. 5, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/732,838, filed Apr. 5, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/732,842, filed Apr. 5, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,789, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,815, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,846, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,924, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,957, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,960, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,963, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,973, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,982, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,983, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,995, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/824,996, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
U.S. Appl. No. 11/825,001, filed Jul. 2, 2007, Charles F. Kaminski, Jr. |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8185507B1 (en) | 2006-04-20 | 2012-05-22 | Pinehill Technology, Llc | System and method for identifying substantially similar files |
US9020964B1 (en) | 2006-04-20 | 2015-04-28 | Pinehill Technology, Llc | Generation of fingerprints for multimedia content based on vectors and histograms |
US8171004B1 (en) | 2006-04-20 | 2012-05-01 | Pinehill Technology, Llc | Use of hash values for identification and location of content |
US9037545B2 (en) | 2006-05-05 | 2015-05-19 | Hybir Inc. | Group based complete and incremental computer file backup system, process and apparatus |
US9679146B2 (en) | 2006-05-05 | 2017-06-13 | Hybir Inc. | Group based complete and incremental computer file backup system, process and apparatus |
US10671761B2 (en) | 2006-05-05 | 2020-06-02 | Hybir Inc. | Group based complete and incremental computer file backup system, process and apparatus |
US8463000B1 (en) | 2007-07-02 | 2013-06-11 | Pinehill Technology, Llc | Content identification based on a search of a fingerprint database |
US8549022B1 (en) | 2007-07-02 | 2013-10-01 | Datascout, Inc. | Fingerprint generation of multimedia content based on a trigger point with the multimedia content |
US8156132B1 (en) | 2007-07-02 | 2012-04-10 | Pinehill Technology, Llc | Systems for comparing image fingerprints |
US8204917B2 (en) * | 2008-07-23 | 2012-06-19 | Institute For Information Industry | Apparatus, method, and computer program product thereof for storing a data and data storage system comprising the same |
US20100023535A1 (en) * | 2008-07-23 | 2010-01-28 | Institute For Information Industry | Apparatus, method, and computer program product thereof for storing a data and data storage system comprising the same |
US9020927B1 (en) * | 2012-06-01 | 2015-04-28 | Google Inc. | Determining resource quality based on resource competition |
US10133788B1 (en) * | 2012-06-01 | 2018-11-20 | Google Llc | Determining resource quality based on resource competition |
US9781140B2 (en) * | 2015-08-17 | 2017-10-03 | Paypal, Inc. | High-yielding detection of remote abusive content |
Similar Documents
Publication | Title |
---|---|
US8185507B1 (en) | System and method for identifying substantially similar files |
US7814070B1 (en) | Surrogate hashing |
US7801868B1 (en) | Surrogate hashing |
US7747083B2 (en) | System and method for good nearest neighbor clustering of text |
US8495049B2 (en) | System and method for extracting content for submission to a search engine |
EP3251031B1 (en) | Techniques for compact data storage of network traffic and efficient search thereof |
US10540606B2 (en) | Consistent filtering of machine learning data |
US9619487B2 (en) | Method and system for the normalization, filtering and securing of associated metadata information on file objects deposited into an object store |
US8868569B2 (en) | Methods for detecting and removing duplicates in video search results |
US20070226207A1 (en) | System and method for clustering content items from content feeds |
US20170322930A1 (en) | Document based query and information retrieval systems and methods |
US8126859B2 (en) | Updating a local version of a file based on a rule |
US20090089278A1 (en) | Techniques for keyword extraction from urls using statistical analysis |
US20030106017A1 (en) | Computer-implemented PDF document management |
CN102982053A (en) | Detecting duplicate and near-duplicate files |
US8843442B2 (en) | Systems and methods for publishing datasets |
KR20080005491A (en) | Efficiently describing relationships between resources |
CN102414677A (en) | Data classification pipeline including automatic classification rules |
CN101158981A (en) | Method, system and device for classifying downloaded resource |
US20110137855A1 (en) | Music recognition method and system based on socialized music server |
CN113767390A (en) | Attribute grouping for change detection in distributed storage systems |
US11334592B2 (en) | Self-orchestrated system for extraction, analysis, and presentation of entity data |
Li et al. | Juxtapp and dstruct: Detection of similarity among android applications |
Richard et al. | Digital forensic tools: the next generation |
US20110320466A1 (en) | Methods and systems for filtering search results |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: DATASCOUT, INC., NEBRASKA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:KAMINSKI, JR., CHARLES F.;REEL/FRAME:019201/0241 Effective date: 20070116 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
AS | Assignment | Owner name: PINEHILL TECHNOLOGY, LLC, DELAWARE Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DATASCOUT, INC.;REEL/FRAME:026609/0668 Effective date: 20110718 |
FEPP | Fee payment procedure | Free format text: PAT HOLDER NO LONGER CLAIMS SMALL ENTITY STATUS, ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: STOL); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
FPAY | Fee payment | Year of fee payment: 4 |
AS | Assignment | Owner name: CONCERT DEBT, LLC, NEW HAMPSHIRE Free format text: SECURITY INTEREST;ASSIGNOR:PINEHILL TECHNOLOGY, LLC;REEL/FRAME:036432/0471 Effective date: 20150501 Owner name: CONCERT DEBT, LLC, NEW HAMPSHIRE Free format text: SECURITY INTEREST;ASSIGNOR:PINEHILL TECHNOLOGY, LLC;REEL/FRAME:036501/0779 Effective date: 20150801 |
AS | Assignment | Owner name: CONCERT DEBT, LLC, NEW HAMPSHIRE Free format text: SECURITY INTEREST;ASSIGNOR:CONCERT TECHNOLOGY CORPORATION;REEL/FRAME:036515/0471 Effective date: 20150501 Owner name: CONCERT DEBT, LLC, NEW HAMPSHIRE Free format text: SECURITY INTEREST;ASSIGNOR:CONCERT TECHNOLOGY CORPORATION;REEL/FRAME:036515/0495 Effective date: 20150801 |
FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.) |
FEPP | Fee payment procedure | Free format text: 7.5 YR SURCHARGE - LATE PMT W/IN 6 MO, LARGE ENTITY (ORIGINAL EVENT CODE: M1555); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |
AS | Assignment | Owner name: CONCERT TECHNOLOGY CORPORATION, NEW HAMPSHIRE Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:PINEHILL TECHNOLOGY, LLC;REEL/FRAME:051395/0368 Effective date: 20191203 |
FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
FP | Lapsed due to failure to pay maintenance fee | Effective date: 20220921 |