US20140009314A1 - Efficient string hash computation - Google Patents

Efficient string hash computation

Info

Publication number
US20140009314A1
Authority
US
United States
Prior art keywords
string
hash value
original
computer
updated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/543,010
Inventor
Peter D. Bain
Peter W. Burka
Charles R. Gracie
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US13/543,010
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Assignment of assignors interest (see document for details). Assignors: BAIN, PETER D., BURKA, PETER W., GRACIE, CHARLES R.
Priority to US13/843,952 (patent US9019135B2)
Publication of US20140009314A1
Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • This invention relates to apparatus and methods for efficiently computing and recomputing hash values for strings.
  • Sequences of characters are used extensively in modern-day programming languages.
  • the Java runtime uses the String class extensively.
  • every string has a hash value computed over the contents of the string which is used to identify the string.
  • computing the hash value for strings can be computationally expensive.
  • because string objects are used heavily by the Java Virtual Machine (JVM) as well as applications running on the JVM, the hash function is invoked frequently. Operation of the hash function, therefore, consumes significant computational resources.
  • JVM: Java Virtual Machine
  • the hash value of the modified string needs to be recomputed. Like the original hash value computation, recomputing the hash value can be computationally expensive since the hash value is typically recomputed from scratch. Because string modifications may occur frequently, such recomputations may also occur frequently, consuming significant computational resources.
  • a method for efficiently computing a hash value for a string includes receiving an original string comprising multiple characters.
  • the method computes an original hash value for the original string.
  • the method produces an updated string by performing at least one of the following updates on the original string: adding leading/trailing characters to the original string; removing leading/trailing characters from the original string, and modifying characters of the original string while preserving the length of the original string.
  • the method then computes an updated hash value for the updated string by performing at least one operation on the original hash value, wherein the at least one operation takes into account the updates that were made to the original string.
  • FIG. 1 is a high-level block diagram showing one example of a computing system in which an apparatus and method in accordance with the invention may be implemented;
  • FIG. 2 is a high-level block diagram showing one example of an object-oriented managed runtime, in this example the Java Virtual Machine, comprising a hash module in accordance with the invention;
  • FIG. 3A shows a first scenario where a substring is concatenated to an existing string to produce an updated string;
  • FIG. 3B shows a technique for efficiently computing the hash value for the updated string illustrated in FIG. 3A.
  • FIG. 4A shows a second scenario where a substring is removed from an existing string to produce an updated string;
  • FIG. 4B shows a technique for efficiently computing the hash value for the updated string illustrated in FIG. 4A.
  • FIG. 5A shows a third scenario where a substring is modified within an existing string while preserving the length of the existing string;
  • FIG. 5B shows a technique for efficiently computing the hash value for the updated string illustrated in FIG. 5A.
  • the present invention may be embodied as an apparatus, system, method, or computer program product.
  • the present invention may take the form of a hardware embodiment, a software embodiment (including firmware, resident software, microcode, etc.) configured to operate hardware, or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.”
  • the present invention may take the form of a computer-usable storage medium embodied in any tangible medium of expression having computer-usable program code stored therein.
  • the computer-usable or computer-readable storage medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable storage medium may include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CDROM), an optical storage device, or a magnetic storage device.
  • a computer-usable or computer-readable storage medium may be any medium that can contain, store, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java™, Smalltalk, C++, or the like, conventional procedural programming languages such as the “C” programming language, scripting languages such as JavaScript, or similar programming languages.
  • Computer program code for implementing the invention may also be written in a low-level programming language such as assembly language.
  • Embodiments of the invention may be described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus, systems, and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer program instructions or code. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • in FIG. 1, one example of a computing system 100 is illustrated.
  • the computing system 100 is presented to show one example of an environment where an apparatus and method in accordance with the invention may be implemented.
  • the computing system 100 is presented only by way of example and is not intended to be limiting. Indeed, the apparatus and methods disclosed herein may be applicable to a wide variety of different computing systems in addition to the computing system 100 shown. The apparatus and methods disclosed herein may also potentially be distributed across multiple computing systems 100.
  • the computing system 100 includes at least one processor 102 and may include more than one processor 102.
  • the processor 102 may be operably connected to a memory 104.
  • the memory 104 may include one or more non-volatile storage devices such as hard drives 104a, solid state drives 104a, CD-ROM drives 104a, DVD-ROM drives 104a, tape drives 104a, or the like.
  • the memory 104 may also include non-volatile memory such as a read-only memory 104b (e.g., ROM, EPROM, EEPROM, and/or Flash ROM) or volatile memory such as a random access memory 104c (RAM or operational memory).
  • a bus 106, or plurality of buses 106, may interconnect the processor 102, memory devices 104, and other devices to enable data and/or instructions to pass therebetween.
  • the computing system 100 may include one or more ports 108.
  • Such ports 108 may be embodied as wired ports 108 (e.g., USB ports, serial ports, Firewire ports, SCSI ports, parallel ports, etc.) or wireless ports 108 (e.g., Bluetooth, IrDA, etc.).
  • the ports 108 may enable communication with one or more input devices 110 (e.g., keyboards, mice, touchscreens, cameras, microphones, scanners, storage devices, etc.) and output devices 112 (e.g., displays, monitors, speakers, printers, storage devices, etc.).
  • the ports 108 may also enable communication with other computing systems 100.
  • the computing system 100 includes a network adapter 114 to connect the computing system 100 to a network 116, such as a LAN, WAN, or the Internet.
  • a network 116 may enable the computing system 100 to connect to one or more servers 118, workstations 120, personal computers 120, mobile computing devices, or other devices.
  • the network 116 may also enable the computing system 100 to connect to another network by way of a router 122 or other device 122.
  • a router 122 may allow the computing system 100 to communicate with servers, workstations, personal computers, or other devices located on different networks.
  • a Java Virtual Machine 202 may be configured to operate on a specific platform, which may include an underlying hardware and operating system architecture 204, 206.
  • the Java Virtual Machine 202 receives program code 200, compiled to an intermediate form referred to as “bytecode” 200.
  • the Java Virtual Machine 202 translates this bytecode 200 into native operating system calls and machine instructions for execution on the underlying platform 204, 206.
  • the bytecode 200 may be compiled once to operate on all Java Virtual Machines 202.
  • a Java Virtual Machine 202, by contrast, may be tailored to the underlying hardware and software platform 204, 206. In this way, the Java bytecode 200 may be considered platform independent.
  • the Java runtime uses the String class extensively.
  • every string has a hash value computed over the contents of the string in order to identify the string.
  • Each time a string is modified such as by concatenating a substring to an existing string, removing a substring from the beginning or end of an existing string, or modifying a substring within an existing string that preserves the length of the string, the hash value for the modified string needs to be recomputed.
  • the functionality used to compute or recompute a hash value associated with a string will be referred to as a hash module 208.
  • while the hash module 208 is shown in a Java Virtual Machine 202, it should be recognized that the hash module 208 may also be adapted to programming languages and runtime environments other than Java. Thus, nothing in this disclosure should be interpreted to limit the hash module 208 to the Java Runtime Environment.
  • the hash module 208 may include one or more of a computation module 212, a determination module 214, and a recomputation module 216.
  • the computation module 212 may compute the hash value for the string from scratch.
  • a determination module 214 may determine the type of change that has occurred to the string. For example, the determination module 214 may determine whether a substring has been concatenated 218 to the existing string, a substring has been removed 220 from the beginning and/or end of the existing string, a substring has been modified 222 within the existing string while preserving the length of the existing string, or the like.
  • a recomputation module 216 may efficiently recompute the hash value for the updated string. In doing so, the recomputation module 216 may compute the hash value for the updated string by performing one or more operations on the original hash value of the original string. This recomputation may be less computationally intensive than recomputing the hash value for the updated string from scratch.
  • n-byte string S may be represented as follows:
  • the hash value H(S) may be computed using the following polynomial:
  • $H(S) = k^{n-1}\,s[0] + k^{n-2}\,s[1] + k^{n-3}\,s[2] + \cdots + k^{2}\,s[n-3] + k^{1}\,s[n-2] + k^{0}\,s[n-1]$
  • $k^{n-1}, k^{n-2}, k^{n-3}, \ldots, k^{2}, k^{1}, k^{0}$ are coefficients.
  • all addition is performed modulo g.
  • modulus g is equal to $2^{32}$ and the constant k is equal to 31.
  • $H(S) = k(k(k(\cdots k(k(k\,s[0] + s[1]) + s[2]) + s[3] \cdots) + s[n-3]) + s[n-2]) + s[n-1]$
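The two forms of the polynomial above can be compared with a minimal Python sketch. It assumes the string is given as a sequence of byte values and uses the stated values g = 2^32 and k = 31; the function names `poly_hash_direct` and `poly_hash_horner` are illustrative and do not come from the source.

```python
G = 2 ** 32   # modulus g, as stated in the text
K = 31        # constant k, as stated in the text

def poly_hash_direct(data: bytes) -> int:
    """Evaluate k^(n-1)*s[0] + ... + k^0*s[n-1] term by term, modulo g."""
    n = len(data)
    return sum(pow(K, n - 1 - i, G) * b for i, b in enumerate(data)) % G

def poly_hash_horner(data: bytes) -> int:
    """Evaluate the same polynomial in nested form: one multiply and one add per byte."""
    h = 0
    for b in data:
        h = (h * K + b) % G
    return h
```

Both functions return 96354 for `b"abc"`, which is also the value Java reports for `"abc".hashCode()`. The nested form avoids computing each power of k separately.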
  • $H(S.T) = k^{n+m-1}\,s[0] + k^{n+m-2}\,s[1] + k^{n+m-3}\,s[2] + \cdots + k^{m+2}\,s[n-3] + k^{m+1}\,s[n-2] + k^{m}\,s[n-1] + k^{m-1}\,t[0] + k^{m-2}\,t[1] + k^{m-3}\,t[2] + \cdots + k^{2}\,t[m-3] + k\,t[m-2] + t[m-1]$
  • the hash value of the concatenated string S.T may be computed as follows, as illustrated in FIG. 3B:
  • $H(S.T) = k^{m}\,H(S) + H(T)$
  • This equation may be extended to compute the hash value of more than two concatenated strings, such as the following equation which computes the hash value for three concatenated strings:
  • $H(S.T.U) = k^{(m+n)}\,H(S.T) + H(U)$
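As a rough illustration of the concatenation identity, the sketch below combines two already-computed hash values without rescanning the characters of S. It assumes byte strings, g = 2^32 and k = 31; `poly_hash` and `combine` are hypothetical helper names. For three strings, this sketch simply applies the two-string identity twice, scaling each time by k raised to the length of the string appended in that step (an assumption of the sketch).

```python
G = 2 ** 32   # modulus g
K = 31        # constant k

def poly_hash(data: bytes) -> int:
    """Reference hash computed from scratch."""
    h = 0
    for b in data:
        h = (h * K + b) % G
    return h

def combine(h_left: int, h_right: int, len_right: int) -> int:
    """Hash of left + right from the two hashes and the length of the right part."""
    return (pow(K, len_right, G) * h_left + h_right) % G

s, t, u = b"efficient ", b"string", b" hash"
h_st = combine(poly_hash(s), poly_hash(t), len(t))
h_stu = combine(h_st, poly_hash(u), len(u))
```

Here `h_st` equals `poly_hash(s + t)` and `h_stu` equals `poly_hash(s + t + u)`, while only the appended part is hashed character by character.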
  • the techniques described above may be used to compute the hash value of a long string in parallel. For example, consider a string S which is the concatenation of multiple substrings $S_0, S_1, \ldots, S_{f-1}, S_f$. Without loss of generality, assume that each substring is of length p.
  • the sub-hash values $H(S_0), H(S_1), \ldots, H(S_{f-1}), H(S_f)$ may be computed and combined as follows:
  • $H(S) = H(S_0)\,k^{pf} + H(S_1)\,k^{p(f-1)} + \cdots + H(S_{f-1})\,k^{p} + H(S_f)$
  • each of the terms $H(S_0)\,k^{pf}$, $H(S_1)\,k^{p(f-1)}$, …, $H(S_{f-1})\,k^{p}$, $H(S_f)$ may be processed by a different processor core.
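The chunked computation can be sketched as follows, assuming byte strings, g = 2^32 and k = 31. The names `chunked_hash` and `poly_hash` are illustrative. Unlike the text, the sketch lets the last chunk be shorter than p by weighting each sub-hash with k raised to the number of bytes that follow the chunk, which reduces to the exponents pf, p(f-1), …, p, 0 when all chunks have length p. A thread pool stands in for the separate processor cores; in CPython it typically shows only the structure, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

G = 2 ** 32   # modulus g
K = 31        # constant k

def poly_hash(data: bytes) -> int:
    """Hash of one chunk, computed from scratch."""
    h = 0
    for b in data:
        h = (h * K + b) % G
    return h

def chunked_hash(data: bytes, p: int, workers: int = 4) -> int:
    """Hash data by hashing chunks of length p independently, then combining."""
    if p <= 0:
        raise ValueError("chunk length p must be positive")
    chunks = [data[i:i + p] for i in range(0, len(data), p)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sub_hashes = list(pool.map(poly_hash, chunks))
    total = 0
    remaining = len(data)
    for chunk, h in zip(chunks, sub_hashes):
        remaining -= len(chunk)                 # bytes that follow this chunk
        total = (total + h * pow(K, remaining, G)) % G
    return total
```

For any input, `chunked_hash(data, p)` is expected to agree with `poly_hash(data)`.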
  • the sub-hash values may be computed in an interleaved fashion.
  • if the sub-hash values are computed in a four-way parallel fashion, the four sub-hash values may be computed as follows:
  • $H(S_0) = k^{n-1}\,s[0] + k^{n-5}\,s[4] + k^{n-9}\,s[8] + \cdots$
  • $H(S_1) = k^{n-2}\,s[1] + k^{n-6}\,s[5] + k^{n-10}\,s[9] + \cdots$
  • $H(S_2) = k^{n-3}\,s[2] + k^{n-7}\,s[6] + k^{n-11}\,s[10] + \cdots$
  • $H(S_3) = k^{n-4}\,s[3] + k^{n-8}\,s[7] + k^{n-12}\,s[11] + \cdots$
  • $S_0$ contains the first character of each substring in the string S
  • $S_1$ contains the second character of each substring in the string S
  • $S_2$ contains the third character of each substring in the string S
  • $S_3$ contains the fourth character of each substring in the string S.
  • $H(S) = H(S_0) + H(S_1) + H(S_2) + H(S_3)$
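A small sketch of the interleaved scheme, again assuming byte strings, g = 2^32 and k = 31, with hypothetical names `lane_hash` and `interleaved_hash`. Each lane visits every fourth character and keeps the power of k that the character has in the full polynomial, so the four lane values can simply be added.

```python
G = 2 ** 32   # modulus g
K = 31        # constant k

def poly_hash(data: bytes) -> int:
    """Reference hash computed from scratch."""
    h = 0
    for b in data:
        h = (h * K + b) % G
    return h

def lane_hash(data: bytes, lane: int, lanes: int = 4) -> int:
    """Sum k^(n-1-i) * s[i] over the indices i = lane, lane + lanes, ..."""
    n = len(data)
    total = 0
    for i in range(lane, n, lanes):
        total = (total + pow(K, n - 1 - i, G) * data[i]) % G
    return total

def interleaved_hash(data: bytes, lanes: int = 4) -> int:
    """Add the per-lane sub-hash values to obtain the hash of the whole string."""
    return sum(lane_hash(data, lane, lanes) for lane in range(lanes)) % G
```

The lanes are independent of one another, so each call to `lane_hash` could run on a separate core or vector lane.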
  • n-byte string S may be represented as follows:
  • substring T is of length m.
  • the hash value for the substring U may be computed as follows, as shown in FIG. 4B:
  • $H(U) = H(S) - k^{m}\,H(T)$
  • $H(S) = (H(T)\,k + H(U)) \mathbin{\%} g$
  • $H(T) = (H(S) - H(U) + u) / k$
  • u is a multiple of g selected in advance such that:
  • This equation may be applied recursively to compute the hash value when several characters are removed from the end of a string. Furthermore, by replacing k in the above equations with a power of k, multiple characters may be removed simultaneously.
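Removing a trailing character can be sketched as below, assuming byte strings, g = 2^32 and k = 31. The condition on u is not reproduced in the text above, so this sketch assumes u is the multiple of g that makes H(S) − H(U) + u exactly divisible by k, and it searches for that multiple at call time rather than selecting it in advance. The names `drop_last` and `poly_hash` are illustrative.

```python
G = 2 ** 32   # modulus g
K = 31        # constant k

def poly_hash(data: bytes) -> int:
    """Reference hash computed from scratch."""
    h = 0
    for b in data:
        h = (h * K + b) % G
    return h

def drop_last(h_s: int, last: int) -> int:
    """Given H(S) and the final byte of S, return H(T) where S is T followed by that byte."""
    x = (h_s - last) % G          # congruent to k * H(T) modulo g
    for j in range(K):            # candidate u = 0, g, 2g, ..., (k-1)g
        u = j * G
        if (x + u) % K == 0:      # assumed condition: divisible by k
            return (x + u) // K
    raise ArithmeticError("no suitable u; requires gcd(k, g) == 1")

data = b"trailing"
h = poly_hash(data)
h_shorter = drop_last(h, data[-1])   # hash of b"trailin"
```

Because k and g share no common factor, exactly one of the k candidate multiples works, and calling `drop_last` repeatedly removes several trailing characters one at a time.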
  • the original string S may be represented as follows:
  • the updated string S′ may be represented as follows:
  • s′[p] and s′[q] are the first and last characters respectively of the modified substring.
  • the hash value of the altered string S′ may be computed by examining the modified characters, such that:
  • the hash value of the updated string S′ may then be computed as follows, as shown in FIG. 5B:
  • $H(S') = H(S) + H(R)$
  • $H(R) = k^{p}\,(s'[p] - s[p]) + k^{p-1}\,(s'[p-1] - s[p-1]) + \cdots + k^{q+1}\,(s'[q+1] - s[q+1]) + k^{q}\,(s'[q] - s[q])$
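The length-preserving update can be illustrated with the sketch below, assuming byte strings, g = 2^32 and k = 31. The indexing here is an assumption of the sketch: positions are counted from 0 at the start of the string, so the character at position i carries the weight k^(n−1−i), whereas the formula above writes the exponent directly as the index. `rehash_after_edit` and `poly_hash` are hypothetical names.

```python
G = 2 ** 32   # modulus g
K = 31        # constant k

def poly_hash(data: bytes) -> int:
    """Reference hash computed from scratch."""
    h = 0
    for b in data:
        h = (h * K + b) % G
    return h

def rehash_after_edit(h_s: int, old: bytes, new: bytes, start: int, n: int) -> int:
    """Update H(S) after old is overwritten by new at 0-based index start; n = len(S)."""
    if len(old) != len(new):
        raise ValueError("replacement must preserve the string length")
    delta = 0                                    # plays the role of H(R)
    for offset, (a, b) in enumerate(zip(old, new)):
        exponent = n - 1 - (start + offset)      # weight of this position in H(S)
        delta += pow(K, exponent, G) * (b - a)
    return (h_s + delta) % G

original = b"efficient string hash"
updated = original[:10] + b"STRING" + original[16:]
h_updated = rehash_after_edit(poly_hash(original), original[10:16], b"STRING", 10, len(original))
```

Only the six modified positions are visited, yet `h_updated` matches `poly_hash(updated)`.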
  • each block in the block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions discussed in association with a block may occur in a different order than discussed. For example, two functions occurring in succession may, in fact, be implemented in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams, and combinations of blocks in the block diagrams may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Abstract

A method for efficiently computing a hash value for a string is disclosed. In one embodiment, such a method includes receiving an original string comprising multiple characters. The method computes an original hash value for the original string. The method produces an updated string by performing at least one of the following updates on the original string: adding leading/trailing characters to the original string; removing leading/trailing characters from the original string, and modifying characters of the original string while preserving the length of the original string. The method then computes an updated hash value for the updated string by performing at least one operation on the original hash value, wherein the at least one operation takes into account the updates that were made to the original string. A corresponding apparatus and computer program product are also disclosed.

Description

    BACKGROUND
  • 1. Field of the Invention
  • This invention relates to apparatus and methods for efficiently computing and recomputing hash values for strings.
  • 2. Background of the Invention
  • Sequences of characters, commonly referred to as “strings,” are used extensively in modern-day programming languages. For example, the Java runtime uses the String class extensively. In the Java runtime, every string has a hash value computed over the contents of the string which is used to identify the string. Because strings may be long, computing the hash value for strings can be computationally expensive. Furthermore, because string objects are used heavily by the Java Virtual Machine (JVM) as well as applications running on the JVM, the hash function is invoked frequently. Operation of the hash function, therefore, consumes significant computational resources.
  • Each time a string is modified, such as by concatenating a substring to an existing string, removing a substring from the beginning or end of an existing string, or modifying a substring within an existing string that preserves the length of the string, the hash value of the modified string needs to be recomputed. Like the original hash value computation, recomputing the hash value can be computationally expensive since the hash value is typically recomputed from scratch. Because string modifications may occur frequently, such recomputations may also occur frequently, consuming significant computational resources.
  • In view of the foregoing, what are needed are apparatus and methods to efficiently compute and recompute hash values for strings and other sequences of characters. Ideally, such apparatus and methods may be used to efficiently recompute hash values for modified strings without having to start from scratch.
  • SUMMARY
  • The invention has been developed in response to the present state of the art and, in particular, in response to the problems and needs in the art that have not yet been fully solved by currently available apparatus and methods. Accordingly, the invention has been developed to provide apparatus and methods for efficiently computing hash values for strings. The features and advantages of the invention will become more fully apparent from the following description and appended claims, or may be learned by practice of the invention as set forth hereinafter.
  • Consistent with the foregoing, a method for efficiently computing a hash value for a string is disclosed herein. In one embodiment, such a method includes receiving an original string comprising multiple characters. The method computes an original hash value for the original string. The method produces an updated string by performing at least one of the following updates on the original string: adding leading/trailing characters to the original string; removing leading/trailing characters from the original string, and modifying characters of the original string while preserving the length of the original string. The method then computes an updated hash value for the updated string by performing at least one operation on the original hash value, wherein the at least one operation takes into account the updates that were made to the original string.
  • A corresponding apparatus and computer program product are also disclosed and claimed herein.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • In order that the advantages of the invention will be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered limiting of its scope, the invention will be described and explained with additional specificity and detail through use of the accompanying drawings, in which:
  • FIG. 1 is a high-level block diagram showing one example of a computing system in which an apparatus and method in accordance with the invention may be implemented;
  • FIG. 2 is a high-level block diagram showing one example of an object-oriented managed runtime, in this example the Java Virtual Machine, comprising a hash module in accordance with the invention;
  • FIG. 3A shows a first scenario where a substring is concatenated to an existing string to produce an updated string;
  • FIG. 3B shows a technique for efficiently computing the hash value for the updated string illustrated in FIG. 3A.
  • FIG. 4A shows a second scenario where a substring is removed from an existing string to produce an updated string;
  • FIG. 4B shows a technique for efficiently computing the hash value for the updated string illustrated in FIG. 4A.
  • FIG. 5A shows a third scenario where a substring is modified within an existing string while preserving the length of the existing string;
  • FIG. 5B shows a technique for efficiently computing the hash value for the updated string illustrated in FIG. 5A.
  • DETAILED DESCRIPTION
  • It will be readily understood that the components of the present invention, as generally described and illustrated in the Figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the invention, as represented in the Figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of certain examples of presently contemplated embodiments in accordance with the invention. The presently described embodiments will be best understood by reference to the drawings, wherein like parts are designated by like numerals throughout.
  • As will be appreciated by one skilled in the art, the present invention may be embodied as an apparatus, system, method, or computer program product. Furthermore, the present invention may take the form of a hardware embodiment, a software embodiment (including firmware, resident software, microcode, etc.) configured to operate hardware, or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.” Furthermore, the present invention may take the form of a computer-usable storage medium embodied in any tangible medium of expression having computer-usable program code stored therein.
  • Any combination of one or more computer-usable or computer-readable storage medium(s) may be utilized to store the computer program product. The computer-usable or computer-readable storage medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable storage medium may include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CDROM), an optical storage device, or a magnetic storage device. In the context of this document, a computer-usable or computer-readable storage medium may be any medium that can contain, store, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java™, Smalltalk, C++, or the like, conventional procedural programming languages such as the “C” programming language, scripting languages such as JavaScript, or similar programming languages. Computer program code for implementing the invention may also be written in a low-level programming language such as assembly language.
  • Embodiments of the invention may be described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus, systems, and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer program instructions or code. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • The computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • Referring to FIG. 1, one example of a computing system 100 is illustrated. The computing system 100 is presented to show one example of an environment where an apparatus and method in accordance with the invention may be implemented. The computing system 100 is presented only by way of example and is not intended to be limiting. Indeed, the apparatus and methods disclosed herein may be applicable to a wide variety of different computing systems in addition to the computing system 100 shown. The apparatus and methods disclosed herein may also potentially be distributed across multiple computing systems 100.
  • As shown, the computing system 100 includes at least one processor 102 and may include more than one processor 102. The processor 102 may be operably connected to a memory 104. The memory 104 may include one or more non-volatile storage devices such as hard drives 104a, solid state drives 104a, CD-ROM drives 104a, DVD-ROM drives 104a, tape drives 104a, or the like. The memory 104 may also include non-volatile memory such as a read-only memory 104b (e.g., ROM, EPROM, EEPROM, and/or Flash ROM) or volatile memory such as a random access memory 104c (RAM or operational memory). A bus 106, or plurality of buses 106, may interconnect the processor 102, memory devices 104, and other devices to enable data and/or instructions to pass therebetween.
  • To enable communication with external systems or devices, the computing system 100 may include one or more ports 108. Such ports 108 may be embodied as wired ports 108 (e.g., USB ports, serial ports, Firewire ports, SCSI ports, parallel ports, etc.) or wireless ports 108 (e.g., Bluetooth, IrDA, etc.). The ports 108 may enable communication with one or more input devices 110 (e.g., keyboards, mice, touchscreens, cameras, microphones, scanners, storage devices, etc.) and output devices 112 (e.g., displays, monitors, speakers, printers, storage devices, etc.). The ports 108 may also enable communication with other computing systems 100.
  • In certain embodiments, the computing system 100 includes a network adapter 114 to connect the computing system 100 to a network 116, such as a LAN, WAN, or the Internet. Such a network 116 may enable the computing system 100 to connect to one or more servers 118, workstations 120, personal computers 120, mobile computing devices, or other devices. The network 116 may also enable the computing system 100 to connect to another network by way of a router 122 or other device 122. Such a router 122 may allow the computing system 100 to communicate with servers, workstations, personal computers, or other devices located on different networks.
  • As shown in FIG. 2, in the Java Runtime Environment, a Java Virtual Machine 202 may be configured to operate on a specific platform, which may include an underlying hardware and operating system architecture 204, 206. As shown, the Java Virtual Machine 202 receives program code 200, compiled to an intermediate form referred to as “bytecode” 200. The Java Virtual Machine 202 translates this bytecode 200 into native operating system calls and machine instructions for execution on the underlying platform 204, 206. Instead of compiling the bytecode 200 for the specific hardware and software platform 204, 206, the bytecode 200 may be compiled once to operate on all Java Virtual Machines 202. A Java Virtual Machine 202, by contrast, may be tailored to the underlying hardware and software platform 204, 206. In this way, the Java bytecode 200 may be considered platform independent.
  • As previously mentioned, the Java runtime uses the String class extensively. In the Java runtime, every string has a hash value computed over the contents of the string in order to identify the string. Each time a string is modified, such as by concatenating a substring to an existing string, removing a substring from the beginning or end of an existing string, or modifying a substring within an existing string that preserves the length of the string, the hash value for the modified string needs to be recomputed. For the purposes of this disclosure, the functionality used to compute or recompute a hash value associated with a string will be referred to as a hash module 208. While the hash module 208 is shown in a Java Virtual Machine 202, it should be recognized that the hash module 208 may also be adapted to programming languages and runtime environments other than Java. Thus, nothing in this disclosure should be interpreted to limit the hash module 208 to the Java Runtime Environment.
  • As shown, in certain embodiments, the hash module 208 may include one or more of a computation module 212, a determination module 214, and a recomputation module 216. When a string is initially created, the computation module 212 may compute the hash value for the string from scratch. When such a string is updated, however, a determination module 214 may determine the type of change that has occurred to the string. For example, the determination module 214 may determine whether a substring has been concatenated 218 to the existing string, a substring has been removed 220 from the beginning and/or end of the existing string, a substring has been modified 222 within the existing string while preserving the length of the existing string, or the like. Based on the type of change that has occurred to the existing string, a recomputation module 216 may efficiently recompute the hash value for the updated string. In doing so, the recomputation module 216 may compute the hash value for the updated string by performing one or more operations on the original hash value of the original string. This recomputation may be less computationally intensive than recomputing the hash value for the updated string from scratch.
  • In the following discussion associated with FIGS. 3A through 5B, various techniques will be described for computing the hash value for strings which are derived from other strings that already have their hash value computed. The following techniques avoid the need to recompute a hash value for an updated string from scratch, thereby increasing efficiency. Various equations will be presented below to illustrate these techniques. In these equations, the “%” symbol will be used to represent a modulus operator and the “.” symbol will be used to indicate string concatenation.
  • Referring to FIG. 3A, consider the case where a substring T is concatenated to an existing string S, such as where the string “g h i j” is concatenated to the end of the existing string “a b c d e f”. The n-byte string S may be represented as follows:

  • $S = \{s[0], s[1], s[2], \dots, s[n-2], s[n-1]\}$
  • where s[0], s[1], . . . , s[n−1] represent each of the characters of the string S.
  • The hash value H(S) may be computed using the following polynomial:

  • $H(S) = k^{n-1} s[0] + k^{n-2} s[1] + k^{n-3} s[2] + \dots + k^{2} s[n-3] + k^{1} s[n-2] + k^{0} s[n-1]$
  • where $k^{n-1}, k^{n-2}, k^{n-3}, \dots, k^{2}, k^{1}, k^{0}$ are coefficients. In certain embodiments, all addition is performed modulo g. In the case of Java, the modulus g is equal to $2^{32}$ and the constant k is equal to 31.
  • The polynomial illustrated above may be expressed in the form of Horner's rule as follows:

  • H(S)=k(k( . . . (k(k(ks[0]+s[1])+s[2])+s[3]) . . . +s[n−3])+s[n−2])+s[n−1]
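  • As an illustration only, the Horner's-rule form above can be sketched in Java as follows. The class name `PolyHash` and its method are hypothetical names introduced for this sketch; the only assumptions are k = 31 and g = 2^32, for which Java's wrapping `int` arithmetic supplies the modulus implicitly.

```java
final class PolyHash {
    static final int K = 31; // the constant k stated above for Java

    // Evaluates the polynomial with Horner's rule.
    // int arithmetic wraps on overflow, which is reduction modulo 2^32.
    static int hash(CharSequence s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = K * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "abcdef";
        // Compares the sketch against the hash the Java runtime reports.
        System.out.println(hash(s) == s.hashCode());
    }
}
```

  • Under these assumptions the sketch yields the same value as `String.hashCode()`, which is documented to use the same polynomial.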
  • Given two strings S and T of lengths n and m respectively, the hash value H(S.T) for the concatenated strings may be expressed as follows:

  • $H(S.T) = k^{n+m-1} s[0] + k^{n+m-2} s[1] + k^{n+m-3} s[2] + \dots + k^{m+2} s[n-3] + k^{m+1} s[n-2] + k^{m} s[n-1] + k^{m-1} t[0] + k^{m-2} t[1] + k^{m-3} t[2] + \dots + k^{2} t[m-3] + k\,t[m-2] + t[m-1]$
  • Assuming that H(S) and H(T) have already been computed, the hash value of the concatenated string S.T may be computed as follows, as illustrated in FIG. 3B:

  • $H(S.T) = k^{m} H(S) + H(T)$
  • The above equation avoids the need to recompute the hash value of the concatenated string S.T from scratch.
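  • A minimal Java sketch of this concatenation rule follows, assuming k = 31 and g = 2^32. The names `ConcatHash`, `pow31`, and `concat` are hypothetical and are not part of any Java API. The power of k is taken by square-and-multiply, so the work grows with the logarithm of the length of T rather than with the length of the combined string.

```java
final class ConcatHash {
    // Computes 31^e modulo 2^32 by square-and-multiply (int overflow wraps).
    static int pow31(int e) {
        int result = 1;
        int base = 31;
        while (e > 0) {
            if ((e & 1) == 1) {
                result *= base;
            }
            base *= base;
            e >>= 1;
        }
        return result;
    }

    // H(S.T) = k^m * H(S) + H(T), where m is the length of T.
    static int concat(int hashS, int hashT, int lengthT) {
        return pow31(lengthT) * hashS + hashT;
    }

    public static void main(String[] args) {
        int combined = concat("abcdef".hashCode(), "ghij".hashCode(), 4);
        System.out.println(combined == "abcdefghij".hashCode());
    }
}
```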
  • This equation may be extended to compute the hash value of more than two concatenated strings, such as the following equation which computes the hash value for three concatenated strings:

  • $H(S.T.U) = k^{|U|} H(S.T) + H(U)$, where $|U|$ is the length of the string U
  • In certain embodiments, the techniques described above may be used to compute the hash value of a long string in parallel. For example, consider a string S which is the concatenation of multiple substrings $S_0, S_1, \dots, S_{f-1}, S_f$. Without loss of generality, assume that each substring is of length p. The sub-hash values $H(S_0), H(S_1), \dots, H(S_{f-1}), H(S_f)$ may be computed and combined as follows:

  • $H(S) = H(S_0)\,k^{pf} + H(S_1)\,k^{p(f-1)} + \dots + H(S_{f-1})\,k^{p} + H(S_f)$
  • where each of the components $H(S_0)\,k^{pf}$, $H(S_1)\,k^{p(f-1)}$, . . . , $H(S_{f-1})\,k^{p}$, $H(S_f)$ may be processed by a different processor core.
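  • One way to picture the chunked computation is the following Java sketch, which uses a parallel stream as a stand-in for separate processor cores. The class `ChunkedHash`, its methods, and the chunk-length parameter `p` are assumptions made for illustration. Unlike the equation above, the sketch lets the final chunk be shorter than p by scaling each sub-hash by k raised to the number of characters that follow the chunk.

```java
import java.util.stream.IntStream;

final class ChunkedHash {
    // Computes 31^e modulo 2^32 by square-and-multiply.
    static int pow31(int e) {
        int result = 1;
        int base = 31;
        while (e > 0) {
            if ((e & 1) == 1) {
                result *= base;
            }
            base *= base;
            e >>= 1;
        }
        return result;
    }

    // Horner's rule over the half-open range [from, to).
    static int hashRange(String s, int from, int to) {
        int h = 0;
        for (int i = from; i < to; i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Hashes chunks of length p (p > 0) independently, then sums the scaled results.
    static int hash(String s, int p) {
        int n = s.length();
        int chunks = (n + p - 1) / p;
        return IntStream.range(0, chunks)
                .parallel()
                .map(i -> {
                    int from = i * p;
                    int to = Math.min(n, from + p);
                    // Scale by k^(number of characters after this chunk).
                    return hashRange(s, from, to) * pow31(n - to);
                })
                .sum(); // int addition wraps, i.e. addition modulo 2^32
    }

    public static void main(String[] args) {
        String s = "abcdefghij";
        System.out.println(hash(s, 3) == s.hashCode());
    }
}
```

  • Because addition modulo 2^32 is associative and commutative, the order in which the scaled sub-hash values are summed does not affect the result.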
  • Alternatively, the sub-hash values may be computed in an interleaved fashion. For example, assuming the sub-hash values are computed in a four-way parallel fashion, the four sub-hash values may be computed as follows:

  • $H(S_0) = k^{n-1} s[0] + k^{n-5} s[4] + k^{n-9} s[8] + \dots$

  • $H(S_1) = k^{n-2} s[1] + k^{n-6} s[5] + k^{n-10} s[9] + \dots$

  • $H(S_2) = k^{n-3} s[2] + k^{n-7} s[6] + k^{n-11} s[10] + \dots$

  • $H(S_3) = k^{n-4} s[3] + k^{n-8} s[7] + k^{n-12} s[11] + \dots$
  • where $S_0$ contains the first character of each substring in the string S, $S_1$ contains the second character of each substring in the string S, $S_2$ contains the third character of each substring in the string S, and $S_3$ contains the fourth character of each substring in the string S. Once the sub-hash values for $S_0$, $S_1$, $S_2$, and $S_3$ are calculated, the hash value for the string S may be computed by summing the results as follows:

  • $H(S) = H(S_0) + H(S_1) + H(S_2) + H(S_3)$
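  • The interleaved scheme can be sketched in Java with four independent accumulators, assuming k = 31 and g = 2^32. The class `InterleavedHash` is a hypothetical name, and the single loop below only models the four lanes; it does not by itself run them on separate hardware. Each lane advances by k^4 per character, and a final per-lane scaling restores the exact powers of k shown in the four equations above.

```java
final class InterleavedHash {
    // Computes 31^e modulo 2^32 for small e.
    static int pow31(int e) {
        int result = 1;
        for (int i = 0; i < e; i++) {
            result *= 31;
        }
        return result;
    }

    static int hash(CharSequence s) {
        int n = s.length();
        int k4 = pow31(4);
        int[] lane = new int[4];
        // Lane j sees characters j, j+4, j+8, ... and steps by k^4 each time.
        for (int i = 0; i < n; i++) {
            int j = i & 3;
            lane[j] = k4 * lane[j] + s.charAt(i);
        }
        int h = 0;
        for (int j = 0; j < 4 && j < n; j++) {
            // The last character of lane j is (n - 1 - j) % 4 positions from the end.
            h += lane[j] * pow31((n - 1 - j) % 4);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "abcdefghij";
        System.out.println(hash(s) == s.hashCode());
    }
}
```

  • Since no lane depends on another inside the loop, a vectorizing compiler or a hand-written SIMD version could in principle update all four accumulators at once.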
  • Referring to FIG. 4A, consider the case where a substring T is removed from a string S, leaving the substring U, such as where the leading substring “a b c d e f” is removed from the string “a b c d e f g h i j”, thereby leaving the string “g h i j”. The n-byte string S may be represented as follows:

  • S=T.U
  • where substring T is of length m.
  • Accordingly, the hash value for the substring U may be computed as follows, as shown in FIG. 4B:

  • $H(U) = H(S) - k^{n-m} H(T)$
  • where the hash value H(S) is known (assuming it has already been computed) and the hash value H(T) is unknown.
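  • A hedged Java sketch of removing a leading substring follows, assuming k = 31 and g = 2^32. It takes the power of k to be the number of characters remaining in U, which is what the concatenation rule H(T.U) = k^{|U|} H(T) + H(U) implies. The names `PrefixRemoval` and `dropPrefix` are hypothetical, and the sketch assumes the hash value of the removed substring T is available or has been computed separately.

```java
final class PrefixRemoval {
    // Computes 31^e modulo 2^32 by square-and-multiply.
    static int pow31(int e) {
        int result = 1;
        int base = 31;
        while (e > 0) {
            if ((e & 1) == 1) {
                result *= base;
            }
            base *= base;
            e >>= 1;
        }
        return result;
    }

    // H(U) = H(S) - k^{|U|} * H(T), where S = T.U; subtraction wraps modulo 2^32.
    static int dropPrefix(int hashS, int hashT, int lengthU) {
        return hashS - pow31(lengthU) * hashT;
    }

    public static void main(String[] args) {
        int h = dropPrefix("abcdefghij".hashCode(), "abcdef".hashCode(), 4);
        System.out.println(h == "ghij".hashCode());
    }
}
```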
  • To compute the hash value H(T) of the leading substring T, it can be shown how to compute H(T) when the length of U is one character. Since the value of the polynomial without the modulus operation is generally greater than g, the following equation generally applies:

  • H(S)=(H(T)k+H(U)) % g

  • or

  • H(T)k+H(U)=H(S)+m
  • where m is an integer multiple of g.
  • Rearranging the terms yields:

  • H(T)k=H(S)−H(U)+m
  • Dividing both sides of the equation by k yields:

  • $H(T) = \dfrac{H(S) - H(U) + m}{k}$
  • which leaves no remainder.
  • To find m, the remainder r may be calculated as follows:

  • r=(H(S)−H(U))% k
  • This in turn yields:

  • m=(k−r)u
  • where u is a multiple of g selected in advance such that:

  • u % k=1
  • This equation may be applied recursively to compute the hash value when several characters are removed from the end of a string. Furthermore, by replacing k in the above equations with a power of k, multiple characters may be removed simultaneously.
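  • The single-character case above can be sketched in Java as follows, assuming k = 31 and g = 2^32. The class `SuffixRemoval` and its constants are hypothetical names. The value chosen for u, 2^35, is an assumption of this sketch: it is a multiple of 2^32 and leaves remainder 1 when divided by 31, because 2^5 leaves remainder 1 and 2^35 is its seventh power. The arithmetic is carried out in 64-bit `long` values so that H(S) − H(U) + m does not overflow.

```java
final class SuffixRemoval {
    static final long K = 31;
    static final long U = 1L << 35; // multiple of g = 2^32 with U % K == 1

    // Returns H(T) given H(S) for S = T.c, where c is the single trailing character.
    static int dropLast(int hashS, char last) {
        // H(U) of a one-character string is the character itself.
        // diff is (H(S) - H(U)) reduced into the range [0, 2^32).
        long diff = Integer.toUnsignedLong(hashS - last);
        long r = diff % K;
        long m = (K - r) * U;         // multiple of g that makes diff + m divisible by k
        return (int) ((diff + m) / K); // exact division; the cast keeps the low 32 bits
    }

    public static void main(String[] args) {
        int h = dropLast("abcdefghij".hashCode(), 'j');
        System.out.println(h == "abcdefghi".hashCode());
    }
}
```

  • Calling `dropLast` once per trailing character corresponds to the recursive application described above.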
  • Referring to FIG. 5A, consider the case where a substring (indicated in the dotted box) within a string S is modified to yield an updated string S′ that preserves the length of the original string S. In the illustrated example, the substring “d e f g” within the string S is changed to “k l m n” to yield the updated string S′.
  • The original string S may be represented as follows:

  • S={s[0], s[1], s[2], . . . , s[p], s[p−1], . . . , s[q+1], s[q], . . . , s[n−2], s[n−1]}
  • where the characters between s[p] and s[q] are those that are to be modified.
  • The updated string S′ may be represented as follows:

  • S′={s[0], s[1],s[2] . . . s′[p], s′[p−1], . . . s′[q+1], s′[q] . . . s[n−2], s[n−1]}
  • where s′[p] and s′[q] are the first and last characters respectively of the modified substring.
  • The hash value of the altered string S′ may be computed by examining the modified characters, such that:

  • S′=S+R

  • where

  • R={0 . . . 0, s′[p]−s[p], s′[p−1]−s[p−1], . . . , s′[q+1]−s[q+1], s′[q]−s[q], 0 . . . 0}
  • The hash value of the updated string S′ may then be computed as follows, as shown in FIG. 5B:

  • H(S′)=H(S)+H(R)

  • where
  • $H(R) = k^{p}\,(s'[p] - s[p]) + k^{p-1}\,(s'[p-1] - s[p-1]) + \dots + k^{q+1}\,(s'[q+1] - s[q+1]) + k^{q}\,(s'[q] - s[q])$
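  • The length-preserving update can be sketched in Java as shown below, assuming k = 31 and g = 2^32. The names `InPlaceUpdate` and `replace` are hypothetical. The sketch indexes characters by their position from the start of the string, so the character at position i is weighted by k^(n−1−i); this is an assumption about how the powers in H(R) map onto positions.

```java
final class InPlaceUpdate {
    // Computes 31^e modulo 2^32 by square-and-multiply.
    static int pow31(int e) {
        int result = 1;
        int base = 31;
        while (e > 0) {
            if ((e & 1) == 1) {
                result *= base;
            }
            base *= base;
            e >>= 1;
        }
        return result;
    }

    // Returns the hash of s after overwriting s[start .. start+len) with replacement,
    // computed as H(S') = H(S) + H(R) without rehashing the unchanged characters.
    static int replace(int hashS, String s, int start, String replacement) {
        int n = s.length();
        int len = replacement.length();
        int weight = pow31(n - start - len); // power of k for the last modified character
        int delta = 0;                       // accumulates H(R)
        for (int i = start + len - 1; i >= start; i--) {
            delta += weight * (replacement.charAt(i - start) - s.charAt(i));
            weight *= 31;
        }
        return hashS + delta;
    }

    public static void main(String[] args) {
        String s = "abcdefghij";
        int h = replace(s.hashCode(), s, 3, "klmn");
        System.out.println(h == "abcklmnhij".hashCode());
    }
}
```

  • Only the modified characters are visited, so the cost depends on the length of the changed substring rather than on the length of S.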
  • The block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer-usable storage media according to various embodiments of the present invention. In this regard, each block in the block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions discussed in association with a block may occur in a different order than discussed. For example, two functions occurring in succession may, in fact, be implemented in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams, and combinations of blocks in the block diagrams, may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims (12)

1-10. (canceled)
11. A computer program product for efficiently computing a hash value for a string, the computer program product comprising a non-transitory computer-readable storage medium having computer-usable program code embodied therein, the computer-usable program code comprising:
computer-usable program code to receive an original string comprising a plurality of characters;
computer-usable program code to compute an original hash value for the original string;
computer-usable program code to produce an updated string by performing at least one of the following updates on the original string: add leading/trailing characters to the original string; remove leading/trailing characters from the original string, and modify characters of the original string; and
computer-usable program code to compute an updated hash value for the updated string by performing at least one operation on the original hash value, wherein the at least one operation takes into account the updates that were made to the original string.
12. The computer program product of claim 11, wherein producing an updated string comprises concatenating a new substring to the original string.
13. The computer program product of claim 12, further comprising computer-usable program code to compute a new hash value for the new substring.
14. The computer program product of claim 13, wherein computing the updated hash value comprises computing the updated hash value as a function of the original hash value and the new hash value.
15. The computer program product of claim 11, wherein producing an updated string comprises removing a substring from the original string.
16. The computer program product of claim 15, further comprising computer-usable program code to compute a hash value for the removed substring.
17. The computer program product of claim 16, wherein computing the updated hash value comprises computing the updated hash value as a function of the original hash value and the hash value of the removed substring.
18. The computer program product of claim 11, wherein producing an updated string comprises modifying a substring within the original string while preserving the length of the original string.
19. The computer program product of claim 18, further comprising computer-usable program code to compute a hash value for the modified substring.
20. The computer program product of claim 19, wherein computing the updated hash value comprises computing the updated hash value as a function of the original hash value and the hash value of the modified substring.
21. An apparatus for efficiently computing a hash value for a string, the apparatus comprising:
at least one processor;
at least one memory device coupled to the at least one processor and storing computer instructions for execution on the at least one processor, the computer instructions enabling the at least one processor to:
receive an original string comprising a plurality of characters;
compute an original hash value for the original string;
produce an updated string by performing at least one of the following updates on the original string: add leading/trailing characters to the original string; remove leading/trailing characters from the original string, and modify characters of the original string; and
compute an updated hash value for the updated string by performing at least one operation on the original hash value, wherein the at least one operation takes into account the updates that were made to the original string.
US13/543,010 2012-07-06 2012-07-06 Efficient string hash computation Abandoned US20140009314A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/543,010 US20140009314A1 (en) 2012-07-06 2012-07-06 Efficient string hash computation
US13/843,952 US9019135B2 (en) 2012-07-06 2013-03-15 Efficient string hash computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/543,010 US20140009314A1 (en) 2012-07-06 2012-07-06 Efficient string hash computation

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/843,952 Continuation US9019135B2 (en) 2012-07-06 2013-03-15 Efficient string hash computation

Publications (1)

Publication Number Publication Date
US20140009314A1 true US20140009314A1 (en) 2014-01-09

Family

ID=49878108

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/543,010 Abandoned US20140009314A1 (en) 2012-07-06 2012-07-06 Efficient string hash computation
US13/843,952 Expired - Fee Related US9019135B2 (en) 2012-07-06 2013-03-15 Efficient string hash computation

Family Applications After (1)

Application Number Title Priority Date Filing Date
US13/843,952 Expired - Fee Related US9019135B2 (en) 2012-07-06 2013-03-15 Efficient string hash computation

Country Status (1)

Country Link
US (2) US20140009314A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110505051A (en) * 2019-08-28 2019-11-26 无锡科技职业学院 Character string Hash processing method and processing device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9503442B1 (en) 2014-06-20 2016-11-22 EMC IP Holding Company LLC Credential-based application programming interface keys
EP3611647B1 (en) * 2018-08-15 2024-01-03 Ordnance Survey Limited Method for processing and verifying a document

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4588985A (en) 1983-12-30 1986-05-13 International Business Machines Corporation Polynomial hashing
US4899128A (en) * 1985-12-11 1990-02-06 Yeda Research And Development Co., Ltd. Method and apparatus for comparing strings using hash values
EP1207454A1 (en) 2000-11-15 2002-05-22 International Business Machines Corporation Java run-time system with modified linking identifiers
KR20050065976A (en) * 2003-12-26 2005-06-30 한국전자통신연구원 Apparatus and method for computing sha-1 hash function
US7783688B2 (en) * 2004-11-10 2010-08-24 Cisco Technology, Inc. Method and apparatus to scale and unroll an incremental hash function
US7747635B1 (en) 2004-12-21 2010-06-29 Oracle America, Inc. Automatically generating efficient string matching code
US7613701B2 (en) 2004-12-22 2009-11-03 International Business Machines Corporation Matching of complex nested objects by multilevel hashing
US7827384B2 (en) 2007-07-16 2010-11-02 Cisco Technology, Inc. Galois-based incremental hash module
US7982636B2 (en) 2009-08-20 2011-07-19 International Business Machines Corporation Data compression using a nested hierachy of fixed phrase length static and dynamic dictionaries
US8387003B2 (en) 2009-10-27 2013-02-26 Oracle America, Inc. Pluperfect hashing

Also Published As

Publication number Publication date
US20140012829A1 (en) 2014-01-09
US9019135B2 (en) 2015-04-28

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BAIN, PETER D.;BURKA, PETER W.;GRACIE, CHARLES R.;REEL/FRAME:028500/0123

Effective date: 20120705

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO PAY ISSUE FEE