CN108197470A - Fast signature scan - Google Patents

Fast signature scan Download PDF

Info

Publication number
CN108197470A
CN108197470A CN201711339378.4A CN201711339378A CN108197470A CN 108197470 A CN108197470 A CN 108197470A CN 201711339378 A CN201711339378 A CN 201711339378A CN 108197470 A CN108197470 A CN 108197470A
Authority
CN
China
Prior art keywords
condition code
fingerprint
code
scanning
condition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711339378.4A
Other languages
Chinese (zh)
Inventor
王强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201711339378.4A priority Critical patent/CN108197470A/en
Priority claimed from CN201410055830.4A external-priority patent/CN103793522B/en
Publication of CN108197470A publication Critical patent/CN108197470A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0207Discounts or incentives, e.g. coupons or rebates
    • G06Q30/0225Avoiding frauds

Abstract

The method and system of scanning feature code on String field.In one embodiment, the present invention provides a kind of signature scan methods.The method includes one or more features code is processed into one or more forms, the form includes one or more fingerprints of each fixed length condition code or feature subcode and one or more follow-up searching data structures, so that the number of the fingerprint of each fixed length condition code or feature subcode is equal to the step-length of signature scan operation, and the specific fixed length condition code or feature subcode can be identified on any position in any scanned String field, receive specific character string field, identify any condition code included by the specific character string field, it is included in each using scanning step to scan the fingerprint on the position of spacing, with there are one or the position of multiple matched fingerprints on search the follow-up searching data structure, with any identified condition code of output.

Description

Fast signature scan
Technical field
The present invention relates to the condition codes in scanning String field.
Background technology
The object (such as file, program, webpage, Email, internet data packet or digital picture) of digital content can be with Include one or more String fields.One String field is a data value for typically representing word or executable code String.For example, an internet data packet can include network address, host name, Hypertext Transfer Protocol (HTTP) header, hypertext biography Defeated agreement message, e-mail attachment, Email Header and Email content.The size of one String field can be from several A byte is to millions of with last byte.One character string condition code can be a string of specific data values indicated completely or The expression formula (such as specific regular expression) of specific data value, the purpose is to be used for identifying a character string object (such as spy Fixed computer virus or specific gene order).Condition code can be stored in a sign code database.One word levies code Database can include multiple condition codes.The size of one character string condition code can be from several bytes to thousands of a bytes.
Character string condition code and String field are all the bit strings for including many basic units.One basic unit It is the minimum semantic unit that has, therefore scanning element is used as usually in signature scan technology.One basic unit it is big It is small by application depending on.For example, the base unit of English character string is typically 8 bits (i.e. a byte), and a computer disease The base unit of malicious condition code is typically a byte or nybble.
The basic unit of each condition code can be designated as being equal or different to some particular value or specific at some In the range of (such as in the digital scope 0 to 9 or in English alphabet range a to z).Each basic unit can be case-insensitive It is or case sensitive.Each basic unit can support simple logical operation (such as " non-").In addition, each condition code can wrap Include asterisk wildcard, for example, " * " (a random length asterisk wildcard) or "" (a fixed length asterisk wildcard), wherein " * " represents zero or appoints Multiple arbitrary basic units of anticipating and "" represent an arbitrary basic unit.It, can be into one for each random length feature code sign Step indicates its random length range.When a condition code includes random length character, the indefinite length of condition code.An if feature Code does not include random length character, and length is fixed.
One typical signature scan process may include on all possible position in a String field, compare Corresponding condition code in the String field and condition code database.Sweep speed is usually by the size of condition code and complexity Property limitation.In addition, sweep speed also by condition code is newer one by one can power limit.
Invention content
An embodiment of the present invention provides the method and systems of the scanning feature code on String field.In general, this hair The embodiment of bright one side provides character string signature scan method, and the method includes at one or more features code One or more forms are managed into, the form includes each fixed length feature subcode of each fixed length condition code or random length condition code One or more fingerprints and one or more follow-up searching data structures, it is special that one or more of fingerprints include specific fixed length Code or the j-th fingerprint of feature subcode are levied, the first basic unit of the j-th fingerprint is in the specific fixed length condition code or spy The remainder for levying the step-length of position in a scanning direction divided by the signature scan operation in subcode is equal to J, so that described The number of fingerprint is equal to the step-length of signature scan, and so that the specific fixed length condition code or feature subcode are swept any It can be identified on any position in the String field retouched, wherein each fingerprint includes specific fixed length condition code or spy One or more segments of subcode are levied, one or more of segments have in the specific fixed length condition code or feature subcode From anywhere in specific position, receive a specific character string field being made of data value, identify the specific character string Any condition code included by field is included in each using scanning step on the position of spacing, to scan the specific character string Field, with the one or more fingerprints of searching one or more features code and there are one or multiple matched fingers On the position of line, the specific character string field is searched, to search described in one or more follow-up searching data structures and output Any identified condition code in specific character string field.The other embodiments of the aspect of the present invention include the method Corresponding system, device and computer software product.
These and other embodiment optionally includes one or more following characteristics.Each fixed length condition code or feature subcode Have multiple fingerprints and it is described scanning be included in it is each using scanning step on the position of spacing, to scan the String field, with Multiple fingerprints of one or more features code are searched, including two or more fingerprints of parallel search.The one of special characteristic code Each fingerprint in a or multiple fingerprints is to indicate completely in former space or after one or more shadow spaces are projected to , the shadow space is space more wider array of than the original Space format, and the shadow space is arrived by introducing some ambiguities The original space, so that a fingerprint shadow in specific shadow space corresponds to one or more in the former space Fingerprint.
The character string signature scan method can further comprise on each position using scanning step as spacing, Specific character string field described in former spacescan, to search one or more fingerprints and first each using scanning step as spacing Position on, in each shadow space in one or more shadow spaces of one or more fingerprints, scanning it is described specific String field, to search one or more fingerprints, then again at least one of one or more shadow spaces shadow There are one in subspace or on the position of multiple identified fingerprints, one in identified fingerprint is verified in the former space It is a or two.Some ambiguities are introduced the upper case and lower case letter in former space is further comprised to become one to the former space All numbers from 0 to 9 in former space are become an identical number and the sky former space by not case sensitive letter Lattice and "-" become one or more of space or "-".
Scanning for one or more of fingerprints of one or more of condition codes can further comprise using one One or more of a or multiple hash tables and grand (Bloom) filter of one or more cloth are scanned.It scans for One or more of fingerprints of one or more of condition codes can further comprise using hashed value demultiplexer and One or more of one fingerprint length demultiplexer is scanned.The number of the different fingerprint length of multiple condition codes is comparable The number of the different characteristic code length of the multiple condition code is few, and the scanning can further comprise each with scanning step On the position of spacing, to scan the specific character string field, to search multiple fingerprints of the multiple condition code, including parallel Search the identical fingerprint of two or more length.One or more of fingerprints can be chosen, so that the fingerprint Length be restricted to the length of one group of one or more length being covered in one or more length ranges, to provide more points The finger scan of resolution.The length of each fingerprint can be the integral multiple of the step-length of signature scan operation.The character string Signature scan method can further comprise limited certainly using one or more content adressable memories (CAM) and one or more One or more of motivation (FA) is scanned, to search one or more of fingers of one or more of condition codes Line.
The character string signature scan method can further comprise each fingerprint of multiple fingerprints multiple condition codes One or more fingerprint sections are decomposed into, so that the number of the different fingerprint segment length of described document information is than described document information The number of different fingerprint length is few, on each position using scanning step as spacing, scans the specific character string field, with The multiple fingerprint section is searched, is synthesized including two or more fingerprint sections of parallel search and identified fingerprint section Any fingerprint matching.All fingerprint sections can have there are one identical length, and the scanning specific character string field is to look into Look for the step-length of the signature scan operation of integral multiple that multiple fingerprint sections can be the fingerprint segment length with one.One illustrates spy Determining the fingerprint section figure of one or more possible positions of the fingerprint section in any fingerprint can be stored in the particular fingerprint section Together, for identified fingerprint section is synthesized any fingerprint matching.Illustrate that the fingerprint of one or more possible length is long It is any for identified fingerprint section is synthesized together with degree information can be stored with the first segment of each fingerprint or each section Fingerprint matching.One or more finite automatas (FA) can be used to identified fingerprint section to synthesize any fingerprint matching.
The character string signature scan method can further comprise storing the matched probability of false positive of each fingerprint, There are one or the position of multiple matched fingerprints on, check the corresponding matched probability of false positive and when one or more When a matched probability of false positive in a matched probability of false positive is insufficient to low, the specific character string field is searched, To search one or more of follow-up searching data structures.The method can further comprise more corresponding to a fingerprint The different basic unit of the one or more of a fixed length condition code or feature subcode builds a difference searching data structure and looks into The specific character string field is looked for, to search one or more of follow-up searching data structures, one has been searched including difference Multiple fixed length condition codes or feature subcode corresponding to the fingerprint of identification.The method can further comprise each fixed length spy The one or more code film bits of encoded of code or feature subcode are levied, one or more of yards of film bits include illustrating one Or one or more code film bits of multiple following matching conditions:Without matching, if case sensitive, logic NOT predefines Range, logical operation and any range, one or more of yards of film bits include one or more following code film bits:One The code film bit of a or multiple basic units or sub- basic unit, the code film bit of one or more features subcode section and one Or the code film bit of multiple fixed length condition codes or feature subcode, search the specific character string field, with search it is one or Multiple follow-up searching data structures, including described by the fixed length condition code of code film bits of encoded or feature subcode to search.It is described Method can further comprise the specific character string field to standardize, including encoded specific character string field decoding, The specific character string field decompression compressed and one or more of unwanted string data deletion process.
In general, the embodiment of one aspect of the present invention provides character string signature scan method, the method packet It includes and each condition code in multiple condition codes is resolved into one or more features code section, receive a spy being made of data value Determine String field, scan the specific character string field, to search the multiple condition code section of the multiple condition code, packet Include parallel search two or more condition code sections, identified condition code section synthesize any condition code matching and Export any identified condition code in the specific character string field.The other embodiments of the aspect of the present invention include System corresponding to the method, device and computer software product.
Specific implementation may include one or more following characteristics.Scanning for the multiple condition code section can further wrap Include use:One or more hash tables, one or more Bloom filters, one or more have hashed value multiplexing or length multiplexing Or both hash table and it is one or more have hashed value be multiplexed or the Bloom filter of length multiplexing or both in one or It is multiple to be scanned.One condition code section for illustrating one or more possible positions of the special characteristic code section in any condition code Bitmap can be stored together with the special characteristic code section, identified condition code section to be used for synthesize any condition code Matching.Illustrate one or more possible feature code lengths condition code length information can further with each condition code first Section or each section of storage together, match for identified condition code section is synthesized any condition code.It is one or more Finite automata (FA) can be used to identified condition code section to synthesize any condition code matching.
In general, the embodiment of one aspect of the present invention provides character string signature scan method, the method packet It includes an one or more features code and is processed into one or more forms, including each random length in one or more features code Condition code is decomposed into multiple fixed length feature subcodes and one or more random length feature subcodes, what reception one was made of data value Specific character string field identifies any condition code included by the specific character string field, including scanning the specific character String field, with search multiple fixed length condition codes or feature subcode and there are one or multiple fixed length feature subcode quilts On the position of identification, the identified fixed length feature subcode is synthesized any random length condition code and output is described specific Any identified condition code in String field.Handling one or more features code can be further into one or more forms Including storing the location information of each fixed length feature subcode to static nature code composition rule database, the location information includes One order and one to next fixed length feature subcode distance range and including or do not include to each pair of fixed length feature subcode it Between random length feature subcode description and described identified fixed length feature subcode is synthesized any random length condition code Matching further comprise checking the location information of each identified fixed length feature subcode and verify or do not verify it is each pair of adjacent fixed One behavioral characteristics code synthetic state table of the random length feature subcode and update between long feature subcode.It is one or more Finite automata (FA) can be used to identified fixed length feature to synthesize any random length condition code.The institute of the present invention The other embodiments for stating aspect include the system corresponding to the method, device and computer software product.
In general, the embodiment of one aspect of the present invention provides character string signature scan method, the method packet It includes and selects multiple fixed length condition codes, the specific character string object for each character string object in one or more character string objects Multiple fixed length condition codes of part include j-th fixed length condition code, and the first basic unit of the j-th fixed length condition code is described The remainder of the step-length of position in a scanning direction divided by signature scan operation in specific character string object is equal to J, so as to So that the number of the fixed length condition code of the specific character string object is equal to the step-length of signature scan operation, and cause described Specific character string object can be identified on any position in any scanned String field, receive one by data It is worth the specific character string field of composition, identifies any character string object included by the specific character string field, is included in every It is a using scanning step on the position of spacing, to scan the specific character string field, to search one or more of character strings The multiple fixed length condition code of object, including described in two or more fixed length condition codes of parallel scan and output Any identified character string object in specific character string field.The method can further comprise for each character string object Multiple random length condition codes based on multigroup nonoverlapping orderly fixed length condition code are selected, in the multiple random length condition code Each random length condition code include a fixed length condition code in every group of fixed length condition code and one a pair of adjacent fixed length The random length condition code section that condition code connects, so that the number of the condition code of each character string object is equal to Sn, Middle S is the number of the condition code of scanning step or every group of fixed length condition code and n is the group number of fixed length condition code and identification institute It states any character string object in specific character string field to further comprise scanning the specific character string field, to search one Or multiple random length condition codes of multiple character string objects multiple fixed length feature subcodes and there are one or it is multiple matched fixed On the position of long feature subcode, identified fixed length feature subcode is synthesized any random length condition code.
In general, the embodiment of one aspect of the present invention provides character string signature scan method, the method packet It includes and selects one or more fixed length condition codes for each character string object in one or more character string objects, one Or one or more of fixed length condition codes of multiple character string objects are processed into one or more forms, the form includes every One or more fingerprints of a fixed length condition code and one or more follow-up searching data structures, the specific character string object Multiple fingerprints include j-th fingerprint, and the first basic unit of fingerprint described in the j-th is in the specific character string object Position divided by the remainder of the step-length of signature scan operation in a scanning direction is equal to J, so that the specific character string The fingerprint number of object is equal to the step-length of signature scan operation, and causes the specific character string object any scanned It can be identified on any position in String field, wherein each fingerprint includes the institute of the specific character string object State one or more segments of the specific fixed length condition code in one or more fixed length condition codes, one or more of segment tools There is the specific position from anywhere in the specific fixed length condition code, receive a specific character string being made of data value Field identifies any character string object in the specific character string field, is included in the position that each scanning step is spacing On, the specific character string field is scanned, to search multiple fingerprints of one or more of character string objects, wherein wrapping Include two or more described fingerprints of parallel search and there are one or the position of multiple matched fingerprints on, search The specific character string field, to search appointing in the follow-up searching data structure and the output specific character string field The character string object of what identification.The method can further comprise not being overlapped based on multigroup for the selection of each character string object is multiple Orderly fixed length condition code random length condition code, each random length condition code in the multiple random length condition code includes A fixed length condition code and a random length that a pair of adjacent fixed length condition code is connected in every group of fixed length condition code Condition code section fingerprint, so that the number of the random length condition code of each character string object is equal to determining for each group fixed length condition code Any character string object included by the product of the number of long condition code and the identification specific character string field includes sweeping The specific character string field is retouched, to search multiple fingerprints of one or more character string objects, including with parallel search two Or more than two fingerprints and there are one or the position of multiple matched fingerprints on, the specific character string field is searched, to look into Look for the follow-up searching data structure.The other embodiments of the aspect of the present invention include the system corresponding to the method, Device and computer software product.
In general, one aspect of the invention provides a character string signature scan system, and the system comprises one A machine readable storage device including computer program product and one or more executable computer program products and behaviour Make the processor for including following one or more modules:One can be processed into one or more features code including each fixed length spy Levy one or more fingerprints of each fixed length feature subcode of code or random length condition code and one or more follow-up searching datas The condition code preprocessing module of one or more forms of structure, one or more of fingerprints include specific fixed length condition code or The j-th fingerprint of feature subcode, the first basic unit of the j-th fingerprint is in the fixed length condition code or feature subcode Position divided by the remainder of the step-length of signature scan operation in a scanning direction is equal to J, so that the number of the fingerprint Equal to the step-length of signature scan, and cause the fixed length condition code or feature subcode in any scanned String field In any position on can be identified, wherein each fingerprint include one of specific fixed length condition code or feature subcode or Multiple segments, one or more of segments have the spy from anywhere in the specific fixed length condition code or feature subcode Positioning is put, and an input String field processing that one can be made of data value is the form needed for one or more scan Scanning pre-processing engine and one can identify one or more of one or more features code on the input String field The finger scan engine of a fingerprint, the identification are included on the position that each scanning step is spacing, scan the input word String field is accorded with, to search one or more of fingerprints of one or more of condition codes.The system can further comprise The fixed length of the fixed length feature subcode of one fixed length condition code that can recognize that corresponding to identified fingerprint or random length condition code is special Levy code Lookup engine.The system can further comprise a fixed length feature subcode included identified random length condition code Synthesize the random length condition code Lookup engine of the recognizable random length condition code of any random length condition code.The present invention's is described The other embodiments of aspect include the method corresponding to the system, device and computer software product.
Specific implementation may include one or more following characteristics.One or more may be selected in described document information preprocessing module An one or more fingerprints are simultaneously projected to one or more of shadow spaces and go to scan by shadow space.Condition code preprocessing module Each fingerprint in one or more fingerprints can be decomposed into one or more fingerprint sections of one or more length, and each The fingerprint composite signal of fingerprint section stores to a fingerprint database and the finger scan engine and can recognize that the input character Multiple fingerprints of one or more of string field condition code, the identification are included in each position using scanning step as spacing On, scan the input String field, with search multiple fingerprint sections and there are one or multiple identified fingerprint sections position It puts, identified fingerprint section is synthesized any fingerprint matching.
Described document information preprocessing module can use one or more code film bits, special to the one or more of a condition code Sign code section is encoded, and the one or more features code section of one or more of yards of film bits and described document information is stored Together.Described document information preprocessing module can build a difference with the different basic unit of the one or more of multiple condition codes Divide searching data structure.Described document information preprocessing module can build one and include a fingerprint database, a fixed length feature Code database and a condition code regular data when described document information scanning system has at least one random length condition code The condition code database in library.
The scanning pre-processing engine can further comprise a scanning conveyer, a projector, a character string word Segment memory and a shadow field Memory.The scanning pre-processing engine can be handled one or more character string by block Field, the processing include conveying, and decoding standardizes and converts one or more, one or more described character string Each string chunk in the block includes one for the finger scan region that finger scan and condition code are searched, and one is used for feature The preceding condition code before finger scan region that code is searched search region and one for condition code search in finger scan Rear condition code after region searches region.Trizonal each region of one string chunk can be stored to one or more In the identical memory block of a size, all trizonal all memory blocks individually or with one or more add-in memories blocks May make up one searches the ring that first memory block in region starts by current preceding condition code, to reduce data in memory It is mobile.
The finger scan engine can use one or more hash tables and one or more Bloom filters one of those Or multiple detect one or more fingerprints.The finger scan engine can further comprise a finger scan controller, one A fingerprint hash calculator, a fingerprint finder, a fingerprint synthesizer and a fingerprint database.The fingerprint hash Calculator can in order calculate multiple with a sequence hash function in the prefix segment of multiple hashed keys not overlapped each other Multiple hashed values of hashed key.The finger scan engine may include one fingerprint section bit map of a use and fingerprint length letter One or more of breath concurrently or sequentially synthesizes multiple fingerprint sections the matched fingerprint synthesizer of any fingerprint.It is described Finger scan engine may include a fingerprint synthesizer for further comprising one or more finite automatas (FA).
One or more fingerprints of one or more length can be broken down into the fingerprint section of multiple same sizes, and by one Or multiple finger scan engines for having same scan step-length scan, each fingerprint of one or more of finger scan engines is swept Retouch the position of the one or more nonoverlapping input String fields staggeredly of engine covering so that it is one or Total scanning step of multiple finger scan engines is equal to the number of one or more of finger scan engines and single fingerprint is swept It retouches the product of the former scanning step of engine or covers one or more partly overlapping input String fields staggeredly Position, so that total scanning step of one or more of finger scan engines is swept between the original of single finger scan engine Retouch the number of step-length and one or more of finger scan engines and the product of the former scanning step of single finger scan engine Between.The number of the smaller finger scan engine of the product of scanning step and memory speed can be more than scanning step and memory speed The larger finger scan engine of product number.
Covering one or more memories used in one or more finger scan engines of a shorter fingerprint section can wait In or be faster than covering one longer fingerprint section one or more finger scan engines used in one or more memories and cover One used in one or more fixed length condition code Lookup engines corresponding to the shorter fingerprint of the one or more average lengths of lid Or multiple memories can be equal or faster than one or more fixed length corresponding to the longer fingerprint of the one or more average lengths of covering One or more memories used in condition code Lookup engine.One or more covering is shorter than the finger scan of the fingerprint of specific length Engine can individually or together with the first part of corresponding fixed length condition code Lookup engine use one or more in the scanning Most fast memory in system.Scanning multiple finger scan engines of identical one or more fingerprints can share in a multiport It deposits.The finger scan engine can further comprise one or more content adressable memories (CAM).
The fixed length condition code Lookup engine can further comprise a condition code search engine, and a condition code verifies device, With a fixed length condition code database.Described document information search engine and described document information, which verify device, can use a condition code unit ratio Compared with the segment for the condition code that device and a condition code section comparator carry out the one or more tape code films of comparison, with identification one or Multiple fixed length condition codes.Described document information search engine can the one or more fixed length condition codes of difference lookup or feature subcode.It is described Random length condition code Lookup engine can further comprise a condition code composition rule finder, a condition code state verification Device, a condition code composition rule database and a condition code state table.The random length condition code Lookup engine can wrap Include a finite automata (FA).One or more engines may include one or more content adressable memories (CAM) and one Or multiple finite automatas (FA) one or more.
Specific embodiment described in this specification can be used to realize following one or more advantages.The present invention provides one to sweep Retouch the character string scanning system of the condition code in a condition code library.The character string scanning system is flexible and is easily updated. Even if a character string signature scan engine is when a large amount of condition code (such as hundreds of thousands condition code) of scanning, complicated condition code (as up to several kilobytes or with asterisk wildcard " * " and "", range, case-insensitive, logic NOT) and a dynamic feature During code library, the sweep speed (such as 100Gbps) being exceedingly fast is stilld provide.The sweep speed and condition code of the character string scanning system The size in library and complexity scalability.In addition, the smaller memory of the character string scanning system demand and memory bandwidth.It is described Character string scanning system can be implemented with software or field programmable gate array (FPGA) or application-specific integrated circuit (ASIC).This Outside, the cost-effectiveness of the character string scanning system is high, is not only suitable for high-end product, is also applied for low cost products.
One or more embodiments of the invention sees below the description and the appended drawings.Other features and advantages of the present invention by Specification, attached drawing and claims are apparent.
Description of the drawings
Figure 1A shows the structure chart of an exemplary quick character string signature scan system;
Figure 1B shows the exemplary process diagram of a structure character string condition code database;
Fig. 1 C show the exemplary process diagram of a character string signature scan;
Fig. 2A -2C show the example data structure of a fingerprint database;
Fig. 2 D-E show the example data structure of Hash-entry block and the embodiment of a fingerprint synthesizer;
Fig. 2 F-2G show the example data structure of Hash-entry block and the reality of a corresponding fingerprint synthesizer Apply example;
Fig. 2 H-I show the example data structure and a corresponding parallel fingerprint synthesizer of a Hash-entry block Embodiment;
Fig. 3 A-B show the example data of the feature code group chained list searched for fixed length condition code and condition code chained list Structure;
Fig. 4 A-B show that one of the predefined global unit range of support one of a String field is exemplary The structure chart of condition code unit comparator and a condition code section comparator;
Fig. 4 C show the structure chart of an example feature code unit comparator for supporting local condition code unit range;
Fig. 5 A-C show one for the selecting unit tree of fixed length condition code lookup and the example of condition code family chained list Property data structure;
Fig. 6 shows the example data structure of a condition code regulation linked for the lookup of random length condition code;
Fig. 7 shows the example data structure of the condition code state-chain-table of a specific character string field;
Fig. 8 shows an example by a condition code state Bloom filter or the Hash-entry block of hash table meaning Property data structure;
Fig. 9 shows an exemplary computer system.
The similar reference numeral element similar with mark expression in the drawings.
Specific embodiment
General survey
The present invention is the method being scanned for a character string condition code database to a String field and is System.In one embodiment, the scan method of one " dividing and rule " is used to be scanned with multiple flow line stages.Each Random length condition code is broken down into multiple fixed length feature subcodes to scan first, and each fixed length condition code or each random length are special Each fixed length feature subcode of sign code is further broken down into multiple condition code sections to scan.In one embodiment, it is first " thick The method of " close scanning " is used to be scanned with multiple pipeline sweep phase afterwards for scanning ", to search character string condition code.Every In one scan position, one or more fingerprints of fixed length condition code or fixed length feature subcode are scanned first.Further inspection Look into only need there are one or the position of multiple fingerprint matchings on carry out.In addition, finger scan can be first in each scan position Upper scanning fingerprint shadow (shadow and the relevant space of shadow will be described in detail below).Only there are one or it is multiple matched On the position of fingerprint shadow, just the fingerprint need to comprehensively be checked.
The fingerprint shadow further can be first segmented to be scanned on each of the scanning positions, then again by only checking fingerprint shadow Possible position and possible fingerprint length of the subsegment in any fingerprint synthesize fingerprint shadow section.Fingerprint shadow it is complete Face inspection only need there are one or multiple matched synthesis fingerprint shadow position on carry out.In addition, fingerprint shadow section Scanning its hashed value need to be only scanned in each scan position.It is further to check that fingerprint shadow section only have It is carried out on the position of one or more matched hashed values.In one embodiment, elder generation's handle is used to scan character before the scan The character string condition code pretreatment of string field, and pretreated condition code is stored into condition code database, to be supplied to The scanning of multiple pipeline stage uses.
Figure 1A shows a quick character string signature scan engine 100.The scanning engine includes a condition code Preprocessing module 90, a scanning pre-processing engine 120, a finger scan engine 140, a fixed length condition code Lookup engine 160 and a random length condition code Lookup engine 180.Quick character string signature scan engine 100 is for one or more words Symbol string condition code database, is scanned String field, and may send the number 190 of matched condition code back to and in institute The position in String field is stated, to identify special characteristic code.In one embodiment, condition code database includes a fingerprint Database 148, a fixed length condition code database 166 and a condition code rule database 186.
Figure 1B shows the program 91 of an each character string condition code of pretreatment.In one embodiment, first by one A random length condition code is decomposed into multiple fixed length feature subcodes and random length feature subcode, and the pass between fixed length feature subcode The information of system is stored to (step 92) in condition code rule database 186.
In order to quickly scan, step 92 export a fixed length condition code or feature subcode can be further broken down into it is more A segment (step 94) that can be examined by optimum ordered.In one embodiment, first or preceding several condition code sections are especially heavy Will, it can be as the fingerprint of fixed length condition code or feature subcode.The fingerprint of one character string condition code both can rapidly be swept It retouches, and false negative or the matched probability of false positive can be reduced.In one embodiment, the matched probability of false negative is zero. In one embodiment, the numbers of the different fingerprint length of multiple condition codes is than the number of the different characteristic code length of multiple condition codes Lack, so as to accelerate the speed for the scan method for requiring individually to scan the pattern of different length.In one embodiment, When scanning step is more than a basic unit, multiple fingerprints can be used for a character string condition code, wherein each fingerprint First basic unit displacement of relatively previous fingerprint is one or more basic in a scanning direction for first basic unit Unit.Scanning direction is the direction that a scan operation inputs the movement of scan position in String field at one.
The fingerprint of one fixed length condition code or feature subcode can be further broken into fingerprint section, as needed can be into one Step projects to one or more shadow spaces, is then added to fingerprint database 148 and thereby to (step in condition code database It is rapid 96).Since the fingerprint of different length may be by independent scan, fingerprint can further decompose into multiple fingerprint sections and carry out parallel Or serial scan.In some embodiments, fingerprint is broken down into multiple fingerprint sections, so that the number of different fingerprint segment length It is one or few more than the number of different fingerprint length, so as to accelerate the scanning for requiring individually to scan the pattern of different length The speed of method.The scanning result of fingerprint section can be then synthesized to together, to detect fingerprint.
In one embodiment, in order to further improve scan efficiency and scan complex characteristic code ability, fingerprint and its Its condition code section can first be projected to one or more shadow spaces and be scanned, and then be verified again in luv space. Shadow space can select that scanning process can either be simplified and speeded up, while can also cover being possible to for fingerprint or fingerprint section Form.One shadow space can cover multiple fingerprints or the form of condition code section.For example, in order to support to distinguish size simultaneously The single character with case-insensitive is write, shadow space can be only small letter or the only space of capitalization.As a spy Example, shadow space can be exactly luv space.
The fingerprint indicated completely or any other condition code section are still to indicate completely in the shadow of any shadow space.Cause This, the fingerprint indicated completely can always scan in addition in luv space in any shadow space.In one embodiment, it is The number in the space scanned needed for reducing, one all in one or more shadow spaces of all fingerprints indicated completely Shadow space scans so as to be scanned without fingerprint in luv space.In another embodiment, condition code database There are one shadow spaces, and all fingerprints of described document information database are all scanned in unique shadow space.
In one embodiment, fingerprint database includes one or more Bloom filters or hash table.When former hashed key It is too long or too expensive when being compared during finger scan, in order to be further reduced the general of false positive matching and Hash collision Rate, in another embodiment, fingerprint database has additional hashed value bit including one or more or a fingerprint is long The modified Bloom filter of degree or both or hash table.
Finally, the segment of the fixed length feature subcode of all fixed length condition codes or random length condition code is encoded and is stored in In fixed length condition code database 166, to search fixed length condition code or feature subcode (step 98).In one embodiment, it is described Segment can be encoded with unit membrane or subunit film, with support the matched specified conditions of character string condition code (as " and do not have to With ", " equal ", " unequal ", " case-insensitive ", " case sensitive ", " within the scope of one ", " in a range Outside ").In one embodiment, the segment of tape code film can be then compiled into a chained list or any other lookup structure (such as Tree etc.).In another embodiment, piece segment encode film or fixed length condition code or feature subcode code film can be compiled into lookup knot Structure, to save memory space.In another embodiment, one group of character string condition code can further utilize character string condition code Between different units carry out differential encoding, to form the differential data structure (such as difference tree) that can quickly search.
Fig. 1 C demonstrate the scanning imaging system 101 of a character string condition code.One String field to be scanned first have to by It decodes and is converted to (as used scanning pre-processing engine 120) required form (step of one or more follow up scan stages 102).The symbol string field passes through the shadow to its shadow and the fingerprint of one or more character string condition codes in shadow space first Son is compared and is scanned (as used finger scan engine 140), then again in original fingerprint space to any identified finger Schlieren son (step 104) is verified.Step 106 checks whether that there are one fingerprint matchings.
After the scanning, it no matching or generates one respectively there are one matching and represents the matched output of no condition code or one It is a to indicate the matched output of a few condition code.In one embodiment, finger scan engine 140 can provide zero false negative Matching and a sufficiently small false positive matching probability.If without fingerprint matching, the scanning of present scanning position has been completed, can To be moved to next scan position (step 108).If there is fingerprint matching, a few matched condition code will be made into one The lookup (as used fixed length condition code Lookup engine 160) of step, to be more clearly identified as fixed length condition code or random length feature The fixed length feature subcode (step 110) of code.
Step 112 checks whether that there are one fixed length condition codes or feature subcode to match.If do not matched, Current Scan position The scanning put has been completed, and can be moved into next scan position (step 108).If there is one or more fixed length condition codes Match, export the number of each matched fixed length condition code, and terminate the scanning (step 118) of present scanning position.If there is making The fixed length feature subcode of a part for one or more random length condition codes is identified, and matched feature subcode will be by dynamic Ground synthesis (as used random length condition code Lookup engine 180) is with the one or more random length condition code (steps 114) of detection.Step 116 check whether that there are one random length condition codes to match.If do not matched, the scanning of present scanning position has been completed, can be with It is moved to next scan position (step 108).If there is one or more matches, each matched random length condition code is exported Number, and terminate present scanning position scanning (step 118).
In the preprocessing process 91 of condition code, in one embodiment, the matched probability quilt of false positive of each fingerprint It is stored in fingerprint database 148.If step 106 has a fingerprint matching, the false positive of the matched fingerprint is matched general Rate will be checked.If the matched probability of false positive is sufficiently low (as less than the threshold value specified), present scanning position is swept It retouches process to have completed, next scan position (step 108) can be moved to.In another embodiment, in all databases Fingerprint the matched probability of false positive it is all sufficiently low, therefore be not required to store and check the matched probability of false positive of fingerprint.When The scanning process of preceding scan position has been completed after finger scan, can be moved to next scan position (step 108).
In a step 102, scanning pre-processing engine 120 first decodes the String field, standardizes, and be transformed to The form identical with the condition code in condition code database.In one embodiment, character string signature scan is in entire word It is carried out on symbol string field.However, in another embodiment, limitation or low latency due to the memory space of some systems will It asks, storing entire String field can not possibly.Therefore, in step 102, the String field can be broken down into several pre- Determine the string chunk of size.Character string signature scan is carried out on the character string block of each predefined size.
After a string chunk is loaded, the string chunk will be decoded, normalization, and sweep phase institute after being transformed to The different-format needed.In one embodiment, decoding and normalisation process can support different compressed format (such as LZS, PKZip and gzip), different coding standard (such as UU is encoded, MIME codings, HTML and XML) and deletion are random " counter-scanning " Junk data.
In one embodiment, decoded String field will further be projected to one or more features code data The shadow space of library requirement, to support complicated condition code.For example, decoded String field is converted into full small letter (a such as shadow space), to support the character string signature scan of case-insensitive.Character string signature scan can be first It has decoded in full small letter and has been carried out on String field, then again with the former case sensitive String field of decoding and full small letter String field has been decoded to be verified.
At step 104, finger scan can identify that its shadow is the fingerprint indicated completely first.It is a large amount of in order to quickly scan And complicated condition code, in one embodiment, finger scan engine 140 can use one or more hash tables or the grand filtering of cloth Device scans multiple basic units simultaneously.In one embodiment, finger scan engine 140 can be multiplexed and fingerprint length with hashed value The service efficiency and the probability of the matching of reduction false positive and fingerprint collision for being multiplexed to improve reservoir.Hashed value is multiplexed and fingerprint is long Degree multiplexing can further reduce false positive matching while zero false negative matching (missing a condition code matching) is ensured The probability of (i.e. wrong condition code matching).
Step 110 scans fixed length condition code.The fixed length feature subcode of fixed length condition code or random length condition code can be described The fixed length signature scan stage identifies.Fixed length signature scan only need to be when at least one fingerprint matching during finger scan Just perform.The fixed length condition code corresponding to matched fingerprint or feature subcode can be matched or one by one with other lookups Structure (such as setting) is matched.In one embodiment, the tape code film of basic unit or sub- basic unit can relatively be used for The matched specified conditions of support character string condition code (such as " do not have to matching ", " equal ", " unequal ", " case-insensitive ", " case sensitive ", " within the scope of one ", " outside a range ").
Step 114 scans random length condition code.In one embodiment, random length signature scan only need to have one in scanning It is just performed during the random length condition code of a or multiple random length characters.The fixed length feature subcode of random length condition code is in fixed length feature Code sweep phase is identified, and identified fixed length feature subcode can be dynamically connected to one in the random length signature scan stage It rises, to synthesize one or more former random length condition codes.The synthesis can be dynamic with a static monitor rule list and one State synthetic state table is realized.The composition rule table indicates the rule that fixed length feature subcode is synthesized to random length condition code, The synthetic state table then safeguards current synthetic state according to the composition rule table.
The pretreatment of condition code database
In one embodiment, it is special before the String field is scanned in order to improve sweep speed and memory efficient Sign code preprocessing module 90 pre-processes condition code.Before condition code database is stored in, condition code preprocessing module 90 First condition code can be decomposed, convert and be encoded into one or more forms.In one embodiment, condition code pretreatment mould Block 90 can build and safeguard a fingerprint database 148, a fixed length condition code database 166 and a condition code rule Database 186.
When condition code database there are one or multiple condition codes contain one or more random length subcodes and (such as represent more than zero " * " of arbitrary basic unit or represent " (bc) { 3-6 } " that " bc " repeats 3 to 6 times), each such random length condition code Can first multiple fixed length feature subcodes be broken down into random length feature subcode.If for example, a condition code is " subcode 1* Code 2* subcodes 3 ", wherein subcode 1, subcode 2 and subcode 3 are all free from the fixed length subcode of random length character, and described document information can be with It is broken down into subcode 1, subcode 2, subcode 3.Each " * " in subcode 1* subcode 2* subcodes 3 can be by a random length subcode Substitution.In one embodiment, each fixed length feature subcode can be first scanned independently, and then be synthesized original again Random length condition code.
In one embodiment, condition code rule database 186 can be with the location information of each fixed length feature subcode (as follows Sequence, last subcode mark, to the distance or distance range of next fixed length subcode) it builds, for synthesis fixed length feature Code.In one embodiment, it is described when the random length subcode between two continuous fixed length feature subcodes is not " do not have to matching " Condition code rule database 186 can be further containing the description to the random length subcode, for synthesizing fixed length feature subcode. In another embodiment, the fixed length feature subcode and the one or more finite automatas of random length feature subcode structure can be used (FA), for synthesizing fixed length feature subcode, wherein each fixed length feature subcode is as a generally incoming symbol.
In one embodiment, a fixed length condition code or feature subcode will be further broken down into multiple can use most The segment that good sequence is searched (these segments include the fingerprint of described document information or feature subcode).The multiple Section can have different sizes or identical size.False negative matches or does not miss any one condition code in order to prevent, owns The set of the segment is equal to original long condition code.During signature scan, with the increasing of the number of matched segment Add, false positive matching value will reduce (i.e. matched confidence level increases).The process of scanning is terminated at first unmatched Section or the last one matched segment.In one embodiment, the selection of the segment of condition code, which can allow, terminates or is known without matching The other matched condition code matching of zero false positive occurs as early as possible.
In one embodiment, the fingerprint includes multiple segments.In another embodiment, there are one the fingerprints Segment, and triple { segment, length, dislocation } is encoded as, wherein segment is first of character string condition code scanned The fingerprint of segment, that is, described document information, length are the length of the fingerprint, misplace as the fingerprint in a fixed length condition code or Dislocation in feature subcode.Particular fingerprint is the particular patch of the fixed length feature subcode of a fixed length condition code or random length condition code Section.
In one embodiment, the shadow space can be chosen simplify and accelerate the mistake of described document information scanning Journey, while the form of a variety of fingerprints or feature chip segment can be covered.In the ideal case, the phantom value can be direct As a hashed key.For example, in order to support case sensitive and case-insensitive unit simultaneously, the shadow space can With the space for being full small letter or capitalizing entirely.For example, in order to scan the drivers license number for being followed by 7 numbers by a letter and being formed, Wherein each letter and number can further specify that the range for arbitrary letter and number, and shadow space can use a code Or any one alphabetical (such as " a ") replaces owning instead of all letters, and with another code or any one number (such as " 0 ") Number.For example, in order to scan the Social Security Number formed by three groups by three bit digitals that space or "-" separate (SSN), wherein each number can further be designated as arbitrary digital scope, shadow space can use a code or any One number replaces all numbers, and replace space and "-" with another code or space or "-" (such as " 0 ").As a spy Example, the shadow space can be exactly former space.
In one embodiment, for fingerprint after shadow space has been scanned, the verification of fingerprint can detect fingerprint It is carried out at once in former space after shadow.In another embodiment, the verification can formerly verify the part of other segments Or it is carried out again after whole.If the fingerprint is covered completely by other segments, it is not required to verify.
The selection of fingerprint should accelerate the speed of the finger scan, while also provide minimum after finger scan The matched probability of false positive.In one embodiment, the fingerprint can be arbitrary dimension and appointing in described document information It anticipates on position.In another embodiment, in order to meet the requirement of system, the size of the fingerprint or in described document information Position is restricted.For example, in order to meet the delay requirement of system, the dislocation of the fingerprint is no more than certain particular value.
In one embodiment, fingerprint can select to next or multiple conditions:
1) shadow of the fingerprint does not have asterisk wildcard or range, with by compared with short scan,
2) the probability very little that the fingerprint occurs in the String field to be scanned,
3) number for the fingerprint that multiple condition codes share it is small as much as possible and
4) number there are one the fingerprint of identical fingerprint section is small as much as possible.
The condition of other selections can increase according to the requirement of system.In common network application and non-network application In, usually all or most numeric string condition code is all at least considerably long containing one section, is projecting to a selected shadow Without asterisk wildcard or the segment of range after in space.In one embodiment, first alternative condition is a necessary condition. In another embodiment, first alternative condition can be further limited to each fingerprint of all fingerprints at least in a shadow It is to indicate completely in subspace.Condition code without the fingerprint for meeting alternative condition, can be extended to it is multiple containing meet choosing The different scanning mode selected the condition code of the fingerprint of condition or condition code extension is not required to one scans.
The fingerprint can be selected by checking the segment of all condition codes for meeting first choice condition.It is other The parameter of fingerprint can also be contemplated when fingerprint is selected.According to the second alternative condition, select described to be scanned String field occur probability very little condition code section as fingerprint, it is possible to reduce the probability of false positive fingerprint matching.This Outside, the number of the fingerprint of identical fingerprint section is small as much as possible there are one allowing, and can be further reduced false positive fingerprint matching Probability.Although the length of the fingerprint can be less than 8 basic units or more than 32 basic units, it is that typically in 8 to 32 Between basic unit.
Due to described document information can very long (such as hundreds of or thousands of a basic units), the length of fingerprint may be also very much. However, in one embodiment, the fingerprint of different length will be scanned individually, so as to which sweep speed is slower.In one embodiment, In order to reduce the complexity of scanning, can be required to limit the number of fingerprint length with system architecture according to particular system, it is such as few In 16.In one embodiment, the length of fingerprint can be selected from a scheduled length table.In addition, the length of fingerprint can With according to system requirements and system architecture, by exponential increasing relationship (such as 2,4,8,16 and 32), linear increment relationship (such as 4 times Number:32) or another relationship (such as 2,3,5,8,13,21,34) selects 4,8,12,16,20,24,28 and.
In one embodiment, the fingerprint of a condition code can be selected with an algorithm.For example, following algorithm can be with For selecting the fingerprint of a fixed length condition code or feature subcode, (assuming that scanning step is equal to 1, the length of fingerprint is from being most as short as most Length is fixed as l0, l1, l2..., lm-1, lm, finger scan, which is segmented, to carry out, and the shadow space of finger scan has given):
1. it is the condition code subsegment indicated completely in shadow space to find all.
2. couple each l for being longer thanmThe subsegment, find all length equal to lmSubsegment.
3. pair each length is equal to lmSubsegment, write down the times N that the subsegment has been chosen as fingerprint in all condition codesc With the times N for having the first identical fingerprint section with other fingerprintssAnd it is based on N with onec, NsAnd lmCost function calculate into This value.
4. it is equal to l with length respectivelym-1..., l2, l1And l0Subsegment repeat step 2 and 3.
5. from step 2 to 4, the subsegment for finding out minimum cost is a fingerprint.
Above-mentioned steps are related with the processing sequence of condition code.Several random process sequences can be used for finding not as needed Same fingerprint.In one embodiment, value at cost is equal to (m-i), Nc, and NsIt is linked up from highest order to lowest order, wherein I=0,1,2 ..., m and i are the length of fingerprint.In one embodiment, it if having found the fingerprint of specific length, does not need to again All shorter subsegments are selected.
In another embodiment, each mobile scanning step of finger scan engine 140.In each scan position On, finger scan engine 140 serially or parallelly scans the fingerprint of different length.Therefore, sweep speed (connects with scanning step Continue the number of the basic unit between two scan positions) it is directly proportional.In one embodiment, in order to improve sweep speed, refer to Line scanning can scan multiple basic units rather than a basic unit simultaneously.In order to ensure zero false negative matches, each word Symbol string condition code is equal to the scanning step with the number of multiple fingerprints and the fingerprint.In other words, a fixed length condition code Or condition code database can be put into multiple fingerprints for feature subcode and the number of the fingerprint is equal to scanning step.Specific spy First basic unit for levying the j-th fingerprint of code is (J+k*S) of the special characteristic code in scanning direction a substantially single Member, wherein S be scanning step, k be a nonnegative integer and J=0,1,2 ..., S-1.Described document information can be described It finds any position in string segments to be scanned.For example, if special characteristic code is " [Rr] [Ee] [Aa] [Dd] [Mm] [Ee] 123.exe ", the scanning step are 4, and the fingerprint length includes 4,8 and 12, can select following four fingerprint: " [Rr] [Ee] [Aa] [Dd] [Mm] [Ee] 123.ex ", " [Ee] [Aa] [Dd] [Mm] [Ee] 123.exe ", " [Aa] [Dd] [Mm] [Ee] 123. " and " [Dd] [Mm] [Ee] 123.e ", wherein [Rr], [Ee], [Aa], [Dd] and [Mm] are respectively not differentiate between greatly English alphabet r, e, a, d and the m of small letter." [Rr] [Ee] [Aa] [Dd] [Mm] [Ee] 123.exe " is put into four fingerprints To fingerprint database four times.When scanning step is 1, only needs a fingerprint and need to only be put into primary.
The scanning of the multiple fingerprint can be from the preceding S unit of input String field (i.e. in the first scanning step It is interior) any position start.For example, in one embodiment, the scanning will scan (k*S) position since the 0th position It puts, wherein k is a nonnegative integer.In any input String field, (k*S) position is covered by the 0th fingerprint, the (k*S+1) position is covered by (S-1) a fingerprint, (k*S+2) position by (S-2) a fingerprint covering ..., with (k*S+ S-1) position is covered by the 1st fingerprint.In another embodiment, it scans since (S-1) position, (k*S+ will be scanned S-1) position, wherein k are a nonnegative integer.In any input String field, (k*S) position is by (S-1) a finger Line covers, and (k*S+1) position is covered by (S-2) a fingerprint, (k*S+2) position by (S-3) a fingerprint covering ..., (k*S+S-2) position is covered by the 1st fingerprint and (k*S+S-1) position is covered by the 0th fingerprint.
In order to which the scanning step is increased to S basic unit, in one embodiment, using scanning step of being taken in as 1 When be that each fixed length condition code or feature subcode select the algorithm of a fingerprint that can make following modification, it is each to be used for selecting The S fingerprint of fixed length condition code or feature subcode:Step 1 to 4 and in the past it is just the same, step 5 be then revised as from own The dislocation in scanning direction of fixed length condition code find in step 2, described or feature subcode for (J+k*S) subsegment in, The subsegment for finding out minimum cost value is the j-th fingerprint in the S fingerprint, and wherein J=0,1,2 ..., S-1 and k are one Nonnegative integer.
Only there are one condition codes for usually each character string object, and to support the scanning step of S basic units, each fixed length is special The fixed length feature subcode of sign code or random length condition code needs S fingerprint.In another embodiment, to support S basic units Scanning step, S fixed length condition code can be used for identifying a character string object, so that each fixed length condition code is being swept It is (J+k*S) that first basic unit in direction, which is retouched, in the dislocation of the character string object, wherein J=0,1,2 ..., (S-1) It is a nonnegative integer with k.The specific character string object can be on any position of any String field to be scanned It is identified.In one embodiment, the scanning of S fixed length condition code can not have to fingerprint.In another embodiment, determine for S Each condition code in long condition code can carry out signature scan with a fingerprint.
In one embodiment, multiple random length condition codes based on multigroup nonoverlapping orderly fixed length condition code can be with Further it is selected to one character string object of identification.Every group of fixed length condition code has S fixed length condition code, and wherein j-th is determined Long condition code scanning direction first basic unit the character string object dislocation for (J+k*S), wherein J=0,1, 2 ..., (S-1) and k be a nonnegative integer, so as to form SnA random length condition code, wherein n are that nonoverlapping orderly have The group number of the code character of S fixed length condition code.A fixed length condition code is selected from every group of fixed length condition code, then again by these Fixed length condition code and one or more random length string segments are combined into SnA random length spy in a random length condition code Code is levied, so that the specific character string object can be identified on any position of any String field to be scanned. Each original long condition code in the n groups S fixed length condition codes becomes the random length condition code after multiple synthesis in post synthesis One fixed length feature subcode.In one embodiment, the scanning of the random length condition code can not have to fingerprint.In another reality It applies in example, each fixed length condition code in the random length condition code can select a fingerprint to scan.
In one embodiment, the scanning step for support S basic units, specific character string object can select P to determine Long condition code, each fixed length condition code can further select one or more fingerprints, so that each character string object is total Fingerprint number is equal to S.J-th fingerprint in S fingerprint of the specific character string object is substantially single at first of scanning direction Dislocation of the member in the specific character string object is (J+k*S), wherein J=0,1,2 ..., (S-1) and k be one non-negative whole Number, so that the character string object can be identified on any one position of any String field to be scanned.
In another embodiment, multiple random length condition codes based on multigroup nonoverlapping orderly fixed length condition code can To be further selected to one character string object of identification.I-th group is not overlapped and orderly fixed length condition code has PiA fixed length Condition code, wherein each fixed length condition code further there are one or multiple fingerprints so that total finger of i-th group of fixed length condition code Line number is not to be overlapped and the group number of orderly fixed length feature code character equal to S, wherein i=0,1,2 ..., n-1 and n.Every group of fixed length J-th fingerprint in S fingerprint of condition code scanning direction first basic unit in the specific character string object Dislocation for (J+k*S), wherein J=0,1,2 ..., (S-1) and k be a nonnegative integer.
A fixed length condition code is selected from every group of fixed length condition code, then again by these fixed length condition codes and one or Multiple random length character string combinations are into P0*P1*P2…*Pn-1A random length condition code in a random length condition code, so as to make Obtaining the specific character string object can be identified on any position of any string segments to be scanned.The n groups fixed length is special Each original long condition code in sign code becomes a fixed length feature subcode of the random length condition code after multiple synthesis.
In the condition code of each character string object of a scanning system, after fingerprint and shadow space determine, a finger The shadow of line can be scanned as an entirety.The fingerprint shadow of different length can be scanned concurrently or sequentially.At one In embodiment, the shadow of the fingerprint of different length can be used as an entirety to carry out serial scan.In one embodiment, one A character string condition code is incorporated into described document information database and can be used to lower pseudocode:
Wherein S is scanning step, hiIt is the length of the fingerprint of i-th of character string condition code, IV is original Hash value and hash () is a sequence hash function.Fingerprint selects the best fingerprint that () selects for each shift position, and condition code is incorporated into () spy Sign code is programmed into the database of the fixed length condition code Lookup engine 160 and random length condition code Lookup engine 180 and refers to Line is incorporated into () and fingerprint is programmed into the fingerprint database 148.When scanning step is more than 1, the fingerprint of each condition code Fingerprint will be called to be incorporated into () primary.But because all fingerprints of a condition code all point to same condition code and search Data structure, as long as all fingerprints of a condition code call altogether condition code to be incorporated into () once.
In one embodiment, following pseudo- generation can be used by deleting a character string condition code from described document information database Code:
Wherein S is scanning step, hiIt is the length of the fingerprint of i-th of character string condition code, IV is original Hash value, hash () is a sequence hash function, and fingerprint deletion () deletes a fingerprint from the fingerprint database 148 and condition code is deleted Except () by the condition code searching data structure of described document information from the fixed length condition code Lookup engine 160 and random length feature It is deleted in the database of code Lookup engine 180.It, be from spy because a condition code has multiple fingerprints when scanning step is more than 1 It levies and a condition code is deleted in code database, fingerprint deletes () and to be called repeatedly.But all fingerprints of special characteristic code It is primary that as long as condition code is called to delete () altogether.
Because number of the number of different fingerprint segment length usually more than different fingerprint length is few, in one embodiment, In order to improve scan efficiency, a fingerprint can be further broken down into multiple fingerprint sections.All fingerprint sections of one fingerprint can First concurrently or sequentially to scan, the scanning result of the fingerprint section concurrently or sequentially is synthesized again, to detect fingerprint.It is described The length of fingerprint section can be identical or different, the length depending on the fingerprint that specific scanning engine is supported.
In one embodiment, the number of fingerprint section and length can be according to its of the length of fingerprint and specific scanning engine Depending on its sweep parameter.In another embodiment, between fingerprint length for linear relationship and all fingerprint sections length all It is identical.In general, the length of a fingerprint section is equal to one or more scanning steps.In another embodiment, scanning step is The integral multiple of the length of one fingerprint section and multiple fingerprint sections will be synthesized concurrently.
In one embodiment, fingerprint database can include one or more Bloom filters and one or more hash One or more of table.The size and bandwidth of memory should be weighed usually using Bloom filter or hash table.Work as condition code Number can all be stored in the on-chip memory of sufficiently large bandwidth, Bloom filter or may be more even more than hash table It is good.On the contrary, the yardage when condition code is more, thus when must use chip external memory, the bandwidth of memory just becomes main limitation, The hash table rather than the Bloom filter may be more preferable.
In one embodiment, the added bit of hashed value is stored in a Bloom filter or a hash table, To realize that hashed value is multiplexed.In another embodiment, fingerprint length is stored in a Bloom filter or a hash table In, to realize that fingerprint length is multiplexed.When former hashed key is too long or too expensive when being compared in scanning fingerprint, hashed value is multiplexed And the multiplexing of fingerprint length can further reduce the probability of false positive matching and fingerprint collision.
In one embodiment, work as shadow space, fingerprint, after fingerprint section and finger print data structure are determined to, one fixed The fixed length feature subcode of long condition code or random length condition code can be used as an entirety or segmentation to be programmed into shadow space In the fingerprint database 148 and condition code database.
In one embodiment, the fingerprint of a fixed length feature subcode of a fixed length condition code or random length condition code with Outer other segments can be encoded and stored fixed length condition code database 166, entire to be used for scanning after fingerprint matching Fixed length condition code or feature subcode.In one embodiment, feature of a fixed length condition code or random length condition code One or more segments of code can be encoded into multiple tape codes with the code film of one or more basic units or sub- basic unit The segment of film, to support the matched specified conditions of character string condition code (such as " not having to matching ", " equal ", " unequal ", " not area Divide capital and small letter ", " case sensitive ", " within the scope of one ", " outside a range ").In one embodiment, in order to improve Storage efficiency, one or more features chip segment encode film or one or more features code or feature subcode code film can individually make It is used together with or with one or more basic units or subunit this element number film.
In one embodiment, the condition code section of tape code film can in order or out of order be linked together with chained list, Or pretreatment (is such as set) into other lookup structures.In one embodiment, the length of the condition code section of all tape code films can be complete It is equal or not congruent.The length of the condition code section of the tape code film can most preferably be selected according to the framework of particular memory.
In another embodiment, it is allowed in order to which the false positive matching of character string condition code is allowed to converge to zero-sum as quickly as possible Possible matched condition code converges to zero or one as quickly as possible, and one group of character string condition code can further carry out difference Coding, to build the differential data structure (such as difference tree) that can quickly search.Difference tree can utilize the difference between condition code Unit is built.
In one embodiment, when the one or more new condition codes of addition or as the one or more existing spies of deletion When levying code, condition code database preprocessing can only be carried out in the initial establishment of condition code database.In another embodiment, The database of condition code can dynamically update in the scanning process of condition code.
The scanning pretreatment of String field
The scanning pre-processing engine 120 is different String field pretreatment according to character string condition code database Form, to simplify and speed up scanning flow line stage thereafter.In one embodiment, the condition code in condition code database It is stored with uncoded form.Therefore, scanning pre-processing engine 120 decodes encoded String field, with the spy The codec format for levying the condition code in code database is consistent.As shown in Figure 1A, it is defeated to include a scanning for scanning pre-processing engine 120 Send device 122, a String field memory 124, a format decoder 126, a decoding field Memory 128, one Projector 130 and a shadow field Memory 132.The scanning conveyer 122 is first by data to be scanned from the character string Field Memory 124 is loaded into the format decoder 126, and the format decoder 126 again will be decoded original field, Parsing and decompression, wherein may include that MIME is decoded, the decoding of UU, foreign language decoding removes and includes skimble-skamble character string (such as Additional white space) and counter-scanning garbage character string data (such as the counter-scanning junk data of injection), HTML parsings, XML solutions Analysis, deflate decompressions, LZS decompressions, PKZip decompressions and gzip decompression.The format decoder 126 can be with root Normalized character string field is needed according to particular system.Hereafter, the format decoder 126 is by the character after decoding and normalization In string field deposit decoding field Memory 128.
Decoded field is projected to one or more shadow spaces, and shadow data are stored by the projector 130 Into shadow field Memory 132.For example, a condition code database can include the condition code of case-insensitive.For Support the condition code of case-insensitive, the projector 130 helps the data conversion decoded in field Memory 128 small It writes, and in the storage to shadow field Memory 132 of full lowercase character string field.Full lowercase character string field can be swept by fingerprint Engine 140 is retouched to be used for carrying out finger scan.Case sensitive condition code and the condition code of case-insensitive can be simultaneously It is scanned.The matching of case sensitive condition code can be after the matching of its shadow, then with case sensitive decoded Field is verified.
In one embodiment, quick character string signature scan engine 100 includes to support to entire String field Carry out the computing resource or the network equipment of character string signature scan.However, in another embodiment, quick character string feature Code scanning engine 100 includes that the computing resource or the network equipment of entire String field can not be stored, due to for example by system Memory space limits and low time delay requirement.Therefore, the String field can be broken down into the character string of multiple predefined sizes Block.The scanning of character string condition code carries out on the string chunk of each predefined size.
In one embodiment, the size of a string chunk depends on longest character string condition code.The string chunk It can be further separated into three regions:One finger scan region is to cover all signature scan positions, in fingerprint before one The region of scanning area is to provide the reference data before or during fingerprint, and one after the region in finger scan region to provide After fingerprint or among reference data.The set in all finger scan regions of specific character string field is covered described All possible fingerprint initial position in String field.The trizonal size can be identical or different.At one In embodiment, the preceding minimum dimension in the region in finger scan region is characterized the maximum fingerprint of all condition codes in yard database Dislocation, the minimum dimension after the region in finger scan region are characterized the length of all condition codes and dislocation in yard database Maximum difference, the minimum dimension in finger scan region is scanning step.
In one embodiment, it is described scanning string chunk size can according to the parameter of the scanning system, for example, The length of longest condition code, internal storage structure and sweep speed select.When longest condition code is longer, character string is scanned Block can be several times of the length of longest condition code, for example, 2 to 4 times.When longest condition code is shorter, string chunk is scanned It can be the more times of the length of longest condition code.
In one embodiment, the trizonal size for scanning string chunk is identical, is equal to longest condition code Length.Three regions can be stored in the ring formed with three pieces of memories, to reduce the movement of data in memory.Another Described trizonal of different sizes in one embodiment, finger scan region is smaller than other two regions, other two regions Size can be identical or different.
In one embodiment, trizonal each region of a scanning block can be stored in one or more In a an equal amount of memory block, all memory blocks in all three regions form a past in the region in finger scan region First memory BOB(beginning of block), to the ring of the last one memory block end after the region in finger scan region, to reduce number According to movement in memory.After first memory block in finger scan region is scanned through, first memory block of the ring will The ring is exited, can be used for loading new data.After new data are loaded, the memory block will be used as the last one memory block Add in the ring.In one embodiment, one or more add-in memories blocks can add in the ring tail of the ring, new for loading Data.
In one embodiment, the String field memory 124 only includes the last one memory block.Therefore, it is described The size of String field memory 124 is equal to the size of a memory block.The decoding field Memory 128 and the shadow Field Memory 132 includes all memory blocks in all three regions, and size is the sum of all memory block sizes.
Due to boundary condition, described first piece there are specific conditions with last block.In one embodiment, length is most " impossible " basic unit of the length of long condition code can be filled into the preceding reference zone in the finger scan region Before first piece and after last block of the rear reference zone in the finger scan region." impossible " basic unit is included not There is the basic unit that condition code is started or ended up with it, therefore, the region of the filling can not possibly be as the feature of a reality A part for code.If finger scan engine 140, fixed length condition code Lookup engine 160 and random length condition code Lookup engine Have scanning range checking mechanism in 180, the filling there is no need to.Scanning range checking mechanism can prevent scanning beyond character The boundary of string field.
Finger scan
In one embodiment, the finger scan engine 140 can include one or more content adressable memories (CAM) and one or more of one or more finite automata (FA).In another embodiment, when original fingerprint or refer to For line when the shadow of a shadow space is to indicate completely, the finger scan engine 140 may be based on the engine of hash.Such as figure Shown in 1A, in one embodiment, the finger scan engine 140 includes a finger scan controller 142, and a fingerprint dissipates Column counter 144, a fingerprint finder 146, a fingerprint database 148 and a fingerprint synthesizer 150.In a reality It applies in example, a fingerprint is integral to scan, and therefore, the fingerprint synthesizer 150 is optional component.
In another embodiment, each fingerprint is broken down into multistage by independent scan.All fingerprint sections can first quilt Then scanning synthesizes (as using fingerprint synthesizer 150) to generate the scanning result of fingerprint by serial or parallel again.In a reality It applies in example, the finger scan controller 142 controls entire scanning process.
140 exportable one, the finger scan engine without occurrence or with a matched knot in fingerprint database Fruit.The occurrence, which corresponds to one or more, then to be looked by fixed length condition code Lookup engine 160 and random length condition code The character string condition code that engine 180 is looked for search.If without occurrence, scanning process is just completed.
In one embodiment, the fingerprint hash calculator 144 includes multiple independent universal hash functions, h0, h1..., hI, to support Bloom filter.For example, when memory size rather than memory Bandwidth-Constrained when, the grand mistake of cloth can be used Filter.For example, the condition code database when scanning system is sufficiently small, when can all be stored in on-chip memory, memory it is big It is small to be restricted.In another embodiment, the fingerprint hash calculator 144 includes multiple independent universal hash functions, h0, h1..., hI, to support multiple hash tables.For example, when false positive matching needs very low (such as 10-3Or smaller), the bandwidth of memory When sufficiently wide and size is sufficiently large, multiple hash tables can be used.For example, when the slower chip external memories of DRAM or other are used for The data structure in follow up scan stage is stored, and when on-chip memory stores multiple hash tables enough, it is desirable to which false positive matching is non- It is often low.
In another embodiment, the fingerprint hash calculator 144 only includes a hash function h0.For example, when interior When the bandwidth deposited rather than the limited size of memory, a hash table can be used.For example, when the feature yardage of scanning system It is sufficiently large according to library, thus when must use chip external memory, the Bandwidth-Constrained system of memory.
The fingerprint hash calculator 144 extracts the data of n byte from the shadow field Memory 132, and calculates Its hashed value.The data can be used for calculating working as all hash functions individually or together with random starting values or preceding hashed value Preceding hashed value.In one embodiment, one of hash function is (for example, first hash function h0) hashed value bit Digit is more than the number of bits of the hashed value of other hash functions, and the number of bits phase of the hashed value of other hash functions Together.
The hashed value of hash function output can be used for searching fingerprint database 148 by fingerprint finder 146.In a reality It applies in example, the fingerprint finder 146 includes a Bloom filter and a hashed value demultiplexer.The Bloom filter Check the effective marker pointed by the hashed value.If all effective markers are all effective, the further root of hashed value demultiplexer A corresponding Hash block is found out according to the additional bit of hashed value.The hashed value demultiplexing is exactly to check particular Hash value Additional bit, hashed value demultiplexing can individually carry out or with check fingerprint length and it is other with the relevant information of fingerprint simultaneously into Row.Hash demultiplexing can further reduce the matched result of false positive condition code.In one embodiment, the grand filtering of the cloth Device can be condensed to a hash table.In another embodiment, the Bloom filter can expand as multiple hash tables.
The fingerprint of different length can be by parallel scan or serial scan.In one embodiment, multiple fingerprint finders 146 can be used for carrying out parallel scan simultaneously.For example, a fingerprint finder 146 can be used for scanning each different length Fingerprint.
In another embodiment, a fingerprint finder 146 can carry out serial scan to the fingerprint of different length.Example Such as, the length of fingerprint can be the integral multiple of scanning step.Each quick character string signature scan engine 100 is supported one group and is referred to Line length, { S, 2*S, 3*S ..., m*S }, wherein S is scanning step and m*S is the length of longest fingerprint.In each scan position On, the fingerprint of different length is by serial scan.The sequence hash function of one input S basic unit can be used for described Scanning.In one embodiment, the serial finger scan of the finger scan engine 140 can be described with following pseudocode:
Wherein S is scanning step, and m*S is longest fingerprint length, and t is the total length of String field to be scanned, IV It is original Hash value and hash function () is a sequence hash function.Fingerprint searches () with fingerprint finder 146 and fingerprint Synthesizer 150 realizes that condition code searches () with fixed length condition code Lookup engine 160 and random length condition code Lookup engine 180 To realize.
Fig. 2A-C demonstrate a fingerprint database 148 when the fingerprint of different length is scanned as an entirety Data structure, including a Bloom filter table 200, a fingerprint hash block chained list 250 and a Hash-entry block 256 pieces.The one group of hashed value given with fingerprint hash calculator 144, { hashed value0A, hashed value1..., hashed valueI, fingerprint Finder 146 goes to read the Bloom filter table 200.Each of the Bloom filter table 200 includes an effective marker 202 With a Hash block chain table pointer 204.Effective marker 202 is that there are the marks set during at least one condition code when input.
In one embodiment, the only hashed value of Hash block chain table pointer 2040AMeaning, and it is directed toward fingerprint hash block chained list 250 first term.If the effective marker 202 of all hashed value meanings is all set, further finger scan can be in Hash block chain It is carried out on the signified fingerprint hash block chained list 250 of list index 204.In another embodiment, Hash block chain table pointer 204 is equal to Hashed value0A, can be omitted, to reduce Bloom filter table or hash table, cost is to increase the data of sweep phase thereafter Structure.
In one embodiment, hashed value0AFingerprint is used to be added in Bloom filter table 200, and it is other scattered Train value can be used to reduce false positive matching.In addition, in order not to accidentally delete character string condition code, each single item of Bloom filter table 200 The number of its condition code is tracked with a counter.In another embodiment, the Bloom filter table 200 can be condensed to One hash table only needs a hashed value (such as hashed value0A).In another embodiment, the Bloom filter table 200 can be with Expand as multiple hash tables.
In one embodiment, the fingerprint hash block chained list 250 is a chained list.Fingerprint hash block chained list 250 it is every One includes a last block 252, a next block pointer 254 and a Hash-entry block 256.Next block pointer 254 is directed toward institute State the next item down of fingerprint hash block chained list 250.When one of the fingerprint hash block chained list 250 being tail item, last block 252 will be by Setting.Because tail item can also be known by checking whether next block pointer 254 is null pointer, last block 252 be one be used for it is fast Speed detects the optional domain of tail item.Each Hash-entry block 256 includes most n fingerprints, wherein n be it is any be more than zero it is whole Number.In one embodiment, a best n can be selected according to the memory architecture of scanning system.For example, if interior save as SRAM, n can be equal to 1;If inside saving as DRAM, n>1.
In one embodiment, as shown in Figure 2 C, each fingerprint item of the Hash-entry block 256 includes a hashed value (hashed value0B) 260, a fingerprint length 262, a type 264, a feature code group pointer 266 or condition code pointer 268, With a dislocation 270.Hashed value0B260 and fingerprint length 262 be respectively intended to realize hashed value multiplexing and fingerprint length multiplexing.When To be that a feature code group exports a feature code group pointer 266 when type 264 is zero;Otherwise, will be that a condition code exports One condition code pointer 268.Dislocation 270 for fingerprint first basic unit to next subsegment to be compared first base The dislocation of this unit.If without next data structure, it is just not required to dislocation 270.When being not required to dislocation 270, dislocation 270 can be with It is set as zero.In another embodiment, work as n>When 1, each fingerprint item can increase an effective marker.There is criterion when described When will is set, hashed value may compare0B260 and fingerprint length 262.
In one embodiment, each fingerprint item of Hash-entry block 256 stores a fingerprint.However, in another reality It applies in example, because fingerprint can be very long, size is again different, and because has the lookup of other feature code after finger print data library lookup Stage, original fingerprint can not be stored in each fingerprint item.Therefore, each matched fingerprint item may be due to false positive Matching is corresponding to zero fingerprint or since fingerprint collision corresponds to multiple fingerprints.When with a simple hash table rather than one A Bloom filter table scans fingerprint and (k/2m)<<When 1, the order of magnitude of the matched probability of false positive is (k/2m) and fingerprint touch The order of magnitude of the probability hit is (k2/22m), it in total fingerprint number and m is to include hashed value that wherein k, which is,0AAnd hashed value0BHashed value0 Total bit.
When with a Bloom filter table, two above probability will decline to a great extent.When with multiple hash tables, described two A probability will be further compromised.It is searched to reduce rear phase characteristic code to the greatest extent, the probability of false positive matching and fingerprint collision can To reduce to close to zero.In one embodiment, the number of sufficiently large m and hash function can be used for reducing false positive matching With the probability of fingerprint collision.
In order to reduce memory space, in one embodiment, multiple hashed values can be multiplexed into a fingerprint item, to reduce The probability of null term.Hashed value m each can be divided into two parts:The hashed value of m11The hashed value of (m-m1) position2.It dissipates Train value1It is used as the address of fingerprint database, and hashed value2(such as hashed value0B260) it can be used to solve Hash collision and vacation sun Property matching.M1 is smaller, and required memory space is fewer, but fingerprint hash block chained list 250 is longer.In one embodiment, when 2m1When being more than twice or more times of k, the average length of fingerprint hash block list is less than 1.
In order to save the complexity of the management of memory space and the reduction table, in another embodiment, all differences The fingerprint of length can be multiplexed into a fingerprint database.Fingerprint length 262 can be further reduced false positive matching and fingerprint The probability of collision.
Searching data structure as shown in figures 2 a-c can be implemented with several different modes.Particular implementation can be according to spy Levy the size of code table, the size and type of free memory, depending on piece SRAM, on piece DRAM, the outer SRAM of piece and the outer DRAM of piece. For example, in one embodiment, if feature number of codes is 128K, the effective marker 202 can be stored in an on piece SRAM is to realize quick access, and Hash block chain table pointer 204 is then stored in the outer SRAM of a piece.Effective marker 202 can be used All hashed values access, and Hash block chain table pointer 204 can only use hashed value0AIt accesses.Only when the effective marker of all hashed values 202 all for it is effective when, need to just access Hash block chain table pointer 204.Last block 252 and next block pointer 254 can be stored in one In the outer SRAM of piece, hashed value0B260 and fingerprint length 262 can be stored in the outer SRAM or DRAM of a piece, type 264, feature Code group pointer 266 or condition code pointer 268 and dislocation 270 can be stored in the outer DRAM of a piece.Only work as hashed value0B260 Hes When fingerprint length 262 matches, access type 264, feature code group pointer 266 or condition code pointer 268 and dislocation 270 are just needed.
In one embodiment, in order to improve scan efficiency, the fingerprint of a character string condition code can resolve into multiple Fingerprint section.First these fingerprint sections can concurrently or sequentially be scanned, then use fingerprint synthesizer 150 again by matched fingerprint Section synthesizes the scanning result of fingerprint.Because the number of different fingerprint segment length is generally much smaller than the number of different fingerprint length, Fingerprint is divided into fingerprint section can accelerate finger scan to scan.Fingerprint is divided into fingerprint section to scan, makes support is longer to contain The fingerprint of more fingerprint sections is possibly realized, so as to reduce false positive matching.
In one embodiment, the synthesis of fingerprint section is accurate or complete, non-false positive matching.In another implementation In example, in order to accelerate finger scan, the synthesis of fingerprint section is " coarse " or partial, there is false positive matching.In order to reduce vacation The information of the probability of positive match, the possible position of one or more of each fingerprint section and the possibility length of fingerprint can be stored up It deposits and be used to fingerprint section concurrently or sequentially synthesizing any fingerprint matching.
When multiple fingerprint section serial scans, Hash-entry block fingerprint corresponding with its of " at least one fingerprint matching " closes The different implementations grown up to be a useful person are for example shown in Fig. 2 D-E.In one embodiment, it is reported as long as at least one fingerprint matching One matching, but the information of the length in relation to how many fingerprint matching and matched fingerprint cannot be provided.
As shown in Figure 2 D, in one embodiment, each of Hash-entry block 257 includes a hash of particular fingerprint section Value0B260, a fingerprint section Figure 27 2, an at least fingerprint composite signal 274, a type 264, a feature code group refer to Needle 266 or condition code pointer 268 and a dislocation 270.Fingerprint section Figure 27 2 is an effective marker bitmap array, bit Digit is identical with the number of fingerprint section that the fingerprint synthesizer 151 is supported.When i-th of fingerprint of Duan Weiyi fingerprint of fingerprint Section, the i-th bit of fingerprint section Figure 27 2 will be set.It is all in all fingerprints that fingerprint section Figure 27 2 provides the fingerprint section Possible position.At least a fingerprint composite signal 274 provides the number of minimum steps of the fingerprint containing the fingerprint section, at least one The matched fingerprint synthesis of fingerprint section.In one embodiment, an at least fingerprint composite signal 274 is stored in the of the fingerprint In one fingerprint section.In another embodiment, an at least fingerprint composite signal 274 or any other fingerprint length information are deposited Storage is in each fingerprint section of the fingerprint.In another embodiment, at least a fingerprint composite signal 274 is omitted.
In one embodiment, hashed value0BHashed value in 260 and Fig. 2 C0B260 is identical.Type 264, feature code group refer to Needle 266 or condition code pointer 268 and dislocation 270 are also identical with the respective value in Fig. 2 C, are stored in first fingerprint of fingerprint Duan Zhong, and used after " fingerprint hop count subtracts 1 " a clock cycle is postponed.In one embodiment, due to type 264, condition code Group's pointer 266 or condition code pointer 268 and dislocation 270 are only stored in first fingerprint section of fingerprint, all first fingerprints The fingerprint of Duan Xiangtong is stored in together.In one embodiment, in the selection course of fingerprint, to allow has identical first The probability of multiple fingerprints of fingerprint section is minimum.
Fig. 2 E demonstrate one when at least a fingerprint composite signal 274 is only stored in first fingerprint section of each fingerprint One embodiment of fingerprint synthesizer 151.In one embodiment, the size of fingerprint section is selected as identical with scanning step.For example, The size and scanning step of fingerprint section are all 4, so that the length of fingerprint is 4,8,12 and 16.The fingerprint synthesizer 151 Including 12 d type flip flops, 280,32 inputs and door 282 and 14 multiple selector 284 inputted.Fingerprint synthesizer 151 Input be 2 and at least one fingerprint composite signals 274 of fingerprint section Figure 27.If finding a fingerprint, fingerprint synthesizer 151 will One matching 290 of output.The matching 290 is only when first fingerprint section of a synthesis fingerprint is first finger of a fingerprint It is effective during line section.
It in one embodiment, can be 5 defeated by one in order to verify the multiple selector 284 of 290,4 input of matching The multiple selector substitution entered, wherein the input terminal newly added is connected to zero and is chosen when fingerprint section is not first fingerprint section. In one embodiment, by increasing logic gate in the different time delay stages, fingerprint synthesizer 151 can be extended, to support one The fingerprint length information of the fingerprint section in addition to the first fingerprint section of a fingerprint.In another embodiment, when an at least fingerprint When composite signal 274 is not stored in any fingerprint section, fingerprint synthesizer 151 can be by deleting multiple selector 284 Simplify with the logic gates of all shortest fingerprints being not used in condition code database.In another embodiment, fingerprint Synthesizer 151 can be easily revised as with other scanning steps, the number of fingerprint section and fingerprint length.In an implementation In example, because multiple fingerprint matchings are merged into the matching of one " at least one fingerprint matching ", fingerprint synthesizer 151 can be Each clock cycle exports the result of a finger scan.
When multiple fingerprint section serial scans, the Hash-entry block for providing all matched " all fingerprint matchings " is right with its Another implementation for the fingerprint synthesizer answered is for example shown in Fig. 2 F-G.When the fingerprint of multiple and different length is as a condition code During with being matched, it may be necessary to which the stage is scanned after one or more.
As shown in Figure 2 F, each single item of the Hash-entry block 258 includes a hashed value of particular fingerprint section0B260, one A fingerprint section Figure 27 2, all fingerprint composite signal 276, a type 264, a feature code group pointer 266 or feature Code pointer 268 and a dislocation 270.All fingerprint composite signals 276 are all matched fingerprint composite signals, digit and The number of fingerprint section that fingerprint synthesizer 152 is supported is identical.When a fingerprint section is have i fingerprint section one section of fingerprint When, the i-th bit of all fingerprint composite signals 276 will be set.Corresponding domain phase in the other domains and Fig. 2 D of Hash-entry block 258 Together.In one embodiment, all fingerprint composite signals 276 are merely stored in first fingerprint section of fingerprint.Another In a embodiment, all fingerprint composite signals 276 or any other fingerprint length information are stored in each of fingerprint In a fingerprint section.In another embodiment, all fingerprint composite signals 276 are omitted.
Fig. 2 G demonstrate one when all fingerprint composite signals 276 are merely stored in first fingerprint section of each fingerprint There is one embodiment of the fingerprint synthesizer 152 of multiple fingerprint length.The fingerprint synthesizer 152 scanning identical with Fig. 2 E Step-length, fingerprint section size and fingerprint length.The fingerprint synthesizer 152 includes 16 d type flip flops, 280,52 inputs and door 282 and 13 input with door 286.Described one matching 0292 of the output of fingerprint synthesizer 152, a matching 1294, one With 2296 and one matching 3298, respectively with one, two, the matching of the fingerprint of three and four fingerprint segment lengths.
In one embodiment, by increase in different time delay stage logic gate (as with door) or increase existing and door Input, the fingerprint synthesizer 152 can be extended, to support the attached of the fingerprint section in addition to the first fingerprint section of a fingerprint Adding fingerprint length information.In another embodiment, when all fingerprint composite signals 276 are not stored in any fingerprint section When, it can be deleted for input of all fingerprint composite signals 276 with door and all logic gates related with fingerprint length, With the simplification fingerprint synthesizer 152.In one embodiment, the Hash-entry block 258 in Fig. 2 F can be extended, including more Set type 264, feature code group pointer 266 or condition code pointer 268 and dislocation 270;Every group of domain is used for each fingerprint length.This Outside, in one embodiment, the information of the exact length in relation to fingerprint can also be used for the follow up scan stage.
Fingerprint synthesizer 152 can scan the fingerprint of all different lengths simultaneously.But due to type 264, feature code group Pointer 266 or condition code pointer 268 and dislocation 270 are merely stored in a fingerprint section, multiple fingerprints for sharing the fingerprint section It is stored together.The appropriate selection of the fingerprint of condition code can be preferably minimized the influence.However, in order to eliminate the shadow It rings, the type 264 of the fingerprint of all matched fingerprint sections, feature code group pointer 266 or condition code pointer 268 and dislocation 270 can To be stored in another by the table of matched fingerprint section meaning.All addresses to matched fingerprint section can be used for Search an entry in the table.
In another embodiment, in addition to an at least fingerprint composite signal 274 or all fingerprint composite signals 276, refer to Line section Figure 27 2 can be further omitted, and be stored in any fingerprint section from without any fingerprint composite signal.Fingerprint Section can be synthesized according to the possible form of all fingerprints in described document information database.If multiple fingerprint sections meet all Any form of the form of fingerprint is considered as a fingerprint matching.For a kind of special circumstances, if multiple fingerprint sections expire The minimum requirements of all fingerprint formats of foot, is considered as a fingerprint matching.
When multiple fingerprint section parallel scans, the Hash-entry of " an at least fingerprint fingerprint matching " and " all fingerprint matchings " Block implementation different with one of its corresponding fingerprint synthesizer is for example shown in Fig. 2 H-I.As illustrated in figure 2h, the Hash-entry block Each of 259 includes a hashed value0B260, a fingerprint section matching 278, a type 264, a feature code group pointer 266 Or condition code pointer 268 and one dislocation 270.The fingerprint section matching 278 is the match bit of fingerprint section, other domains and figure Corresponding domain is identical in 2D.The fingerprint section matches some spy of 278 pairs of parallel fingerprint synthesizers as shown in figure 2i Fixed section adaptation has specific meaning (such as section adaptation 0, section adaptation 1, section adaptation 2 or section adaptation 3).Work as fingerprint Section is i-th of fingerprint section of any fingerprint, and the fingerprint section matching 278 of i-th section of adaptation will be set.In one embodiment In, i-th section of adaptation only stores i-th of fingerprint section of fingerprint, so as to which fingerprint section matching 278 is always set, it is convenient to omit. In one embodiment, in order to further reduce false positive synthesis, a similar at least fingerprint composite signal 274 or all fingerprints synthesize The fingerprint length information of information 276 can be stored in first fingerprint section of a fingerprint or all fingerprint sections.
Fig. 2 I demonstrate the one embodiment without the parallel fingerprint synthesizer 153 of fingerprint length information.The fingerprint Synthesizer 153 include one 2 input with door 282, one 3 input with door 286, one 4 input with door 288 and one 4 Input or door 285.The fingerprint synthesizer 153 exports a matching 0292 for all fingerprint matchings, and one matches 1294, one It is a matching 2296 and one matching 3298 and be an at least fingerprint matching export one match 290.In one embodiment, one A totality fingerprint length filters can use matching 0292, matching 1294, matching 2296, matching 3298 and matching 290, with Filter out impossible fingerprint length.In one embodiment, by increasing logic gate, the fingerprint synthesizer 153 can be by Extension, to support the fingerprint length information stored in the first fingerprint section or all fingerprint sections of each fingerprint.In a reality It applies in example, the long scan step-length that parallel fingerprint section scanning can be equal to fingerprint segment length several times with one, to accelerate sweep speed.
In general, in one embodiment, each fixed length condition code can be broken down into multiple segments to carry out independence It scans on ground.As scan with synthesis fingerprint section as, all segments of a condition code can be first scanned, then again by serially or It is parallel to close.In general, the number of different characteristic code segment length, also more much smaller than the number of different characteristic code length even if not being one, So as to accelerate the speed for the scan method for needing individually to be scanned for each modal length.In one embodiment, The identical condition code section of two or more length is by parallel scan.
In another embodiment, one or more of one or more hash tables and one or more Bloom filters It is used to scan through all referring to bright condition code section.In one embodiment, when at least one condition code matches, the fingerprint Synthesizer 150 can be used to that identified condition code section is serially or parallelly synthesized any condition code matching.Shown in Fig. 2 D-I Data structure and embodiment can be used for composite character code section.In another embodiment, one or more finite automatas (FA) and one or more of one or more content adressable memory (CAM) is used to matched condition code section to synthesize It is matched for any condition code.
Fixed length signature scan
As shown in Figure 1, the fixed length condition code Lookup engine 160 includes a condition code search engine 162, a condition code Verify device 164 and a fixed length condition code database 166.Condition code search engine 162 can identify a possible fixed length feature Code, the fixed length feature subcode of possible random length condition code or one include multiple possible fixed length condition codes or fixed length The condition code family of feature subcode.The possible fixed length condition code or feature subcode identified by condition code search engine 162 will be by Condition code is verified device 164 and is verified.Fixed length condition code database 166 is then characterized yard search engine 162 and condition code verifies device 164 Database.
Fixed length condition code database 166 can be implemented with plurality of data structures.In one embodiment, such as Fig. 3 A-B institutes Show, fixed length condition code database 166 is one and multiple condition code chained lists 350 are chained up with constitutive characteristic code group chained list 300 Two-dimensional chain table.The each single item of feature code group chained list 300 includes a next yard of pointer 302, a dislocation 304 and a spy Levy code pointer 306.Next yard of pointer 302 is the pointer of the next item down for being directed toward feature code group chained list 300, condition code pointer 306 be a pointer for being directed toward special characteristic code chained list 350.Dislocation 304 is to refer to from the first unit of particular fingerprint to condition code The dislocation of the first unit of the signified specific character string condition code of needle 306.
Each fixed length condition code or feature subcode can be broken down into multiple condition code sections 352, to form condition code chained list 350.Described document information section 352 can be suitable by scanning since the first basic unit of a fixed length condition code or feature subcode Sequence connects.In one embodiment, the size of described document information section is different.In another embodiment, all spies The size for levying code section is identical, and size can most preferably be selected according to system architecture.The each single item of condition code chained list 350 includes one A condition code section 352, a condition code section film 354, a latter end 356, a next segment pointer 358,360 and of type One number 362.Next segment pointer 358 is a pointer for being directed toward next section, and latter end 356 is latter end mark.When type 360 is When 0, number 362 is fixed length feature subcode number 364;Otherwise, number 362 is characterized code number 366.Condition code section film 354 is used In the specific matching condition for specifying each basic unit or even sub- basic unit, including " not having to matching ", " equal ", " not phase Deng ", " within the scope of one ", " outside a range ", " case-insensitive " and " case sensitive ".The matching condition It can be realized by the input source and output source of selecting unit comparator.If a fixed length condition code or feature subcode are not The integral multiple of the size of condition code section 352 can be filled out in the tail portion of the fixed length condition code or feature subcode with for up to " feature The size of code section 352 subtracts one " a " 0 " or other values, and the film of corresponding fills unit is set as " not having to matching ".
In one embodiment, the condition code section film 354 of each basic unit is 3 bits." case-insensitive " When, the first bit is set as 0;When " case sensitive ", the first bit is set as 1.Latter two bit is set when " equal " 0 is set to, latter two bit is set as 1 when " unequal ", latter two bit is set as 2 when " not having to matching ", latter two ratio Special position is retained for 3.More film bits can be used for realizing other matching conditions as needed, for example, pre-defined Range (numerical character or alphabetic character), symbol class or any range.In another embodiment, in order to improve storage efficiency, The code film of one or more features code section or fixed length condition code or feature subcode can be used alone or with basic unit or subunit Used in the code film cooperation of this unit.
In one embodiment, described document information search engine 162 can search feature code group chained list 300 until the chained list Tail item, i.e. next yard of pointer 302 is null pointer.The each single item of feature code group chained list 300 all returns to one feature of a direction The condition code pointer 306 of code chained list 350.
Described document information, which verifies device 164, can be verified as each progress of condition code chained list 350 condition code verification.Condition code core Real device 164 is since the first segment of condition code chained list 350, by each condition code section of scanning sequency paragraph by paragraph examination.If mismatched, Condition code, which verifies device 164, will stop hypomere inspection;Otherwise, condition code verify device 164 will check entire condition code chained list 350 until Its tail, i.e. latter end 356 are 1.If finding a matching, when type is 0, the fixed length subcode for random length condition code is matched, Condition code, which verifies device 164, will export a fixed length feature subcode number 364;When type is 1, match as fixed length condition code, spy Sign code, which verifies device 164, will export a fixed length condition code number 366.
In another embodiment, in order to which false positive matching is made to drop to zero with the rapid development of hop count, condition code section 352 It can be linked together by an optimal sequence.The dislocation for representing current signature code section to next condition code section will be added to condition code chain In each single item of table 350.Although the length of condition code section can differ, the length of a fixed condition code section can be chosen It selects.
Fig. 4 A-B demonstrate an implementation of a condition code unit comparator 400 and a condition code section comparator 450 The block diagram of example.Described document information unit comparator 400 carries out unit comparison in the lookup of fixed length condition code.Condition code unit ratio Include a code film decoder 402 compared with device 400, one 2 input multiple selector 404, an equality comparator 406, two non- Door 408, one 4 input multiple selector 410, one 2 input or door 412 and a range comparator 414.Code film decoder 402 can be decoded as code film bit equality comparator 406 and the control signal output and input of range comparator 414.One In a embodiment, range comparator 414 is optionally used to support the global scope or each of predefined each String field The global scope of condition code.In one embodiment, m range comparator 414 can be used for supporting m predefined global models It encloses.In another embodiment, units match 416 can be with " not having to match " bit logic or being exported later.
Multiple condition code unit comparators 400 and a multi input can be used to construction feature code section comparator with door 452 450.The data cell of condition code section comparator 450 is typically a byte, but is alternatively four bits or any other size. In another embodiment, condition code unit comparator 400 can be by a condition code unit comparator 480 as shown in Figure 4 C Substitution, with supported feature code unit local scope.The unit of each tape code film in described document information section 352 can be extended to The unit range of one tape code film or the upper bound of given condition code unit of tape code film and the unit pair of lower bound.
In one embodiment, fixed length condition code Lookup engine 160 shown in figure 1A may require to look up multiple condition codes Chained list.However, it is necessary to the probability for searching multiple condition code chained lists is usually very low.When requiring to look up multiple condition code chained lists, The differential encoding that multiple condition codes in a group condition code are mutually encoded can be used.
For example, in one embodiment, as shown in Figure 5A select string elements tree 500 can be as shown in figure 1A The searching data structure of condition code search engine 162.String elements tree 500 is selected to include burl 520a-520e.Select character string There are two branches for each burl 520 of cell tree 500, and a matching branch is by pointer1530 is signified, another mismatches branch By pointer2532 is signified.As shown in Figure 5 B, there are two types of different burls 520 for selection string elements tree 500:Type 528 is 1 Leaf and the non-leaf that type 528 is 0.Matched non-leaf always points towards another burl 520, and unmatched non-leaf is directed toward tree On another burl 520 or empty section.Matched leaf always points towards a condition code family chained list 550 as shown in Figure 5 C, no Matched leaf is directed toward another burl 520 or empty section on tree.
In one embodiment, as shown in Figure 5 B, each burl 520 of selection string elements tree 500 includes a mistake Position 522, a unit 524, a unit membrane 526, a type 528, a pointer1530 and a pointer2532.Type 528 include leaf as described above and non-leaf.Selected unit 524 can be in any position of condition code, when burl 520 is not During tree root, position provided by the dislocation 522 of previous burl 520 (for example, the front nodal point of node 520b is node 520a, but It is because node 520a is the tree root for selecting string elements tree 500,520a does not have front nodal point), when burl 520 is tree root When (for example, node 520a), position is provided by the dislocation 270 of the occurrence in fingerprint hash block chained list 250 (see Fig. 2 B).Often A burl include one with the condition code section film 354 in condition code chained list 350 (see Fig. 3 B) corresponding unit membrane 526.
At least one basic unit of any two character string condition code is different, as long as one of character string condition code is not It is the subsegment of another character string condition code.Therefore, a unit 524 can be chosen to distinguish at least two character string features Code, so that after the unit 524, at least one character string condition code can be eliminated.By to selection The lookup of string elements tree 500, the different condition code of at least one basic unit can be distinguished.In one embodiment In, shown selects string elements tree 500 as binary tree.In another embodiment, corresponding selection character can be built K points of trees of string location.K basic unit of the k condition code on same basic unit position in one feature code group can be made The burl set for k points, although each condition code can also be contributed in multiple basic units to each burl of k points of trees.
In one embodiment, a character string condition code can be a part for another character string condition code.These The condition code for having mother-child relationship (MCR) cannot be chosen the differentiation of string elements tree 500.In one embodiment, as long as finding therein Any one condition code is just not required to further search for, it is not necessary that distinguishing has the condition code of mother-child relationship (MCR).Therefore, it only needs to scan Wherein shortest condition code.But in another embodiment, it needs to distinguish all condition codes or needs to identify longest spy Levy code.
In order to support the condition code family of mother-child relationship (MCR), in one embodiment, a condition code as shown in Figure 5 C Family's chained list 550 can be characterized the searching data structure that code verifies device 164 (see Figure 1A).Condition code family chained list 550 it is every One includes a type 552, a dislocation 554, a condition code section 556, a condition code section film 558, a next item down Pointer 560, a numbering type 562 and a number 564.In one embodiment, for supported feature code family, feature There are two kind items for code family chained list 550:The result items that type 552 is 0 lookup item and type 552 is 1.In inspection, each is looked into When looking for, condition code section 556 will be compared according to condition code section film 558.Condition code section 556 and condition code section film 558 and feature Corresponding domain in code chained list 350 is identical (see Fig. 3 B).But result items are not required to any condition code section and compare.
In one embodiment, all matched condition codes of the system searching, and export each matched condition code Number 564.However, the lookup of condition code will proceed to the tail portion of condition code family chained list 550 or the next item down pointer 560 always For null pointer, to find all female condition codes.There are two types of numbers 564:Feature subcode number 566 and condition code number 568.It compiles Depending on numbers 564 type is by numbering type 562.When numbering type 562 is 0, number 564 is feature subcode number 566, is represented The fixed length feature subcode of one random length condition code;Otherwise, number 564 is condition code number 568, represents a fixed length feature Code.
In one embodiment, condition code family chained list 550 since a minimus generation or shortest subcode, always It is connected to an oldest generation or longest subcode.Dislocation 554 is specified from the first unit of existing condition code section 556 to lower condition code section The dislocation of 556 first unit.If existing condition code section mismatches, the lookup of condition code family chained list 550 can stop.It is fixed to provide Arbitrary dislocation 554 between feature subcode makes to terminate into possibility in advance based on unmatched.
In one embodiment, every generation of condition code family chained list 550 can only support a condition code.If one is specific , there are multiple condition codes in generation, can use multiple condition code families chained list 550, each condition code in the specific generation needs a spy Zheng Ma families chained list 550.Multiple condition code families chained list 550 can be distinguished with selection string elements tree 500.
In another embodiment, the condition code of tape code film can use one or more memories content addressed including one Memory (CAM) is stored and is scanned.
Random length signature scan
As shown in Figure 1, the fixed length condition code Lookup engine 160 exports all matched random length features by scanning sequency The number of the fixed length feature subcode of code, size, the position in the String field.With identified fixed length feature subcode Identified fixed length feature subcode is synthesized any random length condition code by information, random length condition code Lookup engine 180.One In a embodiment, one or more finite automatas (FA) are used to synthesis fixed length feature subcode.In another embodiment, no Fixed length condition code Lookup engine 180 include a condition code rule searching device 182, a condition code state verification device 184, one Condition code composition rule database 186 and a condition code state table 188.Condition code rule database 186, which provides, how will The fixed length feature subcode of random length condition code synthesizes the static rule of random length condition code.188 dynamic of condition code state table State of a process of the ground for an input String field storage synthesis.
Condition code rule searching device 182 is found out and matched fixed length feature from condition code rule database 186 Code numbers relevant rule, and the relevant condition code rule is supplied to condition code state verification device 184.Condition code state Matched fixed length feature subcode is synthesized random length condition code, and update spy by validator 184 according to condition code composition rule Levy code state table 188.
Condition code rule database 186 and condition code state table 188 can have a variety of different data structures.In a reality It applies in example, condition code rule database 186 can be condition code regulation linked 600 as shown in Figure 6.Condition code regulation linked Pointed by 600 numbers of feature subcode found by fixed length condition code Lookup engine 160.Multiple random length condition codes may include phase Same fixed length feature subcode.Condition code regulation linked 600 can be special by all fixed length representative comprising special characteristic subcode number The random length condition code of sign subcode links together.
In one embodiment, each single item of condition code regulation linked 600 corresponds to a random length condition code.Each single item packet Containing a condition code number 602, a sequence 604, a last code 606, a next yard of pointer 608 and a distance range Information 610.Condition code number 602 represents specific random length condition code.Sequentially 604 is determine representated by feature subcode number Sequence of the long feature subcode in all fixed length feature subcodes of the random length condition code.Last code 606 indicates that the fixed length is special Sign subcode whether be random length condition code the last one fixed length feature subcode.Last code 606 indicates random length signature scan Terminate.Next yard of pointer 608 is directed toward 600 the next item down.Distance range information 610 is indicates the fixed length feature subcode under The optional domain of the distance range (i.e. minimum and most unit number between two fixed length feature subcodes) of fixed length feature subcode.For example, When the range or longest distance or the shortest distance are previously given or when being infinitely great, distance range information 610 can be saved Omit or be reduced to the shortest distance or longest distance.
In another embodiment, each single item of condition code regulation linked 600 can include one or more additional fields, with The one or more random length feature subcodes between two or more fixed length feature subcodes of description.For example, " a mould can be used One " pattern " of repetition that one is used for filling up distance range information 610 is added to feature by formula " or one " pointer of pattern " In each single item of code regulation linked 600, to describe random length feature subcode.
As shown in fig. 7, in one embodiment, condition code state table 188 can use one or more features code state Chained list 700 is implemented.Each condition code state-chain-table 700 can be a String field of particular link, and dynamic memory is The condition code state of all fixed length feature subcodes of all random length condition codes of identification.Condition code state-chain-table 700 it is each Item includes a condition code number 702, previous subcode sequence (lorder) 704, a next subcode position 706 (nloc) and a next yard of pointer 708 (nptr).Condition code number 702 is the number of specific random length condition code.Previous son Code sequence 704 be the specific random length condition code previous fixed length feature subcode number to specific fixed length feature son The sequence of code.Next subcode position 706 be next fixed length feature subcode of the specific random length condition code number to A fixed length feature subcode active position range.
In one embodiment, there are one condition code state-chain-tables 700 for each String field of each line.In general, At each moment, each line is only scanned there are one String field, and only there are one condition code state-chain-tables 700.Condition code State-chain-table 700 may include all matched subcodes of all random length condition codes of a String field of particular link Entire effective history.
Condition code state-chain-table 700 can be dynamic.In one embodiment, if the volume of the fixed length feature subcode The first fixed length feature subcode of number signified fixed length feature subcode for a random length condition code, i.e. its sequence 604 are 1, and There are no the item of the specific random length condition code, a new item can be added into the spy of a String field of particular link It levies in code state-chain-table 700.If the position of first unit of existing matched fixed length feature subcode is not in next subcode bits Put 706 to effective range in or time-out occur, one of described document information state-chain-table 700 can be deleted.It is described One of condition code state-chain-table 700 can also be deleted after based on the item matched random length condition code is found It removes.In one embodiment, all items of the condition code state-chain-table 700 of a String field of particular link can be one The end of a String field is deleted.
As shown in figures 1 to 6, in one embodiment, described document information rule searching device 182 is from the fixed length condition code Lookup engine 160 receives the number of a fixed length feature subcode.Described document information rule searching device 182 searches the entire fixed length The condition code regulation linked 600 of the number meaning of feature subcode, and each single item of described document information regulation linked 600 is given Information (such as { condition code number 602, sequence 604, last code 606, distance range information 610 }) and fixed length condition code lookup are drawn Hold up 160 to information (as the position of the first unit of fixed length feature subcode, the length of fixed length feature subcode, line number, String field is numbered }) it is sent to successively together in condition code state verification device 184, reach up to described document information rule chain The tail portion of table 600 (i.e. next yard of pointer 608 is null pointer).
Described document information state verification device 184 is looked into the information of each single item that described document information rule searching device 182 is given The condition code state-chain-table 700 for looking for the line number signified.To each single item of condition code state-chain-table 700, if condition code Number 602 and condition code number 702 are identical, and sequence 604 is equal to previous subcode sequence 704 plus one, and the of fixed length feature subcode As soon as in the range of the effective marker that the position of basic unit is given in next subcode position 706 is put, for a matching.To feature Each occurrence of code state-chain-table 700, if last code 606 is 1, output described document information number 602, and by the Xiang Congte It is deleted in sign code state-chain-table 700;Otherwise, the item is updated, and the value 704 of previous subcode sequence is allowed to be equal to the value of sequence 604, The value of next subcode position 706 be equal to fixed length feature subcode first unit position, the length of fixed length feature subcode and away from The summation of value from range information 610.It is not required to do anything to unmatched item.
In one embodiment, when condition code state-chain-table 700 is short and particular link only needs scanning one in particular moment String field can be scanned with described document information state-chain-table 700.But if condition code state-chain-table 700 is grown or spy Multiple String fields need to be scanned in particular moment by determining line, and other searching data structures can be used for described document information state table 188.In one embodiment, described document information state table 188 can be analogous to one of the data structure in Fig. 2A-C Condition code state Bloom filter or a condition code state hash table.Condition code state Bloom filter or condition code state dissipate The hashed key of list can be 3 tuples { line is numbered, String field number, condition code number }.In one embodiment, when Each line only has a String field in particular moment and is scanned, and is not required to be numbered with String field.
As shown in figure 8, in one embodiment, the Hash-entry block 256 in Fig. 2A-C can be hashed item 856 and take Generation.The each single item of Hash-entry block 856 includes special sign code number 860, a line number 862, a String field Number 864, a previous subcode sequence 866 and a next subcode position 868.Previous subcode sequence 866 and next subcode bits The definition for putting 868 and previous subcode 704, the hashed keys 3 yuan identical with the definition of next subcode position 706 of sequence in described Fig. 7 Group { condition code number 860, line number 862, String field number 864 } can be stored and be made to solve Hash collision. To each single item of Hash-entry block 856, if former hashed key is identical, sequence 604 is equal to previous subcode sequence 866 plus one, and fixed It it is just one within the scope of the active position that the position of the first unit of long feature subcode is given in next subcode position 868 Match.To each occurrence of Hash-entry block 856, if last code 606 is 1, output described document information number 602, and by described in Item is deleted from Hash-entry block 856;If last code 606 is 0, the item of the Hash-entry block 856 is updated, and allows previous subcode Sequentially 866 are equal to sequence 604, and next subcode position 868 is equal to the position of the first unit of fixed length feature subcode, fixed length feature The length of code and the summation of distance range information 610.Xiang Ze is mismatched to remain unchanged.
In one embodiment, when the sequence of two continuous fixed length feature subcodes and distance range meet condition code rule It is one or more in chained list 600, condition code state verification device 184 will further verify whether described two fixed length feature subcodes Between string segments and condition code regulation linked 600 one or more meanings one or more random length feature subcodes Match.When the character string in the specific character string field between described two continuous fixed length feature subcodes and the feature During the random length feature subcode matching of one or more meanings of code regulation linked 600, described document information state-chain-table 700 It will be updated with the new fixed length feature subcode.
The design and performance of scanning system
In one embodiment, the speed of quick character string signature scan engine 100 is by the speed of finger scan engine 140 Degree limitation, for example, when the false positive matches the sufficiently small and follow up scan stage and properly designed.When fingerprint is whole as one When body is scanned by length one by one, the speed of finger scan engine 140 depends on scanning step, the number and clock of fingerprint length Velocity composition.In one embodiment, the speed of the scanning engine 100 is (s/m) * R, and wherein s is scanning step, and m is refers to The number and R of line length are clock speed.If for example, scanning step be 8 bytes, 4,8,16 and 32 byte of fingerprint length, when Clock frequency is 500MHz, and the speed of the scanning engine 100 is 8* (8/4) * 500MB/s=8Gbits/s.
In another embodiment, if fingerprint is first segmented carry out parallel scan, then " an at least fingerprint matching " is used again Serial synthesis, and fingerprint segment length is identical with scanning step, the speed of an individual scanning engine 100 is s*R.In a reality It applies in example, when s and R takes the value identical with precedent, the scanning engine 100 can be with one character string of 32Gbps velocity scannings Field.In addition, in another embodiment, fingerprint can first be segmented and be scanned, then parallel synthesis again, scanning step is so as to sweep Can be further improved by retouching speed by n times, and wherein n is the number of the fingerprint section of parallel scan and synthesis.When fingerprint segment length and R take During identical with precedent value, if n is 4, i.e. for 4 fingerprint sections by parallel scan and synthesis, scanning step is 32 bytes, described to sweep Retouching engine 100 can be with one String field of 128Gbps velocity scannings.
Sweep speed discussed above is the speed of a single signature scan engine.In one embodiment, when When carrying out parallel scan with several signature scan machines, signature scan speed can further improve several times.
In one embodiment, the selection of the overall structure and parameter of a signature scan system can according to one or Multiple following factors:The fixed length feature subcode of the sweep speed of character string condition code, fixed length condition code or random length condition code Size, the size of similitude and condition code database between multiple condition codes or feature subcode select, to ensure that fingerprint is swept Engine 140 is retouched, fixed length condition code Lookup engine 160 and random length condition code Lookup engine 180 can meet specific scanning system Requirement.For example, scanning step can be selected according to system requirements.As shown in table 1, scanning step is longer, quick character string The speed of signature scan engine 100 is faster.But the minimum dimension of fixed length condition code or feature subcode just need it is bigger, with And the number for being inserted into and deleting is more.The step-length that exposes thoroughly also limits the selection of the fingerprint of each condition code, and increases fingerprint and touch It hits and the matched probability of fingerprint false positive.
Table 1:The selection of scanning step
In addition, scanning step can especially be limited so as to sweep speed by the size of minimum fixed length condition code or feature subcode. In one embodiment, in order to avoid short condition code is separately scanned, scanning step can be by table 1 according to fixed length condition code or spy The minimum dimension of subcode is levied to select.
Table 1 assumes that each basic unit of all fixed length condition codes and feature subcode can be with for fingerprint.In a reality It applies in example, all fixed length condition codes and feature subcode at least indicate completely in a shadow space.In addition, at another In embodiment, scanning step is so as to which sweep speed is further by the shadow indicated completely of all fixed length condition codes and feature subcode Minimum dimension limitation.Therefore, can make into the third column heading of table 1 " all fixed length condition code or feature subcode it is complete All referring to the minimum dimension of bright shadow ".
In another embodiment, in order to improve sweep speed, larger scanning step can be selected.Shorter than described scanning The scannable condition code of step-length can be scanned separately, for example, with above-mentioned scan method or any other scan method.When only few Number fixed length condition code or feature subcode are shorter, and it is very effective to increase scanning step.
In another embodiment, the engine number of different scanning flow line stage can be different.Engine can be according to institute The requirement of particular system is stated to select.For example, for particular system, a scanning pre-processing engine 120, four fingerprints can be used Scanning engine 140, a fixed length condition code Lookup engine 160 and two random length condition code Lookup engines 180.
In one embodiment, multiple finger scan engines 140 can be used, so that each finger scan engine 140 One group of fingerprint length is covered, to provide multiresolution finger scan.In one embodiment, all fingerprints are all broken down into phase It is scanned with the fingerprint section of length and the identical scanning steps of all fingerprint Duan Douyong.For scanning all fingerprint length groups The number of the finger scan engine of every group of fingerprint length can be all identical.
In another embodiment, fingerprint is broken down into the fingerprint section of different length, and the fingerprint section of different length is according to one The average length of group fingerprint length is scanned with the scanning step of different length, so that the finger that one group of average length is longer Fingerprint longer fingerprint section and the larger scanning step of line length, the fingerprint of the shorter fingerprint length of one group of average length are used Shorter fingerprint section and smaller scanning step.For example, the fingerprint section and scanning step of 8 basic units can be used for length is The fingerprint of 8,16,24,32 and 40 basic units, and the fingerprint section of 2 basic units and scanning step can be used for length is The fingerprint of 2,4 and 6 basic units.
In one embodiment, drawn to balance with the finger scan of the fingerprint section of different scanning step scan different length 140 speed is held up, the number of the finger scan engine 140 of the shorter fingerprint section of the shorter scanning step scanning of use can be more than is swept with longer Retouch the number of the finger scan engine 140 of the longer fingerprint section of step scan.In another embodiment, it is not synchronized in order to balance use The speed of the finger scan engine 140 of the memory of degree, the number of the finger scan engine 140 of the slower memory of use can be more than with very fast The number of the finger scan engine 140 of memory.In general, in another embodiment, the number of finger scan engine 140 can Depending on by the product of scanning step and memory speed.The finger scan engine of the product of smaller scanning step and memory speed 140 number can be more than the number of the finger scan engine 140 of the product of larger scanning step and memory speed.
In one embodiment, the finger scan engine 140 of the identical same group of fingerprint of scanning of multiple scanning steps covers Multiple nonoverlapping positions staggeredly in one input String field, so that the multiple finger scan engine is always swept Retouch the number and the product of the scanning step of former single finger scan engine that step-length is finger scan engine.For example, in order to provide Identical sweep speed, it can be scanning step be 8 bases that scanning step, which is the number of the finger scan engine of 2 basic units, 4 times of the number of the finger scan engine of this unit.In another embodiment, same group of the identical scanning of multiple scanning steps The position staggeredly of multiple portions overlapping in the one input String field of the covering of finger scan engine 140 of fingerprint, so as to make The total scanning step for obtaining the multiple finger scan engine is more than the scanning step of former single finger scan engine, but is swept than fingerprint The product for retouching the number of engine and the scanning step of former single finger scan engine is small.
In one embodiment, the fingerprint database 148 of different fingerprint segment length is storable in the memory of friction speed, So that the memory for shorter fingerprint section is faster than the memory for being used for longer fingerprint section.In one embodiment, it is different Fixed length condition code database 166 corresponding to fingerprint length group is storable in the memory of friction speed, so that for one The memory of the shorter fingerprint of group average length is faster than the memory for being used for one group of longer fingerprint of average length.
In one embodiment, the fingerprint database 148 of the shorter than fingerprint of specific length (for example, 9 basic units) can It is stored in one of most fast memory (for example, caching of on-chip memory or CPU) of scanning system.In one embodiment, The all or part of fixed length condition code database 166 corresponding to the fingerprint of shorter than described specific length also can and fingerprint database 148 are collectively stored in one of most fast memory of scanning system.In another embodiment, multiple fingerprints of same fingerprint group Scanning engine can share a fingerprint database being stored in the memory of a multiport 148.
In one embodiment, one or more engines discussed above in different flow line stages can be any Other scan methods are replaced.For example, in one embodiment, a content adressable memory (CAM) can be used as finger scan Engine 140 is used to the shadow of scanning fingerprint, and fixed length condition code Lookup engine 160 and random length condition code Lookup engine 180 It still can be used to further scanning feature code.In another embodiment, CAM can be used as finger scan engine 140 by with Come in former spacescan one or more fingerprint.In one embodiment, one determine or non-deterministic finite automaton (DFA or NFA) synthesis fingerprint section can be used to as fingerprint synthesizer 150.In another embodiment, a DFA or NFA can make A fixed length feature subcode, which is used to, for random length condition code Lookup engine 180 synthesizes random length condition code.
Other embodiments of the invention can scan other string datas.For example, in biosystem, a hereditary generation Code sequence can be used as a String field.Describing the condition code of specific gene can be used for forming from one by genetic data String field in identify the specific gene sequence.For example, specific gene can be with the scanning machine by specific special Code is levied to identify.
The present invention and the described all feature operations of this specification, including the described structural approach of this specification or phase When structural approach, or both combination, can use and Fundamental Digital Circuit or be implemented with computer software, firmware or hardware. The present invention can be implemented on one or more computer program products, i.e., one or more be stored in one such as one it is machine readable Storage device or a transmitting signal information carrier in, a such as programmable processor, one or multiple stage computers Digital processing device is performed or computer program that it is controlled to operate.One computer program (an also referred to as program, it is soft Part, application software or code) any programming language form for including compiling or interpretative code is can be written as, and can be deployed as Including as a stand-alone program or being used as a module, component, subprogram or other units suitable for computing environment Any form.One computer program might not correspond to a file.One program can be with other programs or data together Deposit in one file, can be a Single-issue program file, can be multiple coordinations file (for example, storage one Or multiple modules, multiple files of subprogram or partial code).One computer program can be deployed to a computer, The multiple stage computers in same place or be distributed in the multiple stage computers interconnected with communication network in multiple places perform.
Processing routine described by this specification and logic flow include the method and step of the present invention, can use and perform one Or one or more programmable processors of multiple computer programs perform, with by operation input data and generate output come Implement the function of the present invention.The logic circuit of specific use, such as field programmable gate array (FPGA) or application-specific integrated circuit (ASIC), it can also be used to perform the processing procedure and logic flow and implement the device of the invention.
The processor for being adapted for carrying out computer program includes, such as the microprocessor of general and special purposes and any Any one or more processors of the digital computer of type.In general, processor will from read-only memory or with Machine access memory or both receives instruction and data.The basic module of one computer is a processing for being used for execute instruction Device and the memory of one or more store instructions and data.In general, a computer is further included or is efficiently couple to One or more is used for storing the mass-memory unit (e.g., disk, magneto-optic disk or CD) of data, to receive, sends or receives Send out data.Include the non-volatility memorizer of all forms suitable for the information carrier of storage computer program instructions and data, Including such as EPROM, EEPROM, the semiconductor memory of quick flashing (flash) memory;Such as internal hard drive or the magnetic of moveable magnetic disc Disk;Magneto-optic disk;With CD-ROM and DVD-ROM CDs etc..Processor and memory can be mended by the logic circuit of specific use Fill or be included into the logic circuit of specific use.
In order to provide the interaction with user, the present invention can have display equipment at one, and a keyboard and a pointer are set Implement on standby computer.Show equipment, such as a cathode-ray tube (CRT) or liquid crystal display (LCD) for user for showing Show information;Keyboard and pointing device such as mouse or trace ball, then help user to provide input directly in computer.It is other types of Equipment can also be used for providing the interaction of user and computer;For example, it can be any form that computer, which is supplied to the feedback of user, Sense feedback, for example, visual feedback, audio feedback or touch feedback;The form that user is input to computer may be to appoint What form, including acoustics, voice or sense of touch.
The present invention can be implemented in a computer system, after one such as one data server End component such as the intermediate module (Middleware) of application server or has comprising one such as one comprising one The client computer of one graphic user interface or by its user can be interactive with the implementation of the present invention network (Web) it is clear Look at device front end assemblies or any this kind of rear end, intermediate or front end assemblies combinations.The component of the system can be by such as Any form of communication network or the digital data communications of medium are connected with each other.The example of communication network includes local network (LAN) With wide area network (WAN), such as internet.
This computing system can include client and server.Client and server is typically remote from other side, usually passes through One communication network interaction.The relationship of client and server result from run on respective computer respectively have client and The calculation procedure of service relation.
Fig. 9 demonstrates an example of such computer, and one can be used for being practiced or carried out the device of the invention Or the block diagram of the programmable processing system (abbreviation system) 910 of method.The system 910 include a processor 920, one A random access memory (RAM) 921, (the writeable of such as one such as flash ROM read-only is deposited for read-only memory 922 Reservoir (ROM)), a hdd controller 923, the control of an image controller 931 and an input/output (I/O) Device 924 processed, and coupled by processor (CPU) bus 925.The system 910 can be programmed such as in ROM, can also be used another One program source (such as floppy disk, CD-ROM or another computer) loads a program so as to be programmed (and being reprogrammed).
Hdd controller 923 and hard disk 930 are coupled available for storing executable computer program.
The controller 924 of input/output is to be connected to an input/output interface by an input/output bus 926 927.Input/output interface 927 is in serial link, LAN, on wireless connection and parallel link etc. communication link, receive and Transmit analog or digital data (such as one group of stage photo, picture, film and animation).
One display 928 and a keyboard 929 are also connected on input/output bus 926.In addition, different line (different buses) can be used for connecting input/output interface 927, display 928 and keyboard 929.
The present invention is described in a manner of specific embodiment.Other embodiments the scope of the appended claims it It is interior.For example, the step of invention, can perform, and remain to reach desired result in differing order.

Claims (10)

1. a kind of character string signature scan method, the method scans one or more features code with larger scanning step, The shorter than described scannable condition code of scanning step is separately scanned with different scanning method.
2. a kind of character string signature scan method, the method scans one or more features code with larger scanning step, The shorter than described scannable condition code of scanning step is separately scanned with same scan method but different scanning step-length.
3. a kind of character string signature scan system, the system comprises scan one or more features with larger scanning step Code, the device that the shorter than described scannable condition code of scanning step is separately scanned with different scanning method.
4. a kind of character string signature scan system, the system comprises scan one or more features with larger scanning step Code, the dress that shorter than described scanning step scannable condition code same scan method but different scanning step-length are separately scanned It puts.
5. a kind of character string signature scan method, the method includes:
Multiple condition codes are divided into multiple feature code characters, wherein every group of condition code includes one or more features code;
Any condition code in a String field is identified, including being looked into multigroup scanning engine scanning String field Multigroup condition code is looked for, wherein every group of scanning engine includes one or more identical scannings for being used to search one set of feature code Engine to cover the position of one or more groups of String fields staggeredly, use by the wherein scanning engine of different scanning engine group The number of the scanning engine of the shorter scanning engine group of different scanning steps and scanning step is longer compared with scanning step to sweep The number for retouching the scanning engine of engine group is more;With
Export any matched condition code in the String field.
6. a kind of character string signature scan system, the system comprises:
One condition code preprocessing module that multiple condition codes can be divided into multiple feature code characters, wherein every group of condition code includes one A or multiple condition codes;With
One signature scan module that can recognize that any condition code in a String field, described document information scan module The String field is scanned to search the scanning engine of multigroup condition code including multigroup, wherein every group of scanning engine packet It includes one or more identical for searching the scanning engine of one set of feature code to cover one or more groups of characters staggeredly The position of string field, the wherein scanning engine of different scanning engine group are swept with different scanning steps and scanning step is shorter The number for retouching the scanning engine of engine group is more compared with the number of the scanning engine of long scanning engine group compared with scanning step.
7. a kind of character string signature scan method, the method includes:
Multiple condition codes are divided into multiple feature code characters, wherein every group of condition code includes one or more features code;
Any condition code in a String field is identified, including being looked into multiple scanning engines scanning String field Look for multigroup condition code, wherein each scanning engine is for searching one set of feature code, wherein different scanning engine at one or The memory of friction speed is used in multiple scanning flow line stages;With
Export any matched condition code in the String field.
8. a kind of character string signature scan system, the system comprises:
One condition code preprocessing module that multiple condition codes can be divided into multiple feature code characters, wherein every group of condition code includes one A or multiple condition codes;With
One signature scan module that can recognize that any condition code in a String field, described document information scan module The String field is scanned to search the scanning engine of multigroup condition code including multiple, wherein each scanning engine is used In searching one set of feature code, wherein different scanning engine is used in one or more scans flow line stage in friction speed It deposits.
9. a kind of character string signature scan method, the method includes:
Multiple condition codes are divided into multiple feature code characters, wherein every group of condition code includes one or more features code;
Any condition code in a String field is identified, including being looked into multigroup scanning engine scanning String field Multigroup condition code is looked for, wherein every group of scanning engine includes one or more identical scannings for being used to search one set of feature code Engine to cover the position of one or more groups of String fields staggeredly, use by the wherein scanning engine of different scanning engine group Different scanning step uses the memory of friction speed in flow line stages in one or more scan;With
Export any matched condition code in the String field.
10. a kind of character string signature scan system, the system comprises:
One condition code preprocessing module that multiple condition codes can be divided into multiple feature code characters, wherein every group of condition code includes one A or multiple condition codes;With
One signature scan module that can recognize that any condition code in a String field, described document information scan module The String field is scanned to search the scanning engine of multigroup condition code including multigroup, wherein every group of scanning engine packet It includes one or more identical for searching the scanning engine of one set of feature code to cover one or more groups of characters staggeredly The position of string field, the wherein scanning engine of different scanning engine group are with different scanning steps or in one or more scanning stream The memory of friction speed is used in last pipeline stages.
CN201711339378.4A 2008-10-20 2008-10-20 Fast signature scan Pending CN108197470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711339378.4A CN108197470A (en) 2008-10-20 2008-10-20 Fast signature scan

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201711339378.4A CN108197470A (en) 2008-10-20 2008-10-20 Fast signature scan
CN201410055830.4A CN103793522B (en) 2008-10-20 2008-10-20 Fast signature scan
CN200880127748.0A CN101960469B (en) 2008-10-20 2008-10-20 Fast signature scan
PCT/US2008/080457 WO2010047683A1 (en) 2008-10-20 2008-10-20 Fast signature scan

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN200880127748.0A Division CN101960469B (en) 2008-10-20 2008-10-20 Fast signature scan

Publications (1)

Publication Number Publication Date
CN108197470A true CN108197470A (en) 2018-06-22

Family

ID=42119542

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201711339378.4A Pending CN108197470A (en) 2008-10-20 2008-10-20 Fast signature scan
CN200880127748.0A Active CN101960469B (en) 2008-10-20 2008-10-20 Fast signature scan

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN200880127748.0A Active CN101960469B (en) 2008-10-20 2008-10-20 Fast signature scan

Country Status (2)

Country Link
CN (2) CN108197470A (en)
WO (1) WO2010047683A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104380308A (en) * 2012-05-31 2015-02-25 Opto电子有限公司 Read-in device, read-in result output method, and program
EP3091450B1 (en) * 2015-05-06 2017-04-05 Örjan Vestgöte Method and system for performing binary searches
CN105095367B (en) * 2015-06-26 2018-12-28 北京奇虎科技有限公司 A kind of acquisition method and device of client data
CN112673576B (en) * 2018-09-14 2024-04-19 西门子股份公司 Data compression method, data recovery method and device
US20230104304A1 (en) * 2021-09-28 2023-04-06 Rakuten Mobile, Inc. Logic-gate based non-deterministic finite automata tree structure application apparatus and method

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04306785A (en) * 1991-04-03 1992-10-29 Mitsubishi Electric Corp Pattern recognition system
CN1459761A (en) * 2002-05-24 2003-12-03 清华大学 Character identification technique based on Gabor filter set
US20050086520A1 (en) * 2003-08-14 2005-04-21 Sarang Dharmapurikar Method and apparatus for detecting predefined signatures in packet payload using bloom filters
CN1625121A (en) * 2003-12-05 2005-06-08 中国科学技术大学 Hierarchical cooperated network virus and malice code recognition method
CN1648901A (en) * 2005-02-03 2005-08-03 中国科学院计算技术研究所 Method and system for large scale keyboard matching
CN1997011A (en) * 2006-07-26 2007-07-11 白杰 Data partition method and data partition device
US20080010278A1 (en) * 2006-07-06 2008-01-10 Lukas Kencl Substring detection system and method
CN101165681A (en) * 2006-10-17 2008-04-23 中兴通讯股份有限公司 Character string matching information processing method in communication system
CN101194256A (en) * 2004-11-12 2008-06-04 谷歌公司 Method and system for autocompletion for languages having ideographs and phonetic characters
CN101243441A (en) * 2005-06-21 2008-08-13 国际字符股份有限公司 Method and apparatus for processing character streams
US20080201140A1 (en) * 2001-07-20 2008-08-21 Gracenote, Inc. Automatic identification of sound recordings

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10503609A (en) * 1994-07-26 1998-03-31 インターナショナル データ マトリックス,インコーポレイティド Method and system for manufacturing an uncorrectable self-verifying article and verifying its authenticity
US7607018B2 (en) * 2001-05-08 2009-10-20 Ip.Com, Inc. Method and apparatus for collecting electronic signatures
CN1567174A (en) * 2003-06-09 2005-01-19 吴胜远 Method for expressing and processing object and apparatus thereof
CN1972292B (en) * 2005-10-17 2012-09-26 飞塔公司 Systems and methods for processing electronic data
US20080005578A1 (en) * 2006-06-29 2008-01-03 Innovya Research & Development Ltd. System and method for traceless biometric identification

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04306785A (en) * 1991-04-03 1992-10-29 Mitsubishi Electric Corp Pattern recognition system
US20080201140A1 (en) * 2001-07-20 2008-08-21 Gracenote, Inc. Automatic identification of sound recordings
CN1459761A (en) * 2002-05-24 2003-12-03 清华大学 Character identification technique based on Gabor filter set
US20050086520A1 (en) * 2003-08-14 2005-04-21 Sarang Dharmapurikar Method and apparatus for detecting predefined signatures in packet payload using bloom filters
CN1625121A (en) * 2003-12-05 2005-06-08 中国科学技术大学 Hierarchical cooperated network virus and malice code recognition method
CN101194256A (en) * 2004-11-12 2008-06-04 谷歌公司 Method and system for autocompletion for languages having ideographs and phonetic characters
CN1648901A (en) * 2005-02-03 2005-08-03 中国科学院计算技术研究所 Method and system for large scale keyboard matching
CN101243441A (en) * 2005-06-21 2008-08-13 国际字符股份有限公司 Method and apparatus for processing character streams
US20080010278A1 (en) * 2006-07-06 2008-01-10 Lukas Kencl Substring detection system and method
CN1997011A (en) * 2006-07-26 2007-07-11 白杰 Data partition method and data partition device
CN101165681A (en) * 2006-10-17 2008-04-23 中兴通讯股份有限公司 Character string matching information processing method in communication system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
N. MEKUZ 等: "Adaptive Step Size Window Matching for Detection", 《 18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR"06)》 *
宋明秋 等: "入侵检测多模式匹配算法", 《计算机工程》 *

Also Published As

Publication number Publication date
CN101960469B (en) 2014-03-26
WO2010047683A1 (en) 2010-04-29
CN101960469A (en) 2011-01-26

Similar Documents

Publication Publication Date Title
US10949641B2 (en) Fast signature scan
US6892207B2 (en) Method of updating data in a compressed data structure
US10095755B2 (en) Fast identification of complex strings in a data stream
US9619585B2 (en) Fast, scalable dictionary construction and maintenance
US20070198566A1 (en) Method and apparatus for efficient storage of hierarchical signal names
US20070255748A1 (en) Method of structuring and compressing labeled trees of arbitrary degree and shape
AU777792B2 (en) System for managing RDBM fragmentations
US20040133581A1 (en) Database management system, data structure generating method for database management system, and storage medium therefor
US20080319987A1 (en) System, method and program for creating index for database
WO2003065175A2 (en) A system and method for real time interface translation
US20020093522A1 (en) Methods of encoding and combining integer lists in a computer system, and computer software product for implementing such methods
EP1352344A2 (en) Efficient searching techniques
US10275486B2 (en) Multi-system segmented search processing
CN108197470A (en) Fast signature scan
Yang et al. Pase: Postgresql ultra-high-dimensional approximate nearest neighbor search extension
CN103793522B (en) Fast signature scan
CN111813744A (en) File searching method, device, equipment and storage medium
US20080306948A1 (en) String and binary data sorting
US10949465B1 (en) Efficient graph tree based address autocomplete and autocorrection
JP3728264B2 (en) Index creation apparatus, search system, and control method
JP3062119B2 (en) Character string search table, method for creating the same, and character string search method
CN104484381B (en) For searching for the method and system of multiple strings
CA2841027C (en) Fast identification of complex strings in a data stream
Dorfman et al. Black Belt Hashigana
CN114911808A (en) Dynamic publishing and subscribing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180622