WO2004081819A1 - A method and system for pattern matching - Google Patents
- Publication number
- WO2004081819A1 (PCT/IN2004/000059)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- pattern
- source
- sequence
- target
- group leader
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/02—Comparing digital values
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2207/00—Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F2207/02—Indexing scheme relating to groups G06F7/02 - G06F7/026
- G06F2207/025—String search, i.e. pattern matching, e.g. find identical word or best match in a string
Definitions
- Pattern matching is the process of finding some or all of the occurrences of a target pattern in a source pattern.
- Compressed pattern matching is the process of finding some or all of the occurrences of a target pattern in a compressed source pattern without decompressing the source pattern.
- Pattern matching is the application of analytical rules to a block of data to identify a feature of that block of data.
- the most common pattern matching problem is the process of finding some or all occurrences of a sequence of elements [Y1...Ym] (target pattern) within a larger sequence of elements [X1...Xn] (source pattern).
- the elements come from a finite element set - an alphabet set.
- the set may be the English alphabet, {0,1}, natural numbers, etc.
- the most popular algorithms for this problem are the Knuth-Morris-Pratt algorithm, the Boyer-Moore algorithm and the Rabin-Karp algorithm.
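As a point of reference for the uncompressed problem stated above, the following is a minimal sketch that reports every occurrence of a target pattern in a source pattern. It relies on Python's built-in `str.find` rather than any of the named algorithms, and the function name `find_all` is an illustrative assumption, not something defined in the source.

```python
def find_all(source, target):
    """Return the 0-based start index of every occurrence, overlaps included."""
    hits = []
    start = source.find(target)
    while start != -1:
        hits.append(start)
        # Resume one character later so overlapping matches are reported too.
        start = source.find(target, start + 1)
    return hits


print(find_all("abcabcab", "cab"))  # [2, 5]
```

Stopping after the first hit gives the "some occurrences" variant of the problem.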
- Pattern matching is used for simple text search, searching for data in image data, speech data, video data, audio data, bio-medical sequence analysis, etc.
- Data compression is mainly used for reducing storage space and to speed up data transmission.
- Various forms of compression are known. Of particular interest is arithmetic coding compression for which compressed pattern matching has not heretofore been possible. Arithmetic coding originated in the 1970s and 1980s (see for example US 4,122,440). Arithmetic coding is used in several applications, including Speech and Medical Image compression.
- in compressed pattern matching, the pattern matching is performed in the compressed domain.
- compressed pattern matching of, for example, text strings can be stated as follows. Let ac be a given compression algorithm, and let ac(D) be the result of ac compressing data D.
- Input: compressed text ac(T) and compressed pattern ac(P).
- Output: all or some locations in T where pattern P occurs.
- a method of determining whether a target pattern is present within a source pattern composed of one or more characters from an alphabet set by determining whether the position of the source pattern within a sequence of possible patterns is a position which correlates with a position within the sequence of possible patterns that includes the target pattern.
- a data processing apparatus to determine whether a target pattern is present within a source pattern composed of one or more characters of an alphabet, comprising: i. a first memory for storing a target pattern; ii. a second memory for storing a source pattern; iii. a processing means for determining the position of the source pattern in a sequence of possible patterns; iv. a processing means for determining the position of the target pattern in the sequence of possible patterns; and v. a processing means for correlating the source position with the target position.
- a data processing apparatus to determine whether a target sequence is present within a source sequence composed of one or more characters of an alphabet, comprising: i. a first memory for storing a target sequence position wherein the target sequence position is the position of the target sequence within a lexicon of all possible combinations of characters of the alphabet; ii. a second memory for storing a source sequence position wherein the source sequence position is the position of the source sequence within the lexicon; iii. a processing means for computing a set of positions of sub-sequences of the source sequence wherein the position of the sub-sequence is the position of the sub-sequence within the lexicon and wherein the sub-sequence includes the first character position of the source sequence; iv. a processing means for determining a series defining all positions of sequences within the lexicon which contain the target sequence; v. a processing means for correlating the set with the series.
- Figure 1 is a chart of positions of source pattern strings of length 4, containing the target pattern '1', where the alphabet set is {0,1};
- Figure 2 is a chart of source pattern strings of length 4, containing the target pattern 'b', where the alphabet set is {a,b,c,d,e};
- Figure 3 is a chart of position numbers of source pattern strings of length 4, containing the target pattern 'fo', where the alphabet set is {a,b,c,d,e};
- Figure 4 is a chart as in figure 3, where the position numbers are set out to scale.
- the source pattern string could contain zero, one or several matches of the target pattern string.
- Figure 1 shows a simple scenario for the purpose of illustration.
- the example uses the alphabet set {0,1} and a source data length, L_t, of 4.
- the possible source pattern strings are then all binary numbers from 0000 to 1111.
- the positions of the source pattern strings in a numerically ordered set are also shown in the figure.
- the position numbers of source pattern strings satisfying the pattern match conditions are shown for the four possible target pattern positions within the source pattern string. For example, in row 4 the target pattern '1' occupies the first position in the source pattern string. This condition is satisfied for the binary numbers 1000 to 1111, which have position numbers from 9 to 16.
- the position numbers of source pattern strings where the target pattern occurs form one or several groups of successive numbers with breaks between groups.
- the number of elements in each group depends on the row, and hence the location of the target pattern within the source pattern string.
- in row 1, each group contains a single element, with a gap of 1 between each group.
- in row 3, each group contains 4 elements, with a gap of 4 between each group.
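The chart described for Figure 1 can be reproduced by brute force. The sketch below enumerates every string over the alphabet in order, numbers them from 1, and lists the position numbers that match for each row. The function name `match_chart` and the convention that row 1 places the target at the right-hand end are assumptions chosen to agree with the row 4 example in the text.

```python
from itertools import product


def match_chart(alphabet, length, target):
    """Map each row to the 1-based position numbers of matching strings.

    Row r means the last character of the target sits r places from the
    right-hand end of the source string (row 1 = rightmost placement).
    """
    strings = ["".join(p) for p in product(alphabet, repeat=length)]
    rows = {}
    for row in range(1, length - len(target) + 2):
        end = length - row + 1            # exclusive end index of the target
        begin = end - len(target)
        rows[row] = [pos for pos, s in enumerate(strings, start=1)
                     if s[begin:end] == target]
    return rows


chart = match_chart("01", 4, "1")
for row in sorted(chart, reverse=True):
    print(row, chart[row])
```

For the alphabet {0,1}, length 4 and target '1', row 4 yields positions 9 to 16 and row 3 yields two groups of four (5 to 8 and 13 to 16), in line with the description above.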
- the elements of row 1 form an arithmetic series, where the starting element, b_0, is equal to the position of the target pattern, P_p, and the difference is the number of elements in the alphabet set, N, raised to the power of the length of the target pattern (i.e. the number of elements in the target pattern), L_p: $b_k = P_p + k N^{L_p}$, for $k = 0, 1, 2, \ldots$
- each row then has a group leader series.
- the group leader series of any row is related to the group leader series of an adjacent row by a factor of N, the number of elements in the alphabet set.
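The row 1 series and the factor-of-N relation between adjacent rows can be checked numerically. This is a sketch under stated assumptions: positions are 1-based, row 1 places the target at the right-hand end, and the closed form used in `group_leaders` is a reconstruction that agrees with the Figure 1 example rather than a formula quoted from the source. Both function names are hypothetical.

```python
def row1_series(target_pos, target_len, n, source_len):
    """Positions b_k = P_p + k * N**L_p for row 1 (target at the right end)."""
    step = n ** target_len
    count = n ** (source_len - target_len)    # one term per possible prefix
    return [target_pos + k * step for k in range(count)]


def group_leaders(row, target_pos, target_len, n, source_len):
    """Assumed closed form: leader = (b - 1) * N**(row - 1) + 1."""
    prefix_len = source_len - target_len - (row - 1)
    base = row1_series(target_pos, target_len, n, target_len + prefix_len)
    return [(b - 1) * n ** (row - 1) + 1 for b in base]


# Alphabet {0,1}, source length 4, target '1' (position 2 among length-1 strings).
print(row1_series(2, 1, 2, 4))       # [2, 4, 6, 8, 10, 12, 14, 16]
print(group_leaders(2, 2, 1, 2, 4))  # [3, 7, 11, 15]
print(group_leaders(3, 2, 1, 2, 4))  # [5, 13]
```

Counting positions from zero, the row 2 leaders (2, 6, 10, 14) are N = 2 times the first four row 1 leaders (1, 3, 5, 7), which is one way to read the factor-of-N relation.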
- the algorithm searches for pattern matches at each possible target pattern position within the source pattern string, that is one row at a time.
- the row currently being searched is tracked by an iteration variable, L_iter, which is incremented or decremented by 1 to move one row at a time.
- this algorithm searches for all pattern matches in the source data. However, a similar algorithm could easily search for a single pattern match, ending immediately after a pattern match is found. In this case, empirical knowledge of the strings involved could significantly speed up the algorithm.
- the sequence of searching may depend upon a characteristic of the source pattern string. If it were known a priori that the target pattern was likely to be contained towards the end of the source pattern string, the search would start with the assumption that the target pattern was contained in row 1 and move progressively one row at a time towards row (L_D + 1). On the other hand, if the target pattern was likely to be contained towards the beginning of the source pattern string, the search would move progressively from row (L_D + 1) to row 1.
- Another similar algorithm could search for a pattern match at a particular position in the source pattern string.
- the sequence of searching may be based upon an analysis of the source pattern string and occur according to the estimated probability for each row. It will be appreciated that a range of search sequences may be employed depending upon characteristics of the source data.
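The row-by-row search, with a configurable search order and an optional early exit, might be sketched as follows. The membership test is an assumption: it checks whether the source position falls inside a group for a given row by integer division and a remainder, which is equivalent to the group structure described above but is not quoted from the source. The names `search_rows`, `from_end` and `first_only` are hypothetical.

```python
def search_rows(source_pos, source_len, target_pos, target_len, n,
                from_end=True, first_only=False):
    """Return the rows in which the target occurs, testing one row at a time.

    Positions are 1-based ranks in the ordered list of all strings of the
    given length. Row 1 places the target at the right-hand end.
    """
    last_row = source_len - target_len + 1
    rows = range(1, last_row + 1) if from_end else range(last_row, 0, -1)
    modulus = n ** target_len
    found = []
    for row in rows:
        # Drop the (row - 1) digits to the right of the candidate placement,
        # then compare the trailing target_len digits with the target.
        window = ((source_pos - 1) // n ** (row - 1)) % modulus
        if window == target_pos - 1:
            found.append(row)
            if first_only:
                break
    return found


# Source '1010' has position 11 among 4-bit strings; target '1' has position 2.
print(search_rows(11, 4, 2, 1, 2))                                   # [2, 4]
print(search_rows(11, 4, 2, 1, 2, from_end=False, first_only=True))  # [4]
```

Passing an arbitrary iterable of rows instead of the two fixed orders would cover the probability-ordered search mentioned above.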
- the step of calculating the group leader position can be achieved as follows.
- the position, P_j, of the lowest member of each possible group can be represented by:
- where N is the number of elements in the alphabet and r is the row number.
- P_p is the position of the target pattern within an ordered sequence of possible target patterns and L_p is the number of elements in the target pattern.
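The equation that belongs after "represented by:" is not preserved in this text. One closed form consistent with the definitions of N, r, P_p and L_p given here, and with the Figure 1 example, is sketched below. It is an assumed reconstruction, not the source's own equation. The index j and the symbol b_j (the j-th term of the row 1 series) are introduced for illustration, and positions are taken to be 1-based with row 1 at the right-hand end.

```latex
% Assumed reconstruction, not the source's own equation.
% b_j : j-th term of the row-1 series, b_j = P_p + j N^{L_p}
% r   : row number (row 1 = target at the right-hand end)
P_j = (b_j - 1)\,N^{\,r-1} + 1
    = \bigl(P_p - 1 + j\,N^{L_p}\bigr)\,N^{\,r-1} + 1,
\qquad j = 0, 1, \dots, N^{\,L_t - L_p - r + 1} - 1 .
```

As a check against Figure 1 (N = 2, P_p = 2, L_p = 1, L_t = 4), row 3 gives P_0 = 5 and P_1 = 13, each leading a group of N^{r-1} = 4 consecutive positions.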
- N - Number of elements in the alphabet.
- T - A source pattern string.
- P - A target pattern string.
- L_t - Length of T.
- R_h - High value of the range of ac output for a particular string.
- a string is represented by an interval on a number line.
- the size of the interval is determined by the probabilities of the symbols of the alphabet. In the equiprobable case, it is a simple matter to calculate the position number of a string:
- where R_h is the highest value of the interval and R_l is the lowest value of the interval.
- the position number can still be determined, although this is not as straightforward as in the equiprobable case.
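For the equiprobable case, the link between a string's interval and its position number can be illustrated with exact fractions. This is a sketch under assumptions: the interval is the ideal half-open range [R_l, R_h) on [0, 1) with no finite-precision effects, and the position is taken as R_h multiplied by N to the power L_t (equivalently R_l times the same factor, plus 1). The source's own equation is not preserved here, and both function names are hypothetical.

```python
from fractions import Fraction


def equiprobable_interval(text, alphabet):
    """Interval [low, high) assigned to text when all symbols are equally likely."""
    n = len(alphabet)
    low, width = Fraction(0), Fraction(1)
    for ch in text:
        width /= n                          # each symbol narrows the range N-fold
        low += alphabet.index(ch) * width
    return low, low + width


def position_number(text, alphabet):
    """1-based rank of text among all strings of the same length."""
    low, high = equiprobable_interval(text, alphabet)
    return int(high * len(alphabet) ** len(text))


print(equiprobable_interval("1000", "01"))  # (Fraction(1, 2), Fraction(9, 16))
print(position_number("1000", "01"))        # 9
```

With unequal symbol probabilities the intervals have different widths, so a rank can no longer be read off by simple scaling, which matches the remark that the non-equiprobable case is less straightforward.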
- the source sequence can be represented as a position in a list of all possible combinations of the elements of the alphabet with the length of the source sequence - the source lexicon.
- the source sequence itself is in fact that position when represented in base n (n is the number of elements in the alphabet) where the elements of the alphabet represent digits in base n.
- a set of the positions of all sub-sequences of the source sequence which include the first character position of the source sequence, within a lexicon of all possible combinations of the elements of the alphabet with the length of the sub-sequence, can be computed.
- the set will contain the following elements: the first digit of the source sequence in base n, the first digit and the second digit, the first digit and the second and the third, and so on for the length of the source sequence position in base n.
- "words" containing the target sequence are n to the power of the length of the target sequence apart. Using the position of the first "word” containing the target sequence, a series of positions of "words" containing the target sequence can therefore be defined.
- the position of the element in the set which also matches the target series describes the position in the source sequence of the rightmost element of the target sequence.
- the set of source sub-sequence positions may be matched against the series of possible target positions in O(n) by using the remainder equation.
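The lexicon-based matching described above can be sketched as follows. Every prefix of the source is read as a base-n number, and a prefix is reported as a match when its remainder modulo n to the power of the target length equals the target's value. This is an illustration under assumptions: values are zero-based (the 1-based lexicon position is the value plus 1), the remainder test is one plausible reading of "the remainder equation", the target is assumed non-empty, and the names `prefix_positions` and `match_ends` are hypothetical.

```python
def prefix_positions(source, alphabet):
    """Zero-based value of every prefix of source, read as a base-n number."""
    n = len(alphabet)
    values, value = [], 0
    for ch in source:
        value = value * n + alphabet.index(ch)
        values.append(value)
    return values


def match_ends(source, target, alphabet):
    """1-based source positions of the rightmost element of each match."""
    n, m = len(alphabet), len(target)
    target_value = prefix_positions(target, alphabet)[-1]
    modulus = n ** m
    # Prefixes shorter than the target are skipped so that leading
    # zero-digits in the target cannot produce a false match.
    return [i for i, v in enumerate(prefix_positions(source, alphabet), start=1)
            if i >= m and v % modulus == target_value]


print(match_ends("abcb", "b", "abcde"))   # [2, 4]
print(match_ends("abcb", "bc", "abcde"))  # [3]
```

The loop performs one remainder per prefix, so the number of arithmetic steps grows linearly with the source length, although the cost of each step depends on how large the prefix values become.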
- a target sequence [X1...Xm] is present within a source sequence [Y1...Yn] where [X1...
- the method can be implemented using a range of standard data processing devices or specialised graphics audio processing or sequencing devices.
- the method may be deployed on a personal computer operating under the MICROSOFT WINDOWS™ environment or on a server machine operating under the UNIX operating system.
- the method may be implemented in software executing on a device or implemented in specialised hardware.
- the present invention represents the first method for pattern matching of arithmetically compressed source data without decompression of the data.
- the invention can also operate using a compressed target pattern string.
- the invention should be of immense utility at arithmetic decoder stations, where the decoder could typically check for partial or full matches with any prior strings of data.
- the invention can be used to find all pattern matches, one pattern match or a pattern match at a particular position in the source pattern string.
- the invention can also be executed in parallel, as a search on a particular interval can be performed independently of searches on other intervals. That is, the invention can be applied to more than one source string at the same time.
- the invention could also operate in a decentralised system, for example a search engine accessible over a communications network.
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP04720131A EP1602041A1 (en) | 2003-03-13 | 2004-03-12 | A method and system for pattern matching |
JP2006507626A JP2006522401A (en) | 2003-03-13 | 2004-03-12 | Pattern matching method and system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/386,462 | 2003-03-13 | ||
US10/386,462 US7840072B2 (en) | 2003-03-13 | 2003-03-13 | Method and system for pattern matching |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2004081819A1 (en) | 2004-09-23 |
Family
ID=32987317
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IN2004/000059 WO2004081819A1 (en) | 2003-03-13 | 2004-03-12 | A method and system for pattern matching |
Country Status (5)
Country | Link |
---|---|
US (1) | US7840072B2 (en) |
EP (1) | EP1602041A1 (en) |
JP (1) | JP2006522401A (en) |
CN (1) | CN1759394A (en) |
WO (1) | WO2004081819A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007116549A (en) * | 2005-10-21 | 2007-05-10 | Mitsubishi Electric Corp | Network relay unit |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW530495B (en) * | 2001-05-16 | 2003-05-01 | Cyberlink Corp | System and method for processing compressed data stream |
US7810155B1 (en) * | 2005-03-30 | 2010-10-05 | Symantec Corporation | Performance enhancement for signature based pattern matching |
US8930431B2 (en) | 2010-12-15 | 2015-01-06 | International Business Machines Corporation | Parallel computation of a remainder by division of a sequence of bytes |
US20150302050A1 (en) * | 2012-05-24 | 2015-10-22 | Iqser Ip Ag | Generation of requests to a data processing system |
CN103873317B (en) * | 2012-12-18 | 2017-04-12 | 中国科学院空间科学与应用研究中心 | Method and system for detecting CCSDS (consultative committee for space data system) space link protocol |
CN104252469B (en) | 2013-06-27 | 2017-10-20 | 国际商业机器公司 | Method, equipment and circuit for pattern match |
WO2015132914A1 (en) * | 2014-03-05 | 2015-09-11 | 三菱電機株式会社 | Data compression apparatus and data compression method |
KR101595189B1 (en) * | 2014-11-14 | 2016-02-19 | 인하대학교 산학협력단 | A method of pattern matching on compressed texts based on boyer-moore-horspool algorithm |
US9959299B2 (en) | 2014-12-02 | 2018-05-01 | International Business Machines Corporation | Compression-aware partial sort of streaming columnar data |
CN105893337B (en) * | 2015-01-04 | 2020-07-10 | 伊姆西Ip控股有限责任公司 | Method and apparatus for text compression and decompression |
US10909078B2 (en) * | 2015-02-25 | 2021-02-02 | International Business Machines Corporation | Query predicate evaluation and computation for hierarchically compressed data |
US9941004B2 (en) | 2015-12-30 | 2018-04-10 | International Business Machines Corporation | Integrated arming switch and arming switch activation layer for secure memory |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4122440A (en) | 1977-03-04 | 1978-10-24 | International Business Machines Corporation | Method and means for arithmetic string coding |
US5936559A (en) * | 1997-06-09 | 1999-08-10 | At&T Corporation | Method for optimizing data compression and throughput |
- 2003-03-13: US application US10/386,462, published as US7840072B2, not active (Expired - Fee Related)
- 2004-03-12: CN application CNA2004800065532, published as CN1759394A, active (Pending)
- 2004-03-12: JP application JP2006507626A, published as JP2006522401A, not active (Withdrawn)
- 2004-03-12: EP application EP04720131A, published as EP1602041A1, not active (Withdrawn)
- 2004-03-12: WO application PCT/IN2004/000059, published as WO2004081819A1, active (Application Filing)
Non-Patent Citations (3)
Title |
---|
BELL ET AL: "Pattern Matching in Compressed Text and Images", TECHNICAL REPORTS, TR-COSC, no. 07/01, 29 May 2001 (2001-05-29), UNIVERSITY OF CANTERBURY, XP002292206, Retrieved from the Internet <URL:HTTP://WWW.COSC.CANTERBURY.AC.NZ/RESEARCH/REPORTS/TECHREPS/2001/TR_0107.PDF> [retrieved on 20040810] * |
LANGDON G: "An introduction to arithmetic coding", IBM JOURNAL OF RESEARCH AND DEVELOPMENT, IBM CORPORATION, ARMONK, US, vol. 28, no. 2, 1 March 1984 (1984-03-01), pages 135 - 149, XP002086019, ISSN: 0018-8646 * |
WITTEN I H ET AL: "ARITHMETIC CODING FOR DATA COMPRESSION", COMMUNICATIONS OF THE ASSOCIATION FOR COMPUTING MACHINERY, ASSOCIATION FOR COMPUTING MACHINERY. NEW YORK, US, vol. 30, no. 6, 1 June 1987 (1987-06-01), pages 520 - 540, XP000615171, ISSN: 0001-0782 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007116549A (en) * | 2005-10-21 | 2007-05-10 | Mitsubishi Electric Corp | Network relay unit |
JP4627243B2 (en) * | 2005-10-21 | 2011-02-09 | 三菱電機株式会社 | Network relay device |
Also Published As
Publication number | Publication date |
---|---|
JP2006522401A (en) | 2006-09-28 |
CN1759394A (en) | 2006-04-12 |
US7840072B2 (en) | 2010-11-23 |
EP1602041A1 (en) | 2005-12-07 |
US20040199931A1 (en) | 2004-10-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109635273B (en) | Text keyword extraction method, device, equipment and storage medium | |
EP0510634B1 (en) | Data base retrieval system | |
US7454431B2 (en) | Method and apparatus for window matching in delta compressors | |
Gueniche et al. | Compact prediction tree: A lossless model for accurate sequence prediction | |
KR100414236B1 (en) | A search system and method for retrieval of data | |
US8908978B2 (en) | Signature representation of data having high dimensionality | |
WO2004081819A1 (en) | A method and system for pattern matching | |
Gawrychowski | Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic | |
US6542644B1 (en) | Statistical data compression/decompression method | |
CN109299235B (en) | Knowledge base searching method, device and computer readable storage medium | |
US9069634B2 (en) | Signature representation of data with aliasing across synonyms | |
Amir et al. | Repetition detection in a dynamic string | |
Boffa et al. | A learned approach to design compressed rank/select data structures | |
Boffa et al. | A “Learned” Approach to Quicken and Compress Rank/Select Dictionaries∗ | |
Puglisi et al. | Data compression and learning in time sequences analysis | |
CN114567332A (en) | Text secondary compression method, device and equipment and computer readable storage medium | |
CN110674635B (en) | Method and device for dividing text paragraphs | |
Bell et al. | Searching BWT compressed text with the Boyer-Moore algorithm and binary search | |
JP3545007B2 (en) | Database search system | |
CN111538803A (en) | Method, device, equipment and medium for acquiring candidate question text to be matched | |
Ehlers et al. | k-Abelian pattern matching | |
Lysyak et al. | Time series prediction based on data compression methods | |
CN114238564A (en) | Information retrieval method and device, electronic equipment and storage medium | |
JP2993540B2 (en) | Ascending integer sequence data compression and decoding system | |
Gagie et al. | Space-efficient conversions from SLPs |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1. Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1. Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
DPEN | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101) | |
WWE | Wipo information: entry into national phase | Ref document number: 2004720131. Country of ref document: EP |
WWE | Wipo information: entry into national phase | Ref document number: 1727/KOLNP/2005. Country of ref document: IN |
WWE | Wipo information: entry into national phase | Ref document number: 20048065532. Country of ref document: CN. Ref document number: 2006507626. Country of ref document: JP |
WWP | Wipo information: published in national office | Ref document number: 2004720131. Country of ref document: EP |