CN102750379B - Fast character string matching method based on filtering type - Google Patents

Publication number
CN102750379B
Authority
CN
China
Prior art keywords
string
pattern
text string
index
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210211829.7A
Other languages
Chinese (zh)
Other versions
CN102750379A (en)
Inventor
李拥军
邹少聪
林浩
黄格仕
谢豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201210211829.7A
Publication of CN102750379A
Application granted
Publication of CN102750379B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a fast string matching method based on filtering. The pattern string is first preprocessed: its prefix P_0 is cut into (k+s) pattern string blocks of length h, each block is extended by (k+q-1) characters, and the extended blocks are denoted Q_1, Q_2, ..., Q_{k+s}. Then, starting from the beginning of the text string, q characters are read at every interval of h characters and used as text string indexes, denoted d_1, d_2, ..., d_{n/h}. A matching array B[d, j] is created: if a text string index d_i belongs to Q_j, then B[d_i, j] = 1. The number of matches between (k+s) consecutive text string indexes and the pattern string blocks is counted, and finally approximate matches are verified. The method combines a dynamic programming algorithm with a filter algorithm and adds a new filtering strategy, which shortens the average time of approximate matching and improves matching performance.

Description

A fast string matching method based on filtering
Technical field
The present invention relates to the technical field of string matching, in particular to a fast string matching method based on filtering, and belongs to the fields of information retrieval and computational biology.
Background art
The string matching problem can be defined as finding, in a given symbol sequence, a pattern with certain properties; the simplest example is finding a given string within another given string. "Approximate matching" generally means that some small differences are allowed between the pattern and the text string. In the following, "matching" refers to approximate matching. String matching is one of the oldest and most widely studied problems in computer science, and its applications are everywhere. In recent years, academic interest in string matching has grown steadily, particularly in the rapidly developing fields of information retrieval and computational biology, and approximate matching is applied ever more widely: for example, correcting spelling errors in information retrieval, aligning sequences in computational biology, and correcting transmission errors in signal processing. This is not only because the texts to be processed keep growing, but also because these two fields require increasingly complex searches within text.
Current research on approximate matching is mainly based on distance models, which require a distance function that measures the similarity of two strings. In approximate matching, a pattern string and a corresponding threshold k are usually given, where k specifies the maximum distance allowed between the pattern string and a matching result. Approximate matching is defined as follows: under the edit distance model, given an integer k and two strings Pattern[1...m] and Text[1...n], find all substrings p of the text string Text such that the edit distance edit(Pattern, p) ≤ k. In practice, many problems use the edit distance model or a variant of it.
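The definition above can be made concrete with a short sketch. The Python function below is not taken from the patent; it is the textbook dynamic programming search (commonly attributed to Sellers) that reports every end position in the text at which some substring lies within edit distance k of the pattern. The name `approx_find` and the use of 1-based end positions are illustrative assumptions.

```python
def approx_find(pattern, text, k):
    """Return the 1-based end positions j such that some substring of
    `text` ending at j has edit distance <= k to `pattern`."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]   # value of row i-1 in the previous column
        col[0] = 0           # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]     # value of row i in the previous column
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # substitution or match
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            hits.append(j)
    return hits


print(approx_find("abc", "xxabcxx", 0))  # [5]
print(approx_find("abd", "xxabcxx", 1))  # [4, 5]
```

This search costs O(mn) time, which is the baseline that the filtering approach described in this document tries to avoid paying over the whole text.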
Existing approximate matching methods fall into four classes. The first class is based on dynamic programming. The second class is based on searching with a nondeterministic finite automaton (NFA). The third class is based on bit-parallel techniques and is currently the most successfully applied. The last class is based on filtering: a fast filter algorithm skips the large regions of the text that cannot contain a successful match, and a non-filtering algorithm then verifies whether the remaining regions contain one. Methods of this class are also among the fastest.
The algorithm proposed here uses the idea of filtering to improve efficiency.
Summary of the invention
The object of the present invention is to provide a method for fast approximate pattern matching that, in particular, overcomes the limitation of the machine word length and is suitable for the approximate matching of long pattern strings.
Given a text string and a pattern string, the fast string matching method based on filtering of the present invention finds the substrings of the text string that are similar to the pattern string within a given threshold. The invention uses filtering to discard those regions of the text string that cannot produce a successful match, thereby improving the efficiency of the algorithm in the average case.
To achieve the above object, the present invention adopts the following technical solution:
A fast string matching method based on filtering comprises the following steps:
(1) Preprocess the pattern string: let P_0 be the prefix of the pattern string of length (k+s)h. Cut P_0 into k+s pattern string sub-blocks of length h, and extend each sub-block by k+q-1 characters, so that two consecutive pattern string sub-blocks overlap by k+q-1 characters. The k+s extended pattern string sub-blocks are denoted Q_1, Q_2, ..., Q_{k+s}; where
[Formula image (Figure BDA00001803025800021) defining h; not reproduced in this text]
m is the number of characters in the pattern string; k is the minimum number of operations required to transform the pattern string into some substring of the text string, referred to as the edit distance, 0 ≤ k < m; s is the minimum number of text string indexes (explained in step 2) that must exactly match a factor of a pattern string sub-block during matching, 1 ≤ s < m; q is the length of a text string index, q ≤ h;
(2) Create the text string indexes: starting from the beginning of the text string, read q characters at every interval of h characters as a text string index; the text string indexes are denoted d_1, d_2, ..., d_{n/h}, where n is the length of the text string and q ≤ h;
(3) Create the matching array B[d, j]: if a text string index d_i belongs to Q_j, then B[d_i, j] = 1; otherwise B[d_i, j] = 0. For each text string index, create an array M of length m' that records the number of matching text string indexes; the array corresponding to d_i is denoted M_{d_i}[1...m'] and is initialized to 0, where m' = k+s;
(4) Count the matches between k+s consecutive text string indexes and the pattern string sub-blocks: match the text string indexes against the pattern string sub-blocks. While recording the matches, if M_{d_i}[j] ≤ j-k occurs (for j > k), stop building this array and continue with the array M_{d_{i+1}} of the next index. If the element M_{d_i}[m'] ≥ s, an approximate match must be checked for, and the method proceeds to the next step;
(5) Detect approximate matches: if an approximate match exists, the region to be checked is the part of the text string from position j-(k+s)h-2k-q+2 to position j+m-(k+s-1)h+k-q, where j here denotes the end position in the text string of the last of the k+s consecutive indexes of the match in question. This region is checked with a dynamic programming algorithm; if an approximate match is detected, its position of occurrence is reported, otherwise it is reported that the region contains no approximate match.
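Steps (1) to (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. Three points are assumptions: the formula for h (its image is missing from this text; the value used here follows the Sutinen and Tarhio sampling scheme and reproduces h = 4 in both embodiments), the sample offsets 0, h, 2h, ... for the text string indexes, and the strict form `count < j - k` of the early exit. The text states the condition with ≤; the strict form is used here because it never discards a run that can still reach s matches. All function names are hypothetical.

```python
def pattern_blocks(pattern, k, s, q):
    """Step 1: cut the prefix P_0 into k+s blocks of length h and
    extend each block by k+q-1 characters."""
    m = len(pattern)
    # ASSUMPTION: h = floor((m - k - q + 1) / (k + s)); the formula
    # image is not available in the text.
    h = (m - k - q + 1) // (k + s)
    if h < q:
        raise ValueError("parameters violate q <= h")
    blocks = [pattern[j * h:(j + 1) * h + k + q - 1] for j in range(k + s)]
    return h, blocks


def text_indexes(text, h, q):
    """Step 2: read q characters every h characters.
    ASSUMPTION: samples start at offsets 0, h, 2h, ..."""
    return [text[p:p + q] for p in range(0, len(text) - q + 1, h)]


def candidate_runs(pattern, text, k, s, q):
    """Steps 3-4: return the 0-based number of the first index of every
    run of k+s consecutive indexes with at least s block matches."""
    h, blocks = pattern_blocks(pattern, k, s, q)
    samples = text_indexes(text, h, q)
    width = k + s
    survivors = []
    for i in range(len(samples) - width + 1):
        count = 0
        for j in range(1, width + 1):
            # "d belongs to Q_j": the sample occurs as a factor of block j.
            if samples[i + j - 1] in blocks[j - 1]:
                count += 1
            # ASSUMPTION (strict variant of the early exit): stop once
            # the remaining blocks can no longer bring the count up to s.
            if count < j - k:
                break
        else:
            if count >= s:
                survivors.append(i)
    return survivors


pattern = "abcdefghijklmnop"
text = "x" * 8 + pattern + "x" * 8
print(candidate_runs(pattern, text, k=2, s=1, q=3))  # [2, 3]
```

Each surviving run only marks a candidate region; step (5) still has to verify it with dynamic programming.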
A string is a sequence of characters defined over a finite alphabet. The text string is a string, and the pattern string is also a string. The string matching problem is to search a large text string for the positions at which a given pattern string occurs. Given strings x, y and z, x is called a prefix of xy, a suffix of yx, and a factor of yxz.
Principle of the invention: if the text string is divided into k+s pieces, then for any successful approximate match at least s of the pieces are matched exactly in the text string, since k edit operations can alter at most k pieces. Using this idea, the invention builds a q-gram index and counts the number of exactly matching substrings in the text string, in order to decide whether the region covered by a run of q-gram indexes may contain an approximate match. A q-gram index means that, when the index is built, the text Text is cut with a moving window of length q that advances by a fixed sliding distance (h in this method), yielding substrings of length q. To count the matches between q-grams and the pattern string, the matching array B[d, j] is introduced: if a q-gram d belongs to Q_j, then B[d, j] = 1; otherwise B[d, j] = 0. Thus, for each q-gram d, B[d, *] gives the matching information of d against every block.
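One way to realize the matching array B is a lookup table keyed by q-gram. The sketch below is an illustrative assumption about the data structure (the text only defines the values of B, not how it is stored): it maps every q-gram that occurs in some block to the set of 1-based block numbers containing it, so that B[d, j] = 1 exactly when j is in the set stored for d. The name `build_match_table` is hypothetical.

```python
from collections import defaultdict


def build_match_table(blocks, q):
    """Map each q-gram to the set of 1-based block numbers j such that
    the q-gram is a factor of Q_j, i.e. B[d, j] = 1."""
    table = defaultdict(set)
    for j, block in enumerate(blocks, start=1):
        for p in range(len(block) - q + 1):
            table[block[p:p + q]].add(j)
    return dict(table)


blocks = ["abcdefgh", "efghijkl", "ijklmnop"]
B = build_match_table(blocks, q=3)
print(sorted(B["efg"]))        # [1, 2]
print(B.get("xyz", set()))     # set()
```

With such a table, checking whether an index d belongs to Q_j is a single dictionary lookup instead of a substring search in the block.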
Once a region has been found that may contain an approximate match, it must be verified. Widely used verification algorithms are based on bit-parallel techniques, such as the BPM algorithm, but these are suited to approximate match verification only when the pattern string is relatively short. The verification algorithm chosen by the invention is a DP algorithm based on the dynamic programming matrix. Its idea is as follows: a pattern string normally matches successfully at few places in the text, so in the edit distance matrix the values in each column, computed from top to bottom, quickly reach k+1 (no match). If the value of a matrix element is greater than k+1, the search result cannot depend on that element. A matrix element whose value is less than or equal to k is called active. The method computes each column only down to its last active element; the elements below it need not be computed.
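The cut-off idea described above can be sketched as follows. This is a minimal Python illustration of the well-known cut-off heuristic for the edit distance column (usually attributed to Ukkonen), written under the assumption that this is the DP variant the paragraph refers to; the function name and the 1-based end positions are illustrative.

```python
def cutoff_search(pattern, text, k):
    """Approximate search that computes each DP column only down to
    one row past the last active cell (a cell with value <= k)."""
    m = len(pattern)
    col = list(range(m + 1))   # column for the empty text prefix
    last = min(k, m)           # last active row of that column
    hits = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0                         # a match may start anywhere
        top = min(last + 1, m)             # the last active row grows by at most 1
        for i in range(1, top + 1):
            old = col[i]                   # previous column (may be a stale value > k)
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost, old + 1, col[i - 1] + 1)
            prev_diag = old
        # Rows below `top` are left untouched: their stored values stay > k.
        last = top
        while last > 0 and col[last] > k:
            last -= 1
        if last == m:                      # bottom cell is active: match ends at j
            hits.append(j)
    return hits


print(cutoff_search("abc", "xxabcxx", 0))  # [5]
print(cutoff_search("abd", "xxabcxx", 1))  # [4, 5]
```

When matches are rare, the last active row stays close to k, so each column costs roughly O(k) instead of O(m) operations on average.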
Compared with existing approximate matching algorithms, the present invention has the following notable advantages:
(1) Bit-parallel methods are used heavily in approximate search, and many good results have been obtained with them. However, bit-parallel algorithms are suited to short pattern strings, whereas in practice many searches are performed with long pattern strings. If the length of the pattern string exceeds the machine word length, matching performance becomes very poor. The present approximate search algorithm is well suited to the approximate matching of long pattern strings: as the ratio of the pattern string length m to the threshold k increases, its efficiency increases accordingly.
(2) Before the new filtering strategy is added, the running time of the approximate matching algorithm is dominated by computing the array M and verifying candidate regions with the DP algorithm. The added filtering strategy reduces the work of matching each q-gram index against each block of the pattern string, and thus greatly shortens the time needed to compute the array M. In the best case, the algorithm needs half the matching time of the original algorithm.
(3) A filter algorithm can improve efficiency only in the average case; its advantage is that it may not need to examine every character of the text. The efficiency of a filter algorithm depends on how much of the text it can filter out and is very sensitive to the error level. Most filter algorithms work well when the error level is low and poorly otherwise. The present filter algorithm still performs well when the error level is high. Experiments show that, at high error levels, it reduces the required matching time by half compared with similar algorithms.
Brief description of the drawings
Fig. 1 is a schematic diagram of building the index on the text string;
Fig. 2 is a schematic diagram of processing the pattern string.
Detailed description of the embodiments
For a better understanding of the invention, it is further described below with reference to embodiments; the scope of protection claimed is, however, not limited to the scope of the embodiments.
Embodiment 1: SMS classification
In recent years, China has accelerated the pace of social informatization and digitization, and people have more and more information to process every day. Application service providers notify customers by SMS in order to remind them or inform them promptly of relevant information, such as product recommendations from e-commerce sites, online banking feedback and various fee deduction notices, in addition to the users' own chat messages. As people deal with more and more services, these messages become numerous and difficult to manage. Users need a convenient, flexible and effective way to classify SMS messages, so that they no longer have to open the inbox and face all kinds of messages directly. By applying the fast string matching method based on filtering of the present invention, messages can be classified by template. When the user needs to define a new message category, the user only has to provide one or more messages of that type, and the system can then use approximate matching to classify messages received in the future automatically. The application steps are as follows:
(1) The user creates a new message class and provides one or several messages as samples of the class; these samples are the pattern strings of the class. The content of a new message is the target text string; if a pattern string of some class matches the text string, the text string belongs to that class.
(2) Suppose the length of the SMS sample is m=32 and the length of an SMS text string index is q=2; the user sets the similarity threshold k=4, and the minimum number of SMS indexes that must exactly match a factor of a pattern string sub-block during matching is s=2; where
[Formula image (Figure BDA00001803025800041) defining h; not reproduced in this text]
As shown in Fig. 2, preprocess the pattern string: let P_0 be the prefix of the pattern string of length (k+s)h=24. Cut P_0 into k+s=6 pattern string sub-blocks of length h=4, and extend each sub-block by k+q-1=5 characters, so that two consecutive pattern string sub-blocks overlap by k+q-1=5 characters. The k+s=6 extended pattern string sub-blocks are denoted Q_1, Q_2, ..., Q_6;
(3) As shown in Fig. 1, suppose the new message has length 128. Create the message indexes: starting from the beginning of the message, read q=2 characters at every interval of h=4 characters as a message index; the message indexes are denoted d_1, d_2, ..., d_32;
(4) Create the matching array B[d, j]: if a message index d_i belongs to Q_j, then B[d_i, j] = 1; otherwise B[d_i, j] = 0. For each message index, create an array M of length m'=6 that records the number of matching message indexes; the array corresponding to d_i is denoted M_{d_i}[6] and is initialized to 0;
(5) Count the matches between k+s=6 consecutive message indexes and the pattern string sub-blocks: match the message indexes against the pattern string sub-blocks. While recording the matches, if M_{d_i}[b] ≤ b-4 occurs (for b > 4), stop building this array and continue with the array M_{d_{i+1}} of the next index. If the element M_{d_i}[6] ≥ 2, an approximate match must be checked for, and the method proceeds to the next step;
(6) Detect approximate matches: if an approximate match exists, the region to be checked is the part of the message from position j-32 to position j+14, where j is the end position in the message of the last of the 6 consecutive indexes of the match in question. This region is checked with a dynamic programming algorithm; if an approximate match is detected, its position of occurrence is reported, otherwise it is reported that the region contains no approximate match. A message may belong to several classes; a successful match shows that part of the message content satisfies the template of the class.
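The numbers used in this embodiment can be cross-checked with a few lines of Python. The sketch below is only an arithmetic check. The expression for h is an assumption, since the formula image is missing from this text; it is chosen because it reproduces h = 4 here and follows the Sutinen and Tarhio sampling scheme. The function name and dictionary keys are hypothetical.

```python
def filter_parameters(m, k, s, q, n):
    """Derive the quantities used by the filter from m, k, s, q and
    the text length n."""
    # ASSUMPTION: h = floor((m - k - q + 1) / (k + s)).
    h = (m - k - q + 1) // (k + s)
    return {
        "h": h,
        "prefix_len": (k + s) * h,        # length of P_0
        "extension": k + q - 1,           # growth and overlap of each sub-block
        "indexes": n // h,                # number of text string indexes
        # The verification window is [j - window_left, j + window_right].
        "window_left": (k + s) * h + 2 * k + q - 2,
        "window_right": m - (k + s - 1) * h + k - q,
    }


params = filter_parameters(m=32, k=4, s=2, q=2, n=128)
print(params)
```

For m=32, k=4, s=2, q=2 and n=128 this yields h=4, a prefix of length 24, an extension of 5, 32 indexes, and the window j-32 to j+14, in agreement with steps (2) to (6).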
Embodiment 2: fuzzy search of diary documents
With the rapid development of computer technology and the quickening pace of life, people now prefer to do text work on a computer rather than keep diaries and complete work documents with pen and paper as before. There are three main reasons: 1) computer technology makes text work easier and faster; 2) a computer diary is more entertaining, for example through mood editing; 3) it meets people's need to record things at any time. Compared with writing a diary entry or a work report at the end of the day, recording on the spot better suits people's needs. With the development of mobile phone hardware, handwriting recognition and speech recognition, daily information can be recorded more conveniently and directly, and computer recognition technology converts this multimedia information into text for storage.
For such loosely organized, informally written documents, applying the fast approximate pattern matching method based on filtering of the present invention meets the user's need for fuzzy search of the diary. The application steps are as follows:
(1) The user specifies the diary document as the text string of this embodiment and enters a query statement as the pattern string of the present invention;
(2) Let the length of the query statement be m=16. The user sets the query similarity threshold k=2; the minimum number s of diary document indexes that must exactly match a factor of a query statement sub-block during matching is 1, and the length of a diary document index is q=3; then
[Formula image (Figure BDA00001803025800051) defining h; not reproduced in this text]
Preprocess the query statement: let P_0 be the prefix of the query statement of length (k+s)h=12. Cut P_0 into k+s=3 query statement sub-blocks of length 4, and extend each sub-block by k+q-1=4 characters, so that two consecutive query statement sub-blocks overlap by 4 characters. The 3 extended query statement sub-blocks are denoted Q_1, Q_2, Q_3;
(3) Create the diary document indexes: suppose the length of the diary is 1000. Starting from the beginning of the diary document, read 3 characters at every interval of 4 characters as a diary document index; the diary document indexes are denoted d_1, d_2, ..., d_250;
(4) Create the matching array B[d, j]: if a diary document index d_i belongs to Q_j, then B[d_i, j] = 1; otherwise B[d_i, j] = 0. For each diary document index, create an array M of length 3 that records the number of matching diary document indexes; the array corresponding to d_i is denoted M_{d_i}[3] and is initialized to 0;
(5) Count the matches between 3 consecutive diary document indexes and the query statement sub-blocks: match the diary document indexes against the query statement sub-blocks. While recording the matches, if M_{d_i}[b] ≤ b-2 occurs (for b > 2), stop building this array and continue with the array M_{d_{i+1}} of the next index. If the element M_{d_i}[m'] ≥ s = 1, an approximate match must be checked for, and the method proceeds to the next step;
(6) Detect approximate matches: if an approximate match exists, the region to be checked is the part of the diary document from position j-17 to position j+7, where j is the end position in the diary document of the last of the 3 consecutive indexes of the match in question. This region is checked with a dynamic programming algorithm; if an approximate match is detected, its position of occurrence is reported, otherwise it is reported that the region contains no approximate match. Any document containing a successful match somewhere is offered to the user as a candidate document for detailed inspection.
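Near the beginning or end of the document the window j-17 to j+7 would run past the text. The sketch below shows one way to cut the verification region out of the text. Clipping to the range 1..n is an assumption (the text does not say how boundary cases are handled), as are the function name and the treatment of j as a 1-based position.

```python
def verification_window(text, j, m, k, s, q, h):
    """Return (lo, hi, region) for the region to be verified, where
    lo and hi are 1-based inclusive text positions."""
    n = len(text)
    # Window bounds from step (5) of the method, clipped to the text
    # (ASSUMPTION: clipping is not described in the source).
    lo = max(1, j - (k + s) * h - 2 * k - q + 2)
    hi = min(n, j + m - (k + s - 1) * h + k - q)
    return lo, hi, text[lo - 1:hi]


diary = "a" * 1000
lo, hi, region = verification_window(diary, j=100, m=16, k=2, s=1, q=3, h=4)
print(lo, hi, len(region))  # 83 107 25
```

With the parameters of this embodiment the unclipped window always spans 25 characters, so the dynamic programming check touches only a small slice of the 1000-character diary for each candidate.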

Claims (3)

1. A fast string matching method based on filtering, characterized by comprising the following steps:
(1) preprocessing the pattern string: let P_0 be the prefix of the pattern string of length (k+s)h; P_0 is cut into k+s pattern string sub-blocks of length h, and each pattern string sub-block is extended by k+q-1 characters, so that two consecutive pattern string sub-blocks overlap by k+q-1 characters; the k+s extended pattern string sub-blocks are denoted Q_1, Q_2, ..., Q_{k+s}; where
[Formula image (Figure FDA0000493434690000011) defining h; not reproduced in this text]
m is the number of characters in the pattern string; k is the minimum number of operations required to transform the pattern string into some substring of the text string, referred to as the edit distance, 0 ≤ k < m; s is the minimum number of text string indexes that must exactly match a factor of a pattern string sub-block during matching, 1 ≤ s < m; q is the length of a text string index, q ≤ h;
(2) creating the text string indexes: starting from the beginning of the text string, q characters of the text string are read at every interval of h characters as a text string index, and the text string indexes are denoted d_1, d_2, ..., d_{n/h}; q ≤ h; n is the length of the text string;
(3) creating the matching array B[d, j]: if a text string index d_i belongs to Q_j, the matching array B[d_i, j] = 1; otherwise the matching array B[d_i, j] = 0; for each text string index, an array M of length m' is created to record the number of matching text string indexes; the array corresponding to d_i is denoted M_{d_i}[m'] and is initialized to 0; where m' = k+s;
(4) counting the matches between k+s consecutive text string indexes and the pattern string sub-blocks: the text string indexes are matched against the pattern string sub-blocks; while the matches are recorded, if M_{d_i}[j] ≤ j-k occurs, building of this array stops and building continues with the array M_{d_{i+1}} of the next index; if the element M_{d_i}[m'] ≥ s, an approximate match must be checked for, and the method proceeds to the next step;
(5) detecting approximate matches: if an approximate match exists, the region to be checked is the part of the text string from position w-(k+s)h-2k-q+2 to position w+m-(k+s-1)h+k-q, and this region is checked with a dynamic programming algorithm; if an approximate match is detected, its position of occurrence is reported, otherwise it is reported that the region contains no approximate match; w is the end position in the text string of the last of the k+s consecutive indexes of the match in question;
the text string is a string, and the pattern string is a string.
2. The fast string matching method based on filtering according to claim 1, characterized in that: the string is a sequence of characters defined over a finite alphabet.
3. The fast string matching method based on filtering according to claim 1, characterized in that: given strings x, y and z, x is called a prefix of xy, a suffix of yx, and a factor of yxz.
CN201210211829.7A 2012-06-25 2012-06-25 Fast character string matching method based on filtering type Expired - Fee Related CN102750379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210211829.7A CN102750379B (en) 2012-06-25 2012-06-25 Fast character string matching method based on filtering type

Publications (2)

Publication Number Publication Date
CN102750379A CN102750379A (en) 2012-10-24
CN102750379B true CN102750379B (en) 2014-07-02

Family

ID=47030564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210211829.7A Expired - Fee Related CN102750379B (en) 2012-06-25 2012-06-25 Fast character string matching method based on filtering type

Country Status (1)

Country Link
CN (1) CN102750379B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544139A (en) * 2012-07-13 2014-01-29 江苏新瑞峰信息科技有限公司 Forward word segmentation method and device based on Chinese retrieval
CN103577598B (en) * 2013-11-15 2017-02-15 曙光信息产业(北京)有限公司 Matching method and device for pattern string and text string
CN107451125B (en) * 2017-08-19 2021-05-18 洪志令 Method for performing rapid close semantic matching aiming at sequence-independent item groups
CN109062792A (en) * 2018-07-21 2018-12-21 东南大学 A kind of Open Source Code detection method based on String matching and characteristic matching
CN109241360B (en) * 2018-08-21 2021-08-20 创新先进技术有限公司 Matching method and device of combined character strings and electronic equipment
CN109271419B (en) * 2018-09-03 2021-06-22 河北工业大学 Online string matching method without gap constraint
CN111832554A (en) * 2019-04-15 2020-10-27 顺丰科技有限公司 Image detection method, device and storage medium
CN111159490B (en) * 2019-12-13 2023-05-26 杭州迪普科技股份有限公司 Method, device and equipment for processing pattern character strings
CN111814009B (en) * 2020-06-28 2022-03-01 四川长虹电器股份有限公司 Mode matching method based on search engine retrieval information
CN117494231A (en) * 2023-11-15 2024-02-02 山东农业大学 Distributed data management monitoring system based on big data

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"On using q-gram locations in approximate string matching";Erkki Sutinen et al.;《Lecture Notes in Computer Science》;19951231;第979卷;327-340 *
"一种快速近似模式匹配算法";李拥军等;《华南理工大学学报(自然科学版)》;20120630;第40卷(第6期);103-108 *
Erkki Sutinen et al.."On using q-gram locations in approximate string matching".《Lecture Notes in Computer Science》.1995,第979卷327-340.
孙德才等."基于匹配区域特征的相似字符串匹配过滤算法".《计算机研究与发展》.2010,第47卷(第4期),663-670.
李拥军等."一种快速近似模式匹配算法".《华南理工大学学报(自然科学版)》.2012,第40卷(第6期),103-108.

Also Published As

Publication number Publication date
CN102750379A (en) 2012-10-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140702

Termination date: 20210625
