CN102750379B - Fast character string matching method based on filtering type

Info

Publication number: CN102750379B
Application number: CN201210211829.7A
Authority: CN
Inventors: 李拥军; 邹少聪; 林浩; 黄格仕; 谢豪
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2012-06-25
Filing date: 2012-06-25
Publication date: 2014-07-02
Anticipated expiration: 2032-06-25
Also published as: CN102750379A

Abstract

The invention discloses a filter-based fast string matching method. The pattern string is first preprocessed: its prefix P0 is cut into (k+s) pattern string blocks of length h, each block is extended by (k+q-1) characters, and the extended blocks are denoted Q1, Q2, ..., Q(k+s). Then, starting from the beginning of the text string, q characters are read at every interval of h characters and used as text string indexes, denoted d1, d2, ..., dn/h. A matching array B[d, j] is created: if a text string index di belongs to Qj, then B[di, j] = 1. The number of matches between (k+s) consecutive text string indexes and the pattern string blocks is counted, and finally approximate matches are detected. The method combines a dynamic programming algorithm with a filter algorithm and adds a new filtering strategy, which shortens the average time of approximate matching and greatly improves matching performance.

Description

A filter-based fast string matching method

Technical field

The present invention relates to the technical field of string matching, and in particular to a filter-based fast string matching method; it belongs to the fields of information retrieval and computational biology.

Background technology

The string matching problem can be defined as finding a pattern with certain properties in a given symbol sequence; the simplest example is finding a given string within another given string. "Approximate matching" generally means that some slight differences are allowed between the pattern and the text string. In this document, "matching" generally refers to "approximate matching". String matching is one of the oldest and most widely studied problems in computer science, and its applications are found everywhere. In recent years, academic interest in string matching has grown steadily, particularly in the rapidly developing fields of information retrieval and computational biology. At the same time, approximate matching is applied more and more widely, for example in correcting misspellings in information retrieval, aligning sequences in computational biology, and correcting transmission errors in signal processing. This is not only because the texts to be processed keep growing in size, but also because these two research fields require increasingly complex searches within the text.

Current research on approximate matching is mainly based on distance models, which require a distance function that measures the similarity of two strings. In approximate matching, a pattern string and a corresponding threshold k are usually given, where k specifies the maximum distance allowed between the pattern string and a matching result. Approximate matching is defined as follows: under the edit distance model, given an integer k and two strings Pattern[1…m] and Text[1…n], find all substrings p of the text string Text such that the edit distance edit(Pattern, p) ≤ k. In practice, many problems use the edit distance model or a variant of it.
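The definition above can be illustrated with a minimal Python sketch. It is not part of the patent text: the function name `approximate_end_positions` and the choice to report 1-based end positions are assumptions made for illustration, and the column-by-column dynamic programming shown here is the classical unfiltered approach, not the filtering method described later.

```python
def approximate_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based positions j such that some substring of `text`
    ending at j has edit distance at most k from `pattern`."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of row i-1 in the previous column
        col[0] = 0          # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

With k = 0 the function reduces to exact matching; a larger k also reports the positions around an occurrence where a slightly shorter or longer substring is still within distance k.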

Existing approximate matching methods fall into four classes. The first class is based on dynamic programming. The second class is based on NFA (nondeterministic finite automaton) search. The third class is based on bit-parallel techniques and is currently the most successfully applied class. The last class is based on filtering: a fast filter algorithm skips the large regions of the text that cannot contain a successful match, and a non-filtering algorithm then verifies whether the remaining regions contain a successful match. These methods are also among the fastest algorithms.

The algorithm proposed here uses the idea of filtering, which greatly improves its efficiency.

Summary of the invention

The object of the present invention is to provide a fast approximate string matching method that, in particular, overcomes the limitation of machine word length and provides an algorithm suitable for approximate matching of long pattern strings.

Given a text string and a pattern string, the filter-based fast string matching method of the present invention finds substrings of the text string that have a certain similarity to the pattern string. The string matching problem is to search a large text string for the positions at which a pattern string occurs. The invention uses filtering to discard those regions of the text string that cannot produce a successful match, thereby improving the average-case efficiency of the algorithm.

To achieve the above object, the present invention adopts the following technical solution:

A filter-based fast string matching method comprises the following steps:

(1) Preprocess the pattern string: let P₀ denote the prefix of the pattern string of length (k+s)h. Cut P₀ into k+s pattern string sub-blocks of length h, and extend each sub-block by k+q-1 characters, so that two consecutive sub-blocks overlap by k+q-1 characters. The k+s extended sub-blocks are denoted Q₁, Q₂, ..., Q_{k+s}.

Here, m is the number of characters in the pattern string; k is the minimum number of operations needed to transform the pattern string into some substring of the text string, called the edit distance for short, 0 ≤ k < m; s is the minimum number of text string indexes (explained in step (2)) that exactly match a factor of a pattern string sub-block during matching, 1 ≤ s < m; q is the length of a text string index, q ≤ h;

(2) Create the text string indexes: starting from the beginning of the text string, read q characters of the text string at every interval of h characters as a text string index; the text string indexes are denoted d₁, d₂, ..., d_{n/h}, where q ≤ h and n is the length of the text string;

(3) Create the matching array B[d, j]: if a text string index d_i belongs to Q_j, then B[d_i, j] = 1; otherwise B[d_i, j] = 0. For each text string, create arrays M of length m′ to record the match counts of the text string indexes; the array corresponding to d_i is denoted M_{d_i}[1…m′] and is initialized to 0, where m′ = k+s;

(4) Count the matches between k+s consecutive text string indexes and the pattern string sub-blocks: match the text string indexes against the pattern string sub-blocks. While the matches are being recorded, if M_{d_i}[j] ≤ j-k occurs (with j > k), stop building this array and go on to build the array M_{d_{i+1}} of the next index. If the element M_{d_i}[m′] ≥ s, approximate matching must be checked further; continue with the next step;

(5) Detect approximate matches: if an approximate match exists, the region to be checked is the part of the text string from position j-(k+s)h-2k-q+2 to position j+m-(k+s-1)h+k-q, where j here denotes the end position in the text string of the last of the k+s consecutive indexes in a given match (written w in the claims). Check this region with a dynamic programming algorithm. If an approximate match is detected, report the position at which it occurs; otherwise report that the region contains no approximate match.
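Steps (1) to (4) can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the patented implementation. All function names are hypothetical; the sub-block length h is passed in as a parameter; the text indexes are assumed to start at offsets 0, h, 2h, ...; and M is read as a running count of matches over the blocks 1..j. The text states the early stop as M[j] ≤ j-k, while this sketch uses the strict form M[j] < j-k, an assumption chosen so that a window with exactly s matches still reaches the final test M[m′] ≥ s.

```python
def extended_blocks(pattern, k, s, q, h):
    """Step (1): cut the prefix of length (k+s)*h into blocks of length h
    and extend each block by k+q-1 characters."""
    extra = k + q - 1
    return [pattern[j * h:(j + 1) * h + extra] for j in range(k + s)]


def text_indexes(text, q, h):
    """Step (2): read q characters at every interval of h characters.
    Assumption: the samples start at offsets 0, h, 2h, ..."""
    return [text[i:i + q] for i in range(0, len(text) - q + 1, h)]


def matching_array(blocks, q):
    """Step (3): map each q-gram d to the set of 0-based block numbers j
    for which d is a factor of Q_j, i.e. B[d, j] = 1."""
    b = {}
    for j, block in enumerate(blocks):
        for p in range(len(block) - q + 1):
            b.setdefault(block[p:p + q], set()).add(j)
    return b


def candidate_windows(indexes, b, k, s):
    """Step (4): return the 0-based start of every run of k+s consecutive
    indexes that keeps at least s matches."""
    m_prime = k + s
    found = []
    for start in range(len(indexes) - m_prime + 1):
        count = 0          # running value of M[j]
        abandoned = False
        for j in range(1, m_prime + 1):
            if (j - 1) in b.get(indexes[start + j - 1], ()):
                count += 1
            # Assumed strict form of the early stop (the text writes <=).
            if j > k and count < j - k:
                abandoned = True
                break
        if not abandoned and count >= s:
            found.append(start)
    return found
```

A call sequence would be `blocks = extended_blocks(p, k, s, q, h)`, then `b = matching_array(blocks, q)`, then `candidate_windows(text_indexes(t, q, h), b, k, s)`. Each reported window still has to be verified by the dynamic programming check of step (5).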

A string is a sequence of characters defined over a finite alphabet. The text string is a string, and the pattern string is also a string. The string matching problem is to search a large text string for the positions at which a pattern string occurs. Given strings x, y and z, x is called a prefix of xy, a suffix of yx, and a factor of yxz.

Principle of the present invention: if the text string is divided into k+s pieces, then for any successful approximate match at least s of the pieces are matched exactly, because each of the at most k edit operations can affect at most one piece. Using this idea, the invention builds a q-gram index and counts the exactly matching substrings in the text string to decide whether the region covered by a run of q-gram indexes may contain an approximate match. A q-gram index is obtained by cutting the text Text, when the index is built, with a sliding window of length q that advances by a fixed sliding distance (the interval h of step (2)); each cut yields a substring of length q. To count the matches between the q-grams and the pattern string, a matching array B[d, j] is introduced: if a q-gram d belongs to Q_j, then B[d, j] = 1; otherwise B[d, j] = 0. Thus, for each q-gram d, B[d, *] gives the match information of d against every block.
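The counting argument behind this principle can be checked on a small example. The Python sketch below is only an illustration: the function name `intact_pieces` is hypothetical, and it cuts a string into equal pieces and counts how many of them still occur verbatim in an edited copy, without reproducing the index structure of the invention.

```python
def intact_pieces(original: str, edited: str, pieces: int) -> int:
    """Cut `original` into `pieces` equal parts and count how many parts
    occur unchanged somewhere in `edited`."""
    size = len(original) // pieces
    chunks = [original[i * size:(i + 1) * size] for i in range(pieces)]
    return sum(chunk in edited for chunk in chunks)
```

For k = 2 and s = 1 the string is cut into k+s = 3 pieces. Two substitutions can damage at most two pieces, so at least one piece survives, which is the lower bound s used by the filter.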

Once a region has been found that may contain an approximate match, the region must be verified. Widely used verification algorithms are based on bit-parallel techniques, such as the BPM algorithm, but they are suited to approximate match verification when the pattern string is relatively short. The verification algorithm chosen by the present invention is a DP algorithm based on the dynamic programming matrix. Its idea is as follows: a pattern string usually matches successfully at only a few places in the text, so the values in each column of the edit distance matrix M, computed from top to bottom, quickly reach k+1 (no match). If the value of a matrix element exceeds k, the search result cannot depend on that element. If the value of a matrix element is less than or equal to k, the element is called active. The method computes each column only as far as its last active element and does not compute the elements below it.
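The cut-off idea described above can be sketched as follows. This is a minimal Python illustration under assumptions, not the patent's own code: the function name `cutoff_search` is hypothetical, values above k are clamped to k+1 as a marker for "inactive", and end positions are reported 1-based.

```python
def cutoff_search(pattern: str, text: str, k: int) -> list[int]:
    """Report the 1-based end positions of substrings of `text` within
    edit distance k of `pattern`, computing each matrix column only down
    to one row below the last active cell of the previous column."""
    m = len(pattern)
    inactive = k + 1                       # stands for "greater than k"
    col = [min(i, inactive) for i in range(m + 1)]
    last = min(k, m)                       # last active row (value <= k)
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0                         # a match may start anywhere
        upper = min(last + 1, m)           # rows below stay inactive
        for i in range(1, upper + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost, col[i] + 1, col[i - 1] + 1, inactive)
            prev_diag = col[i]
            col[i] = new
        last = upper
        while col[last] > k:               # col[0] == 0, so this stops
            last -= 1
        if last == m:                      # bottom row is active: a match
            ends.append(j)
    return ends
```

When few cells are active, each column costs only a few operations instead of m, which is why this verifier suits long pattern strings whose occurrences are rare.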

Compared with existing approximate matching algorithms, the present invention has the following significant advantages:

(1) Bit-parallel methods are widely used in approximate searching, and many good results have been obtained with them. However, bit-parallel algorithms are suited to short pattern strings, whereas in practice many searches use long pattern strings. If the length of the pattern string exceeds the machine word length, matching performance becomes very poor. The present approximate search algorithm is well suited to approximate matching of long pattern strings: as the ratio of the pattern string length m to the threshold k increases, its efficiency improves accordingly.

(2) Before the new filtering strategy is added, the time cost of the approximate matching algorithm lies mainly in computing the array M and in verifying candidate match regions with the DP algorithm. The added filtering strategy reduces the time spent matching each q-gram index against each block of the pattern string, which greatly shortens the time needed to compute the array M. In the best case, the algorithm can cut the matching time by half compared with the original algorithm.

(3) A filter algorithm can only improve the average-case efficiency of an algorithm; its advantage is that it may not need to examine every character of the text. The efficiency of a filter algorithm depends on how much of the text it can filter out, and it is very sensitive to the error level. Most filter algorithms work well when the error level is low and poorly otherwise. The present filter algorithm still performs well when the error level is high. Experiments show that, at high error levels, the algorithm can reduce the matching time by half relative to comparable algorithms.

Brief description of the drawings

Fig. 1 is a schematic diagram of building the index over the text string;

Fig. 2 is a schematic diagram of processing the pattern string.

Embodiments:

For a better understanding of the present invention, it is further described below with reference to embodiments; the scope of protection claimed is not, however, limited to the scope of the embodiments.

Embodiment 1: SMS classification

In recent years, China has accelerated the pace of social informatization and digitization, and people have more and more information to handle every day. Application service providers notify customers by SMS in order to remind them or let them learn relevant information as early as possible, for example product recommendations from e-commerce sites, feedback from online banking, and various fee deduction notices, in addition to the users' own chat messages. As people deal with more and more services, these messages become numerous and hard to manage. People need a convenient, flexible and effective way to classify SMS messages, so that they no longer have to open the inbox and deal with all kinds of messages directly. Applying the present invention, "a filter-based fast string matching method", makes it possible to classify SMS messages by template. When a user needs to define a new message category, the user only needs to provide one or more messages of that type, and the system can then automatically classify messages received in the future by approximate matching. The application steps are as follows:

(1) The user creates a new message class and provides one or several SMS messages as samples of the class; these samples are the pattern strings of the class. The content of a new SMS message serves as the target text string: if a pattern string of some class matches the text string, the text string belongs to that class.

(2) Suppose the length of the SMS sample is m = 32 and the length of an SMS text string index is q = 2; the user sets the similarity to k = 4, the minimum number of SMS indexes that exactly match a factor of a pattern string sub-block during matching is s = 2, and the sub-block length is h = 4.

As shown in Fig. 2, the pattern string is preprocessed: let P₀ denote the prefix of the pattern string of length (k+s)h = 24. Cut P₀ into k+s = 6 pattern string sub-blocks of length h = 4, and extend each sub-block by k+q-1 = 5 characters, so that two consecutive sub-blocks overlap by k+q-1 = 5 characters. The k+s = 6 extended sub-blocks are denoted Q₁, Q₂, ..., Q₆;

(3) As shown in Fig. 1, suppose the new SMS message has length 128. Create the indexes of the message: starting from the beginning of the message, read q = 2 characters of the message at every interval of h = 4 characters as an SMS index; the SMS indexes are denoted d₁, d₂, ..., d₃₂;

(4) Create the matching array B[d, j]: if an SMS index d_i belongs to Q_j, then B[d_i, j] = 1; otherwise B[d_i, j] = 0. For each SMS message, create arrays M of length m′ = 6 to record the match counts of the SMS indexes; the array corresponding to d_i is denoted M_{d_i}[6] and is initialized to 0;

(5) Count the matches between k+s = 6 consecutive SMS indexes and the pattern string sub-blocks: match the SMS indexes against the pattern string sub-blocks. While the matches are being recorded, if M_{d_i}[b] ≤ b-4 occurs (with b > 4), stop building this array and go on to build the array M_{d_{i+1}} of the next index. If the element M_{d_i}[6] ≥ 2, approximate matching must be checked further; continue with the next step;

(6) Detect approximate matches: if an approximate match exists, the region to be checked is the part of the SMS message from position j-32 to position j+14 (j is the end position, in the SMS message, of the last of the 6 consecutive indexes in a given match). Check this region with a dynamic programming algorithm. If an approximate match is detected, report the position at which it occurs; otherwise report that the region contains no approximate match. An SMS message may belong to several classes: a successful match shows that part of the message satisfies the template of that class.
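The template-based classification described in this embodiment can be sketched as below. This is a simplified illustration, not the embodiment itself: the names `min_distance` and `classify` are hypothetical, the templates are made-up sample data, and the sketch verifies every message directly by dynamic programming without the filtering stage.

```python
def min_distance(template: str, message: str) -> int:
    """Smallest edit distance between `template` and any substring of
    `message` (column-wise dynamic programming)."""
    col = list(range(len(template) + 1))
    best = col[-1]  # distance to the empty substring
    for c in message:
        prev_diag, col[0] = col[0], 0
        for i in range(1, len(col)):
            cost = 0 if template[i - 1] == c else 1
            prev_diag, col[i] = col[i], min(prev_diag + cost,
                                            col[i] + 1,
                                            col[i - 1] + 1)
        best = min(best, col[-1])
    return best


def classify(message: str, templates: dict[str, list[str]], k: int) -> list[str]:
    """Return every class that has at least one template within edit
    distance k of some part of the message."""
    return [name for name, samples in templates.items()
            if any(min_distance(t, message) <= k for t in samples)]
```

Because every class is tested independently, a message can be assigned to several classes, as the embodiment allows.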

Embodiment 2: fuzzy search of diary documents

With the rapid development of computer technology and the quickening pace of life, people now prefer to do their text work on a computer rather than keeping diaries and completing work documents with pen and paper as in the past. There are three main reasons: 1) computer technology makes text work easier and faster; 2) computer diaries are more entertaining, for example through mood editing; 3) they meet people's need to record things on the spot. Compared with writing a diary entry or work report at the end of the day, recording things as they happen better meets people's needs. With the development of mobile phone hardware, handwriting recognition and speech recognition, daily information can be recorded more conveniently and directly, and computer recognition technology converts this multimedia information into text for storage.

For such loosely organized, informally written documents, applying the present invention, "a filter-based fast approximate pattern matching method", meets the user's need for fuzzy search of diary entries. The application steps are as follows:

(1) The user designates the diary documents as the text string of this embodiment and enters a query as the pattern string of the present invention;

(2) The length of the query is m = 16; the user sets the query similarity to k = 2; the minimum number s of diary document indexes that exactly match a factor of a query sub-block during matching is 1; and the length of a diary document index is q = 3.

The query is preprocessed: let P₀ denote the prefix of the query of length (k+s)h = 12. Cut P₀ into k+s = 3 query sub-blocks of length 4, and extend each sub-block by k+q-1 = 4 characters, so that two consecutive query sub-blocks overlap by 4 characters. The 3 extended query sub-blocks are denoted Q₁, Q₂, Q₃;

(3) Create the diary document indexes: suppose the diary has length 1000. Starting from the beginning of the diary document, read 3 characters at every interval of 4 characters as a diary document index; the diary document indexes are denoted d₁, d₂, ..., d₂₅₀;

(4) Create the matching array B[d, j]: if a diary document index d_i belongs to Q_j, then B[d_i, j] = 1; otherwise B[d_i, j] = 0. For each diary document, create arrays M of length 3 to record the match counts of the diary document indexes; the array corresponding to d_i is denoted M_{d_i}[3] and is initialized to 0;

(5) Count the matches between 3 consecutive diary document indexes and the query sub-blocks: match the diary document indexes against the query sub-blocks. While the matches are being recorded, if M_{d_i}[b] ≤ b-2 occurs (with b > 2), stop building this array and go on to build the array M_{d_{i+1}} of the next index. If the element M_{d_i}[m′] ≥ s = 1, approximate matching must be checked further; continue with the next step;

(6) Detect approximate matches: if an approximate match exists, the region to be checked is the part of the diary document from position j-17 to position j+7 (j is the end position, in the diary document, of the last of the 3 consecutive indexes in a given match). Check this region with a dynamic programming algorithm. If an approximate match is detected, report the position at which it occurs; otherwise report that the region contains no approximate match. Any document that contains a successful match somewhere is offered to the user as a candidate document for closer inspection.
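The numbers quoted in the two embodiments can be checked with a few lines of Python. The function names are hypothetical. The bounds in `verification_region` are copied from step (5) of the method. The formula in `assumed_block_length` is an assumption: the defining formula for h is not reproduced in this text, and floor((m-k-q+1)/(k+s)) is used here only because it yields h = 4 for both embodiments.

```python
def assumed_block_length(m: int, k: int, s: int, q: int) -> int:
    """ASSUMPTION: h = floor((m-k-q+1)/(k+s)); not stated in this text."""
    return (m - k - q + 1) // (k + s)


def index_count(n: int, h: int) -> int:
    """Number of text string indexes d_1 ... d_{n/h} for a text of length n."""
    return n // h


def verification_region(w: int, m: int, k: int, s: int, q: int, h: int):
    """Bounds of the region to verify, as given in step (5); w is the end
    position of the last of the k+s consecutive indexes."""
    left = w - (k + s) * h - 2 * k - q + 2
    right = w + m - (k + s - 1) * h + k - q
    return left, right
```

With w = 0 the returned pair gives the offsets relative to the end position, which can be compared with "j-32 to j+14" in Embodiment 1 and "j-17 to j+7" in Embodiment 2.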

Claims

1. A filter-based fast string matching method, characterized by comprising the following steps:

(1) preprocessing the pattern string: let P₀ denote the prefix of the pattern string of length (k+s)h; P₀ is cut into k+s pattern string sub-blocks of length h, and each sub-block is extended by k+q-1 characters, so that two consecutive sub-blocks overlap by k+q-1 characters; the k+s extended sub-blocks are denoted Q₁, Q₂, ..., Q_{k+s};

wherein m is the number of characters in the pattern string; k is the minimum number of operations needed to transform the pattern string into some substring of the text string, called the edit distance for short, 0 ≤ k < m; s is the minimum number of text string indexes that exactly match a factor of a pattern string sub-block during matching, 1 ≤ s < m; q is the length of a text string index, q ≤ h;

(2) creating the text string indexes: starting from the beginning of the text string, q characters of the text string are read at every interval of h characters as a text string index; the text string indexes are denoted d₁, d₂, ..., d_{n/h}; q ≤ h; n is the length of the text string;

(3) creating the matching array B[d, j]: if a text string index d_i belongs to Q_j, the matching array B[d_i, j] = 1; otherwise the matching array B[d_i, j] = 0; for each text string, arrays M of length m′ are created to record the match counts of the text string indexes; the array corresponding to d_i is denoted M_{d_i}[m′] and is initialized to 0; wherein m′ = k+s;

(4) counting the matches between k+s consecutive text string indexes and the pattern string sub-blocks: the text string indexes are matched against the pattern string sub-blocks; while the matches are being recorded, if M_{d_i}[j] ≤ j-k occurs, building of this array stops and the array M_{d_{i+1}} of the next index is built; if the element M_{d_i}[m′] ≥ s, approximate matching must be checked further, and the method continues with the next step;

(5) detecting approximate matches: if an approximate match exists, the region to be checked is the part of the text string from position w-(k+s)h-2k-q+2 to position w+m-(k+s-1)h+k-q, and this region is checked with a dynamic programming algorithm; if an approximate match is detected, the position at which it occurs is reported, otherwise it is reported that the region contains no approximate match; w is the end position in the text string of the last of the k+s consecutive indexes in a given match;

the text string is a string, and the pattern string is a string.

2. The filter-based fast string matching method according to claim 1, characterized in that the string is a sequence of characters defined over a finite alphabet.

3. The filter-based fast string matching method according to claim 1, characterized in that, given strings x, y and z, x is called a prefix of xy, a suffix of yx, and a factor of yxz.