CN102750379A - Fast character string matching method based on filtering type

Info

Publication number: CN102750379A
Application number: CN2012102118297A
Authority: CN
Inventors: 李拥军; 邹少聪; 林浩; 黄格仕; 谢豪
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2012-06-25
Filing date: 2012-06-25
Publication date: 2012-10-24
Anticipated expiration: 2032-06-25
Also published as: CN102750379B

Abstract

The invention discloses a fast string matching method based on filtering. The pattern string is first preprocessed: the prefix P0 of the pattern string is cut into (k+s) pattern blocks of length h, each block is lengthened by (k+q-1) characters, and the lengthened blocks are denoted Q1, Q2, ..., Q(k+s). Then, starting from the beginning of the text string, q characters are read every h characters and used as text string indexes, denoted d1, d2, ..., dn/h. A match array B[d, j] is created: if a text string index di belongs to Qj, then B[di, j] = 1. The number of matches between (k+s) consecutive text string indexes and the pattern blocks is computed, and finally approximate matches are detected. The method combines a dynamic programming algorithm with a filtering algorithm and adds a new filtering strategy, which shortens the average time of approximate matching and greatly improves matching performance.

Description

A fast string matching method based on filtering

Technical field

The present invention relates to the technical field of string matching, and in particular to a fast string matching method based on filtering; it belongs to the fields of information retrieval and computational biology.

Background technology

The string matching problem can be defined as finding, in a given symbol sequence, a pattern with certain properties; the simplest example is finding a given string within another given string. "Approximate matching" generally means that some small differences are allowed between the pattern and the text string. In the following, "matching" generally means "approximate matching". String matching is one of the oldest and most widely studied problems in computer science, and its applications are ubiquitous. In recent years, academic interest in string matching has kept growing, particularly in the rapidly developing fields of information retrieval and computational biology. At the same time, approximate matching is being applied more and more widely, for example to the correction of spelling errors in information retrieval, to sequence alignment in computational biology, and to the correction of transmission errors in signal processing. This is not only because the texts to be processed in these two research fields keep growing, but also because increasingly complex searches must be carried out in them.

Current research on approximate matching is mainly based on the distance model, which requires a distance function that measures the similarity of two strings. In approximate matching, a pattern string and a corresponding threshold k are usually given, where k specifies the maximum distance allowed between the pattern string and a matching result. Approximate matching is defined as follows: under the edit distance model, given an integer k and two strings Pattern[1...m] and Text[1...n], find all substrings p of the text string Text such that the edit distance edit(Pattern, p) ≤ k. In practical applications, many problems use the edit distance model or a variant of it.
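The definition above can be made concrete with a small Python sketch. It is an illustration only, under the assumption that the edit operations are single-character insertion, deletion and substitution (the Levenshtein distance); the function names are hypothetical and do not come from the patent. The brute-force search is far too slow for real use and serves only to show what "all substrings p with edit(Pattern, p) ≤ k" means.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # turning a[:i] into the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def naive_approximate_matches(pattern: str, text: str, k: int):
    """All (start, end) pairs such that edit_distance(pattern, text[start:end]) <= k."""
    hits = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if edit_distance(pattern, text[start:end]) <= k:
                hits.append((start, end))
    return hits
```

Filtering methods such as the one described here aim to avoid running this kind of computation on most of the text.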

Existing approximate matching methods fall into four categories. The first is based on dynamic programming. The second is based on NFA (nondeterministic finite automaton) search. The third is based on bit-parallel techniques, and is currently the most successfully applied category. The last is based on filtering: a fast filtering algorithm skips the large regions of the text in which no successful match can exist, and a non-filtering algorithm then verifies whether the remaining regions contain a successful match. Methods of this category are also among the fastest algorithms.

The algorithm proposed here uses the idea of filtering to greatly improve the efficiency of matching.

Summary of the invention

The object of the present invention is to provide a fast approximate pattern matching method that, in particular, overcomes the limitation of the machine word length, and thus to provide an approximate matching algorithm suitable for long pattern strings.

For a given text string and pattern string, the filter-based fast string matching method of the present invention seeks, in the text string, the substrings that have a certain similarity to the pattern string. The invention uses the idea of filtering to discard the regions of the text string that cannot produce a successful match, thereby improving the efficiency of the algorithm in the average case.

To achieve the above object, the present invention adopts the following technical solution:

A fast string matching method based on filtering, comprising the following steps:

(1) Preprocess the pattern string: denote by P₀ the prefix of the pattern string of length (k+s)h, and cut P₀ into k+s pattern sub-blocks of length h; extend the length of each pattern sub-block by k+q-1, so that two consecutive pattern sub-blocks share an overlapping part of length k+q-1; the k+s extended pattern sub-blocks are denoted Q₁, Q₂, ..., Q_{k+s}, where:

m is the number of characters in the pattern string; k is the minimum number of operations needed to transform the pattern string into a substring of the text string, called the edit distance for short, 0 ≤ k < m; s is the minimum number of exact matches, during matching, between text string indexes (explained in step 2) and factors of the pattern sub-blocks, 1 ≤ s < m; q is the length of a text string index, q ≤ h;

(2) Create the text string index: starting from the beginning of the text string, read q characters of the text string every h characters as a text string index; the text string indexes are denoted d₁, d₂, ..., d_{n/h}; q ≤ h;

(3) Create the match array B[d, j]: if a text string index d_i belongs to Q_j, then B[d_i, j] = 1; otherwise B[d_i, j] = 0. For each text string index, create an array M of length m′ to record the number of matches of the text string indexes; the array corresponding to d_i is denoted M_{d_i}[1...m′] and is initialized to 0, where m′ = k+s;

(4) Compute the number of matches between k+s consecutive text string indexes and the pattern sub-blocks: match the text string indexes against the pattern sub-blocks; while recording the matches, if M_{d_i}[j] ≤ j-k occurs (where j > k), stop building this array and go on to build the array M_{d_{i+1}} of the next index; if the element M_{d_i}[m′] ≥ s, an approximate match must be checked for further, and the method proceeds to the next step;

(5) Detect approximate matches: if an approximate match may exist, the region to be checked is the part of the text string from position j-(k+s)h-2k-q+2 to position j+m-(k+s-1)h+k-q, where j is the end position in the text string of the last of the k+s consecutive matched indexes; this region is checked with a dynamic programming algorithm. If an approximate match is detected, all positions at which it occurs are reported; otherwise it is reported that the region contains no approximate match.
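Steps (1) to (4) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation, and all function names are hypothetical. It rests on four assumptions that the text does not state: the block length is taken as h = floor((m-k-q+1)/(k+s)), which gives h = 4 for the parameter sets of both embodiments; each block is extended to the right; index d_i starts at text offset (i-1)h; and "d belongs to Q_j" means that d occurs as a substring of Q_j. Step (4) writes the stop test as M[j] ≤ j-k; read literally, that would also abandon a window whose final count is exactly s, so the sketch uses the strict form M[j] < j-k instead.

```python
def block_length(m, k, s, q):
    # ASSUMPTION: the text does not reproduce the formula for h.
    return (m - k - q + 1) // (k + s)


def extended_blocks(pattern, k, s, q, h):
    # Step (1): cut the prefix P0 of length (k+s)*h into k+s blocks of length h,
    # each extended by k+q-1 characters (ASSUMPTION: extension to the right).
    ext = k + q - 1
    return [pattern[j * h:(j + 1) * h + ext] for j in range(k + s)]


def text_indexes(text, q, h):
    # Step (2): read q characters every h characters, starting at offset 0.
    return [text[pos:pos + q] for pos in range(0, len(text) - q + 1, h)]


def candidate_windows(pattern, text, k, s, q):
    """Return the 0-based numbers i of the first index of every run of k+s
    consecutive text indexes that survives the filter (requires h >= 1)."""
    h = block_length(len(pattern), k, s, q)
    blocks = extended_blocks(pattern, k, s, q, h)
    indexes = text_indexes(text, q, h)
    m_prime = k + s
    hits = []
    for i in range(len(indexes) - m_prime + 1):
        count = 0            # running value of M[j]
        abandoned = False
        for j in range(1, m_prime + 1):
            # Step (3): B[d, j] = 1 when index d occurs inside block Q_j.
            if indexes[i + j - 1] in blocks[j - 1]:
                count += 1
            # Step (4) early stop; strict "<" is an ASSUMPTION (the text has "<=").
            if j > k and count < j - k:
                abandoned = True
                break
        if not abandoned and count >= s:
            hits.append(i)
    return hits
```

For example, with a 16-character pattern placed at offset 10 of a 36-character text and k = 2, s = 1, q = 3, only the window starting at index number 3 (text offset 12) survives; that window would then be passed to the verification of step (5).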

A string is a sequence of characters defined over a finite alphabet. The text string is a string, and so is the pattern string. The string matching problem is to find all positions at which a given pattern string occurs in a large text string. Given strings x, y and z, x is called a prefix of xy, a suffix of yx, and a factor of yxz.

Principle of the present invention: if the pattern string is divided into k+s pieces, then for any successful approximate match at least s of the pieces must be matched exactly in the text string. Based on this idea, the invention builds a q-gram index and counts the exactly matching substrings in the text string, in order to decide whether the region covered by a run of q-gram indexes may contain an approximate match. A q-gram index is built by cutting the text Text with a sliding window of length q and sliding distance d, which yields substrings of length q. To count the matches between q-grams and the pattern string, the match array B[d, j] is introduced: if a q-gram d belongs to Q_j, then B[d, j] = 1; otherwise B[d, j] = 0. Thus, for each q-gram d, the row B[d, *] gives the match information of d against every block.
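One possible way to hold the match array B is a dictionary keyed by q-gram, where each value is the row B[d, *]. The sketch below is an assumption about the data layout, not something the text prescribes; the function names are hypothetical, block numbers are 0-based, and "d belongs to Q_j" is taken to mean that d occurs as a substring of Q_j.

```python
def build_match_table(blocks, q):
    """Map every q-gram occurring in some block to its row B[d, *]."""
    table = {}
    for j, block in enumerate(blocks):
        for start in range(len(block) - q + 1):
            gram = block[start:start + q]
            # Create an all-zero row the first time a q-gram is seen.
            row = table.setdefault(gram, [0] * len(blocks))
            row[j] = 1
    return table


def match_bit(table, gram, j):
    """B[gram, j]; q-grams that occur in no block have an implicit all-zero row."""
    row = table.get(gram)
    return row[j] if row is not None else 0
```

Building the table once during preprocessing means that each text index is resolved with a single dictionary lookup instead of a substring search in every block.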

Once it has been established that a region may contain an approximate match, the region must be verified. The more widely used verification algorithms are based on bit-parallel techniques, for example the BPM algorithm. However, such methods are suitable for approximate match verification only when the pattern string is short. The verification algorithm chosen by the present invention is the DP algorithm based on the dynamic programming matrix. The idea of this algorithm is as follows: a pattern string usually has very few successful matches in the text; therefore, in the edit distance matrix M, the values in each column, computed from top to bottom, quickly reach k+1 (that is, a mismatch), and if the value of a matrix element is greater than k+1, the search result cannot depend on that element. If the value of a matrix element is less than or equal to k, the element is called active. The method computes each column only as far as its last active element; the elements after it need not be computed.
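The cut-off idea described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patent's code: the function name is hypothetical, the distance is assumed to be Levenshtein distance, the pattern is assumed to be non-empty, and values above k are stored as k+1 (an implementation choice that keeps cells beyond the last active element well defined). The function returns the 0-based end positions in the region at which the pattern matches with at most k errors.

```python
def verify_region(pattern, region, k):
    m = len(pattern)
    cap = k + 1                               # any value above k is stored as k+1
    col = [min(i, cap) for i in range(m + 1)] # column for the empty text prefix
    last_active = min(k, m)                   # largest i with col[i] <= k
    ends = []
    for pos, ch in enumerate(region):
        # The last active cell can move down by at most one row per column.
        top = min(last_active + 1, m)
        diag = col[0]                         # col[0] stays 0: a match may start anywhere
        for i in range(1, top + 1):
            old = col[i]
            if pattern[i - 1] == ch:
                new = diag
            else:
                new = 1 + min(diag, col[i - 1], old)
            col[i] = min(new, cap)
            diag = old
        # Recompute the last active cell; cells below it are never touched.
        last_active = top
        while col[last_active] > k:
            last_active -= 1
        if last_active == m:
            ends.append(pos)                  # pattern matches, ending at region[pos]
    return ends
```

For instance, `verify_region("survey", "surgery", 2)` reports the end positions 4, 5 and 6, while the same call with k = 1 reports none. When matches are rare, most columns are cut off after a few rows, which is the source of the speed-up.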

Compared with existing approximate matching algorithms, the present invention has the following notable advantages:

(1) Bit-parallel methods are used extensively in approximate search, and many good results have been obtained with them. However, bit-parallel algorithms are suited to short pattern strings, whereas many searches in practice involve long pattern strings. If the length of the pattern string exceeds the machine word length, matching performance becomes very poor. The present approximate search algorithm, by contrast, is well suited to approximate matching of long pattern strings, and its efficiency improves as the ratio of the pattern length m to the threshold k increases.

(2) Before the new filtering strategy was added, the time cost of the approximate matching algorithm lay mainly in computing the array M and in verifying possible matching regions with the DP algorithm. The added filtering strategy simplifies the computation of matches between each q-gram index and each block of the pattern string, and thereby greatly shortens the time needed to compute the array M. In the best case, the algorithm takes half the matching time of the original algorithm.

(3) A filtering algorithm can only improve efficiency in the average case; its advantage is that it may not need to examine every character of the text. The efficiency of a filtering algorithm depends on how much of the text it can filter out, and it is very sensitive to the error level. Most filtering algorithms work well when the error level is low and poorly otherwise, whereas the present filtering algorithm still performs well at higher error levels. Experiments show that at higher error levels the algorithm needs half the matching time of comparable algorithms.

Description of drawings

Fig. 1 is a schematic diagram of constructing the index for the text string;

Fig. 2 is a schematic diagram of processing the pattern string.

Embodiments:

For a better understanding of the present invention, it is further described below with reference to embodiments; the scope of protection claimed is, however, not limited to the scope represented by the embodiments.

Embodiment 1: SMS classification

In recent years, China has accelerated the pace of social informatization and digitization, and people have more and more information to process every day. To remind customers, or to let them learn relevant information as soon as possible, application service providers notify them by SMS: product recommendations from e-commerce sites, feedback from online banking, various fee deduction notices, as well as the users' own chat messages. As the number of services people deal with grows, these messages become numerous and hard to manage. Users need a convenient, flexible and effective way to classify SMS messages, so that they no longer have to face all kinds of messages directly in the inbox. The present invention, "a fast string matching method based on filtering", can be used to classify SMS messages by template. When the user needs to define a new message category, only one or more messages of that category need to be provided, and the system can then use approximate matching to classify messages received in the future automatically. The application steps are as follows:

(1) The user creates a new message category and provides one or several SMS messages as samples of the category; these samples are the pattern strings of the category. The content of a new SMS message is the target text string; if a pattern string of some category matches the text string, the text string belongs to that category.

(2) Suppose the length of the SMS sample is m = 32 and the length of an SMS text index is q = 2; the user sets the similarity k = 4, and the minimum number of exact matches, during matching, between SMS indexes and factors of the pattern sub-blocks is s = 2; the block length is then h = 4.

As shown in Fig. 2, preprocess the pattern string: denote by P₀ the prefix of the pattern string of length (k+s)h = 24, and cut P₀ into k+s = 6 pattern sub-blocks of length h = 4; extend the length of each pattern sub-block by k+q-1 = 5, so that two consecutive pattern sub-blocks share an overlapping part of length k+q-1 = 5; the k+s = 6 extended pattern sub-blocks are denoted Q₁, Q₂, ..., Q₆;

(3) As shown in Fig. 1, suppose the new SMS message has length 128. Create the SMS index: starting from the beginning of the message, read q = 2 characters every h = 4 characters as an SMS index; the SMS indexes are denoted d₁, d₂, ..., d₃₂;

(4) Create the match array B[d, j]: if an SMS index d_i belongs to Q_j, then B[d_i, j] = 1; otherwise B[d_i, j] = 0. For each SMS index, create an array M of length m′ = 6 to record the number of matches of the SMS indexes; the array corresponding to d_i is denoted M_{d_i}[1...6] and is initialized to 0;

(5) Compute the number of matches between k+s = 6 consecutive SMS indexes and the pattern sub-blocks: match the SMS indexes against the pattern sub-blocks; while recording the matches, if M_{d_i}[b] ≤ b-4 occurs (where b > 4), stop building this array and go on to build the array M_{d_{i+1}} of the next index; if the element M_{d_i}[6] ≥ 2, an approximate match must be checked for further, and the method proceeds to the next step;

(6) Detect approximate matches: if an approximate match may exist, the region to be checked is the part of the SMS message from position j-32 to position j+14 (j is the end position, in the SMS message, of the last of the 6 consecutive indexes of a given match); this region is checked with a dynamic programming algorithm. If an approximate match is detected, all positions at which it occurs are reported; otherwise it is reported that the region contains no approximate match. An SMS message may belong to several categories; a successful match shows that part of the message content meets the template requirement of the category.

Embodiment 2: fuzzy search of diary documents

With the rapid development of computer technology and the quickening pace of life, people now prefer to do their text work on a computer rather than keeping diaries and completing work documents with pen and paper as before. There are three main reasons: 1) computer technology makes text work easier and faster; 2) a computer diary is more entertaining, for example through mood editing; 3) it satisfies people's need to record things on the spot. Compared with writing a diary entry or work report at the end of the day, recording things as they happen better meets people's needs. With the development of mobile phone hardware, handwriting recognition and speech recognition, daily information can be recorded more conveniently and directly, and computer recognition technology converts this multimedia information into text for storage.

For such loosely organized, informal documents, the present invention, "a filter-based fast approximate pattern matching method", can meet the user's need for fuzzy search of diary entries. The application steps are as follows:

(1) The user designates the diary document as the text string of this embodiment and enters a query statement as the pattern string of the present invention;

(2) The length of the query statement is m = 16; the user sets the query similarity k = 2; the minimum number of exact matches, during matching, between diary document indexes and factors of the query sub-blocks is s = 1; the length of a diary document index is q = 3; the block length is then h = 4.

Preprocess the query statement: denote by P₀ the prefix of the query statement of length (k+s)h = 12, and cut P₀ into k+s = 3 query sub-blocks of length 4; extend the length of each query sub-block by k+q-1 = 4, so that two consecutive query sub-blocks share an overlapping part of length 4; the 3 extended query sub-blocks are denoted Q₁, Q₂, Q₃;

(3) Create the diary document index: suppose the diary has length 1000. Starting from the beginning of the diary document, read 3 characters every 4 characters as a diary document index; the diary document indexes are denoted d₁, d₂, ..., d₂₅₀;

(4) Create the match array B[d, j]: if a diary document index d_i belongs to Q_j, then B[d_i, j] = 1; otherwise B[d_i, j] = 0. For each diary document index, create an array M of length 3 to record the number of matches of the diary document indexes; the array corresponding to d_i is denoted M_{d_i}[1...3] and is initialized to 0;

(5) Compute the number of matches between 3 consecutive diary document indexes and the query sub-blocks: match the diary document indexes against the query sub-blocks; while recording the matches, if M_{d_i}[b] ≤ b-2 occurs (where b > 2), stop building this array and go on to build the array M_{d_{i+1}} of the next index; if the element M_{d_i}[m′] ≥ s = 1, an approximate match must be checked for further, and the method proceeds to the next step;

(6) Detect approximate matches: if an approximate match may exist, the region to be checked is the part of the diary document from position j-17 to position j+7 (j is the end position, in the diary document, of the last of the 3 consecutive indexes of a given match); this region is checked with a dynamic programming algorithm. If an approximate match is detected, all positions at which it occurs are reported; otherwise it is reported that the region contains no approximate match. Any document in which a match succeeds somewhere is offered to the user as a candidate document for detailed inspection.
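The figures quoted in the two embodiments can be re-derived from the formulas of the method with a few lines of Python. The sketch below is only an arithmetic check; the function name and dictionary keys are hypothetical, and the formula used for h, floor((m-k-q+1)/(k+s)), is an assumption that reproduces h = 4 in both embodiments rather than a formula given in the text. The two window offsets are relative to the position j defined in step (6).

```python
def derive_parameters(m, k, s, q, n):
    """Quantities used by the method for pattern length m and text length n."""
    h = (m - k - q + 1) // (k + s)  # ASSUMPTION: formula for h not given in the text
    return {
        "h": h,
        "prefix_length": (k + s) * h,               # length of P0
        "extension": k + q - 1,                     # overlap of consecutive blocks
        "index_count": n // h,                      # number of text indexes d_i
        "window_start": -(k + s) * h - 2 * k - q + 2,  # offset added to j
        "window_end": m - (k + s - 1) * h + k - q,     # offset added to j
    }


embodiment_1 = derive_parameters(m=32, k=4, s=2, q=2, n=128)
embodiment_2 = derive_parameters(m=16, k=2, s=1, q=3, n=1000)
```

Embodiment 1 yields a prefix of length 24, an extension of 5, 32 indexes and the window j-32 to j+14; Embodiment 2 yields a prefix of length 12, an extension of 4, 250 indexes and the window j-17 to j+7, in agreement with the steps above.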

Claims

1. A fast string matching method based on filtering, characterized by comprising the following steps:

m is the number of characters in the pattern string; k is the minimum number of operations needed to transform the pattern string into a substring of the text string, called the edit distance for short, 0 ≤ k < m; s is the minimum number of exact matches, during matching, between text string indexes and factors of the pattern sub-blocks, 1 ≤ s < m; q is the length of a text string index, q ≤ h;

(3) Create the match array B[d, j]: if a text string index d_i belongs to Q_j, then the match array B[d_i, j] = 1; otherwise the match array B[d_i, j] = 0. For each text string index, create an array M of length m′ to record the number of matches of the text string indexes; the array corresponding to d_i is denoted M_{d_i}[1...m′] and is initialized to 0, where m′ = k+s;

(4) Compute the number of matches between k+s consecutive text string indexes and the pattern sub-blocks: match the text string indexes against the pattern sub-blocks; while recording the matches, if M_{d_i}[j] ≤ j-k occurs, stop building this array and go on to build the array M_{d_{i+1}} of the next index; if the element M_{d_i}[m′] ≥ s, an approximate match must be checked for further, and the method proceeds to the next step;

2. The fast string matching method based on filtering according to claim 1, characterized in that: the text string is a string and the pattern string is a string; a string is a sequence of characters defined over a finite alphabet.

3. The fast string matching method based on filtering according to claim 1, characterized in that: given strings x, y and z, x is called a prefix of xy, a suffix of yx, and a factor of yxz.