CN1542614A

CN1542614A - Bit vector method used for Chinese string matching

Info

Publication number: CN1542614A
Application number: CNA031133800A
Authority: CN
Inventors: 陈开渠; 赵洁; 彭志威
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2003-05-01
Filing date: 2003-05-01
Publication date: 2004-11-03

Abstract

This bit vector matching method for Chinese character strings handles Chinese characters that are represented in the computer with two bytes each, as follows: the two bytes of each Chinese character are mapped to one high-byte integer and one low-byte integer; the high-byte integers form a high-byte array and the low-byte integers form a low-byte array; all integers in both arrays are cleared to 0; the whole Chinese pattern string is then scanned and each of its characters is processed as follows. For the Chinese character at position p in the pattern string, the p-th bit of the integer corresponding to its low byte in the low-byte array is set to 1, and the p-th bit of the integer corresponding to its high byte in the high-byte array is set to 1. Matching is then performed using the integers in the two arrays that correspond to the character currently being processed in the text string. Because the two bytes of each Chinese character are handled separately, the required space is greatly reduced.

Description

A bit vector method for Chinese character string matching

Technical field

The present invention relates to a bit vector method for string matching.

Background technology

Fuzzy string matching has important applications in fields such as intrusion detection, mobile short-message filtering, text editing, information retrieval, automatic indexing, computational biology, and information extraction, and has become an important topic in algorithm design. The problem it solves is: given a text string and a pattern string, find all parts of the text string that are similar to the pattern string.

The classical approach to fuzzy string matching is based on computing a dynamic programming matrix. Since P. Sellers published this method in 1980, many people have improved it. One very effective improvement is the bit vector method.

In today's computers, the integer word length is generally 32 or 64 bits, so 32 or 64 bit operations can be carried out with a single integer operation, which speeds up computation by a factor of up to 32 or 64. The bit vector method exploits exactly this: when the pattern string length is less than the integer word length, it improves the performance of string matching.

Existing bit vector method

The bit vector method uses an integer array as long as the character set, so that every character has a corresponding integer. The bit vector method is described as follows:

1) each Chinese character is represented with two bytes;

2) the two bytes are combined and treated as a single 16-bit integer, so that each Chinese character corresponds to one such integer, and the integers for all Chinese characters form an array;

3) all integers in this array are cleared to 0;

4) the whole pattern string is scanned from beginning to end, and each character in it is processed as follows: let p be the position of the character in the pattern string; the p-th bit of the integer corresponding to this character is set to 1;

5) matching is performed using the integer in the array that corresponds to the character currently being processed in the text string.

Chinese characters are all represented with two bytes. GB2312 specifies the two bytes for each Chinese character explicitly. For example, the string "陈晨的爸爸" ("Chen Chen's father") contains 4 distinct Chinese characters, whose two bytes are (in hexadecimal):

陈: B3 C2

晨: B3 BF

的: B5 C4

爸: B0 D6

So "陈晨的爸爸" is stored in the computer as:

B3 C2 B3 BF B5 C4 B0 D6 B0 D6

Existing bit vector method is handled the big integer (0～65535) that these two bit combinations are used as one 16 bit, so, all corresponding big integer of each Chinese character, such as

" old "=B3C2=46018

" morning "=B3BF=46015

" "=B5C4=46532

" father "=B0D6=45270

Like this, change into bit vector to the Chinese character string, just need reach an array M of 65536.
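The preprocessing of the existing method can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `preprocess_16bit` is hypothetical, the pattern is assumed to consist purely of two-byte characters with the high byte first, and position p is stored in bit p-1 so that position 1 is the rightmost bit, as in the bit strings quoted in this description.

```python
def preprocess_16bit(pattern: bytes) -> list:
    """Existing method: one integer for every possible 16-bit character code."""
    if len(pattern) % 2:
        raise ValueError("pattern must consist of two-byte characters")
    m_table = [0] * 65536                      # step 3: clear every integer
    for p in range(len(pattern) // 2):         # step 4: scan the pattern string
        # step 2: combine high byte and low byte into one 16-bit integer
        code = (pattern[2 * p] << 8) | pattern[2 * p + 1]
        m_table[code] |= 1 << p                # set the bit for position p + 1
    return m_table


# The ten example bytes B3 C2 B3 BF B5 C4 B0 D6 B0 D6 from the text.
PATTERN = bytes.fromhex("B3C2B3BFB5C4B0D6B0D6")
M = preprocess_16bit(PATTERN)

print(format(M[0xB3C2], "05b"))  # 00001
print(format(M[0xB0D6], "05b"))  # 11000
```

Only four of the 65536 entries are non-zero for this pattern, which is the space overhead the following paragraphs discuss.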

Because the method needs an integer array as long as the character set, and there are more than 10,000 Chinese characters, applying it to Chinese character matching requires space for more than 10,000 integers. If an integer occupies 4 bytes, about 40K bytes are needed.

Outside China, this problem is not noticeable because the Latin alphabet is a small character set. For matching Chinese character strings, however, there are more than 10,000 Chinese characters and the required space exceeds 40K bytes, so the problem becomes much more prominent.

Summary of the invention

The technical problem to be solved by the present invention is to overcome the large space requirement of the existing bit vector method when it is applied to Chinese; to this end, a new bit vector method is proposed.

In the computer, a Chinese character is represented with two bytes, referred to here as the low byte and the high byte. The method uses two integer arrays of length 256, called the low-byte array and the high-byte array. The low byte of each Chinese character thus corresponds to one integer in the low-byte array, and the high byte to one integer in the high-byte array.

The new bit vector method of the present invention, for Chinese characters represented in the computer with two bytes each, is as follows:

(1) the two bytes are handled as one high-byte integer and one low-byte integer respectively; the high-byte integers and low-byte integers of all Chinese characters form a high-byte array and a low-byte array respectively;

(2) all integers in the low-byte array and the high-byte array are cleared to 0;

(3) the whole Chinese pattern string is scanned from beginning to end and each Chinese character in it is processed as follows. Let p be the position of a Chinese character in the pattern string; then:

First, the p-th bit of the integer in the low-byte array corresponding to the low byte of this character is set to 1;

Second, the p-th bit of the integer in the high-byte array corresponding to the high byte of this character is set to 1;

(4) matching is performed using the integers in the two arrays that correspond to the character currently being processed in the text string.
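Step (4) does not fix a particular matching procedure. As one possible illustration, the sketch below plugs the two arrays into the well-known Shift-And algorithm for exact matching; this is a swapped-in technique chosen for brevity, not the fuzzy matching procedure of the patent. The names `preprocess` and `shift_and_search` are hypothetical, and the text is assumed to consist purely of two-byte characters with the high byte first.

```python
def preprocess(pattern: bytes):
    """Steps (1) to (3): build the high-byte and low-byte arrays."""
    hm, lm = [0] * 256, [0] * 256
    for p in range(len(pattern) // 2):
        hm[pattern[2 * p]] |= 1 << p        # high byte of character p + 1
        lm[pattern[2 * p + 1]] |= 1 << p    # low byte of character p + 1
    return hm, lm


def shift_and_search(text: bytes, pattern: bytes) -> list:
    """Return the 0-based character offsets where pattern occurs in text."""
    m = len(pattern) // 2
    if m == 0 or len(pattern) % 2 or len(text) % 2:
        raise ValueError("expected non-empty strings of two-byte characters")
    hm, lm = preprocess(pattern)
    accept = 1 << (m - 1)                   # bit that signals a full match
    state, hits = 0, []
    for j in range(len(text) // 2):
        # Step (4): combine the two integers for the current text character.
        char_mask = hm[text[2 * j]] & lm[text[2 * j + 1]]
        state = ((state << 1) | 1) & char_mask
        if state & accept:
            hits.append(j - m + 1)
    return hits


TEXT = bytes.fromhex("B3C2B3BFB5C4B0D6B0D6")
print(shift_and_search(TEXT, bytes.fromhex("B3BFB5C4")))  # [1]
print(shift_and_search(TEXT, bytes.fromhex("B0D6")))      # [3, 4]
```

The only change relative to a single-table Shift-And is the line that computes `char_mask`: one lookup in a large table is replaced by two lookups in 256-entry tables and one bitwise AND.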

The present bit vector method needs only two integer arrays of length 256. If an integer occupies 4 bytes, 2K bytes are needed in total, about 5% of the space used by the original bit vector method.

Compared with the original bit vector method, handling the two bytes of each Chinese character separately significantly reduces the required space.
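The space figures can be checked with a few lines of arithmetic. The sketch below assumes 4-byte integers and the round figure of 10,000 Chinese characters used in this description; the variable names are illustrative, and the full 65536-entry table is included only for comparison.

```python
INT_BYTES = 4                         # assumed size of one integer

old_bytes = 10_000 * INT_BYTES        # one integer per Chinese character
new_bytes = 2 * 256 * INT_BYTES       # two arrays of length 256
full_table_bytes = 65536 * INT_BYTES  # one integer per 16-bit code

print(old_bytes)                      # 40000
print(new_bytes)                      # 2048
print(f"{new_bytes / old_bytes:.1%}")
print(f"{new_bytes / full_table_bytes:.1%}")
```

The two arrays therefore take 2048 bytes, roughly 5% of the 40,000 bytes attributed to the original method.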

Description of drawings

Fig. 1 is a flow chart of the string processing in the present invention;

Fig. 2 shows the integer array obtained for the string "陈晨的爸爸" after processing with the original bit vector method;

Fig. 3 shows the two integer arrays obtained for the string "陈晨的爸爸" after processing with the bit vector method of the present invention.

Embodiment

In the present invention, a Chinese character is handled as two small integers (0～255): each Chinese character corresponds to one high-byte integer and one low-byte integer. Taking "陈晨的爸爸" as an example:

"陈": high-byte integer = B3 = 179, low-byte integer = C2 = 194

"晨": high-byte integer = B3 = 179, low-byte integer = BF = 191

"的": high-byte integer = B5 = 181, low-byte integer = C4 = 196

"爸": high-byte integer = B0 = 176, low-byte integer = D6 = 214

Thus, converting a Chinese character string into bit vectors requires only two arrays of length 256, HM and LM (see Fig. 3).

Given a Chinese pattern string p and two integer arrays LM and HM of length 256, the present bit vector method can be implemented as follows:

NewPreprocess(p, LM, HM)
Begin
  For i = 1 to 256 Do
  Begin
    LM[i] = 0;
    HM[i] = 0;
  End
  For i = 1 to m Do
  Begin
    LM[low(p_i)](i) = 1;
    HM[high(p_i)](i) = 1;
  End
End

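Here, low(p_i) denotes the low byte of the i-th character of p and high(p_i) its high byte; m is the number of characters in p, and X(i) = 1 means that the i-th bit of the integer X is set to 1.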
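A runnable rendering of NewPreprocess might look like the Python sketch below. It rests on a few assumptions that are not spelled out in the pseudocode: the pattern is passed as raw bytes with the high byte of each character first, the arrays are indexed 0 to 255 by byte value, and the i-th bit is stored as bit i-1 so that position 1 is the rightmost bit.

```python
def new_preprocess(p: bytes):
    """Build LM and HM for a pattern p made of two-byte characters."""
    if len(p) % 2:
        raise ValueError("pattern must consist of two-byte characters")
    lm = [0] * 256                    # low-byte array, cleared to 0
    hm = [0] * 256                    # high-byte array, cleared to 0
    m = len(p) // 2                   # number of characters in p
    for i in range(1, m + 1):         # i is the 1-based character position
        high = p[2 * (i - 1)]
        low = p[2 * (i - 1) + 1]
        lm[low] |= 1 << (i - 1)       # LM[low(p_i)](i) = 1
        hm[high] |= 1 << (i - 1)      # HM[high(p_i)](i) = 1
    return lm, hm


# The ten example bytes B3 C2 B3 BF B5 C4 B0 D6 B0 D6.
LM, HM = new_preprocess(bytes.fromhex("B3C2B3BFB5C4B0D6B0D6"))

for byte in (0xB0, 0xB3, 0xB5):
    print(f"HM[{byte:02X}] = {HM[byte]:05b}")
for byte in (0xBF, 0xC2, 0xC4, 0xD6):
    print(f"LM[{byte:02X}] = {LM[byte]:05b}")
```

For the example bytes, this prints 11000, 00011 and 00100 for the high bytes B0, B3 and B5, and 00010, 00001, 00100 and 11000 for the low bytes BF, C2, C4 and D6.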

Fig. 2 shows the result of processing the string "陈晨的爸爸" with the original bit vector method.

Fig. 3 shows the result of processing the string "陈晨的爸爸" with the present bit vector method. It can be seen that:

The two integers corresponding to "陈" give "00011" & "00001" = "00001". That is, "陈" is stored as "B3C2" in the computer; after processing with the bit vector method of the present invention, the high byte B3 corresponds to "00011" and the low byte C2 to "00001", and a bitwise AND (&) of these two numbers yields "00001". This "00001" is exactly the result obtained for "陈" with the prior-art bit vector method, so the two results agree.

Similarly,

"晨" corresponds to "00011" & "00010" = "00010".

"的" corresponds to "00100" & "00100" = "00100".

"爸" corresponds to "11000" & "11000" = "11000".

A comparison with Fig. 2 shows that the two methods produce exactly the same result.
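The agreement holds in general, not only for this example: bit p of HM[h] & LM[l] is set exactly when the character at position p has high byte h and low byte l, which is the condition under which the original method sets bit p for the 16-bit code formed from h and l. The sketch below checks this exhaustively over all 65536 codes; the helper names `build_tables` and `tables_agree` are hypothetical, and patterns are assumed to be raw two-byte characters with the high byte first.

```python
def build_tables(pattern: bytes):
    """Build the 16-bit table M and the two byte tables HM and LM together."""
    m_table = [0] * 65536
    hm, lm = [0] * 256, [0] * 256
    for p in range(len(pattern) // 2):
        high, low = pattern[2 * p], pattern[2 * p + 1]
        bit = 1 << p                          # bit for position p + 1
        m_table[(high << 8) | low] |= bit     # original method
        hm[high] |= bit                       # present method, high byte
        lm[low] |= bit                        # present method, low byte
    return m_table, hm, lm


def tables_agree(pattern: bytes) -> bool:
    """Check M[code] == HM[high] & LM[low] for every 16-bit code."""
    m_table, hm, lm = build_tables(pattern)
    return all(
        m_table[code] == (hm[code >> 8] & lm[code & 0xFF])
        for code in range(65536)
    )


print(tables_agree(bytes.fromhex("B3C2B3BFB5C4B0D6B0D6")))  # True
```

The check also covers codes that do not occur in the pattern, such as B3C4, whose high byte and low byte each appear in the pattern but at different positions, so the AND is 0.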

Some of the correspondences in Fig. 2 and Fig. 3 are explained below.

In Fig. 2,

"陈" = B3C2 (hexadecimal) = 46018 (decimal).

"陈" is the first character of the string "陈晨的爸爸", so the first bit is 1 and the others are 0.

"晨" = B3BF (hexadecimal) = 46015 (decimal) corresponds to "00010" because "晨" is the second character of the string "陈晨的爸爸", so the second bit is 1 and the others are 0.

Likewise, "爸" = B0D6 (hexadecimal) = 45270 (decimal) corresponds to "11000" because "爸" is the fourth and fifth character of the string "陈晨的爸爸", so the fourth and fifth bits are 1 and the others are 0.

In Fig. 3,

C2 corresponds to "00001" in LM, which means that the low byte of the first character of the string "陈晨的爸爸" is C2;

B0 corresponds to "11000" in HM, which means that the high byte of the fourth and fifth characters of the string "陈晨的爸爸" is B0;

B3 corresponds to "00011" in HM, which means that the high byte of the first and second characters of the string "陈晨的爸爸" is B3.

Claims

1. A bit vector method for Chinese character string matching, wherein, for Chinese characters represented in the computer with two bytes each: