CN1542614A - Bit vector method used for Chinese string matching - Google Patents
- Publication number: CN1542614A (application CNA031133800A)
- Authority: CN (China)
- Prior art keywords: integer, Chinese character, byte, character, Chinese
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Document Processing Apparatus (AREA)
Abstract
A bit vector matching method for Chinese character strings, for Chinese characters that are each represented in a computer with two bytes. The two bytes of each Chinese character are mapped to one high-byte integer and one low-byte integer; all the high-byte integers form a high-byte array and all the low-byte integers form a low-byte array. All integers in both arrays are cleared to 0. The whole Chinese pattern string is then scanned and each Chinese character is processed as follows: for the Chinese character at position p in the pattern string, bit p of the corresponding integer in the low-byte array is set to 1, and bit p of the corresponding integer in the high-byte array is set to 1. The integers in the two arrays are then used for matching against the character string being processed. Processing the two bytes of each Chinese character separately greatly reduces the required space.
Description
Technical field
The present invention relates to a bit vector method for string matching.
Background technology
Fuzzy string matching has important applications in fields such as intrusion detection, mobile short-message filtering, text editing, information retrieval, automatic indexing, computational biology, and information extraction, and has become an important topic in computer algorithm design. The problem it solves is: given a character string and a pattern string, find all parts of the character string that are similar to the pattern string.
The classical solution to fuzzy string matching is based on generating a dynamic matrix. Since P. Sellers published this method in 1980, many people have improved it. Among these improvements, one very effective method is the bit vector method.
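As a rough illustration of the dynamic-matrix approach, the following Python sketch keeps one column of the matrix and updates it for each text character, reporting every text position where some substring ends within `k` edit errors of the pattern. The function name `sellers_search` and the threshold parameter `k` are illustrative assumptions; they do not come from the patent.

```python
def sellers_search(pattern, text, k):
    """Return 0-based end positions in text where a substring matches
    pattern with at most k edit errors (insert, delete, substitute)."""
    m = len(pattern)
    # Column for the empty text: matching i pattern characters costs i.
    col = list(range(m + 1))
    hits = []
    for j, tc in enumerate(text):
        prev_diag = col[0]          # value of D[i-1][j-1] for i = 1
        col[0] = 0                  # a match may start at any text position
        for i in range(1, m + 1):
            cur = col[i]            # D[i][j-1], saved before it is overwritten
            col[i] = min(
                prev_diag + (pattern[i - 1] != tc),  # match or substitution
                cur + 1,                             # extra character in text
                col[i - 1] + 1,                      # character missing from text
            )
            prev_diag = cur
        if col[m] <= k:
            hits.append(j)
    return hits

print(sellers_search("abc", "xxabdxx", 1))  # [3, 4]
```

Each text character costs one pass over the whole pattern here; the bit vector methods discussed in this document aim to replace that inner loop with a few word-wide integer operations.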
In current computers, the integer word length is generally 32 or 64 bits, so 32 or 64 single-bit operations can be completed with one integer operation, which can improve computation speed by up to 32 or 64 times. The bit vector method exploits exactly this: when the pattern string is shorter than the integer word length, it improves the performance of string matching.
Existing bit vector method
The bit vector method uses an integer array as long as the character set, so that each character has one corresponding integer. The bit vector method is described as follows:
1) a Chinese character is represented with two bytes;
2) these two bytes are combined and treated as one 16-bit large integer, so each Chinese character corresponds to one large integer, and the integers of all Chinese characters form an array;
3) all integers in this array are cleared to 0;
4) the whole pattern string is scanned from beginning to end, and each character in it is processed as follows: let p be the position of the character in the pattern string; bit p of the integer corresponding to this character is set to 1;
5) matching is performed using the integer in the array that corresponds to the character currently being processed in the character string.
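Steps 1) to 4) can be sketched in Python as below. This is a minimal illustration, not the patent's own code: the function name `old_preprocess` is hypothetical, positions are counted from 1 with position p mapped to the p-th bit from the right, and the sample codes are the 16-bit values of the five-character example string used later in this document.

```python
def old_preprocess(pattern_codes):
    """pattern_codes: one 16-bit character code per pattern character."""
    M = [0] * 65536                  # one integer per possible 16-bit code, all 0
    for i, code in enumerate(pattern_codes):
        M[code] |= 1 << i            # position p = i + 1 sets the p-th bit from the right
    return M

# Sample pattern: B3C2 B3BF B5C4 B0D6 B0D6 (five two-byte characters).
codes = [0xB3C2, 0xB3BF, 0xB5C4, 0xB0D6, 0xB0D6]
M = old_preprocess(codes)
print(format(M[0xB3C2], "05b"))      # 00001
print(format(M[0xB0D6], "05b"))      # 11000
```

Note that the array has 65536 entries even though only four of them are non-zero for this pattern.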
Chinese characters are all represented with two bytes. GB2312 explicitly specifies the two bytes for each Chinese character. For example, the two bytes of each of the 4 distinct Chinese characters contained in 陈晨的爸爸 ("father of Chen Chen") are as follows (in hexadecimal):
陈: B3 C2
晨: B3 BF
的: B5 C4
爸: B0 D6
So 陈晨的爸爸 is stored in the computer as:
B3 C2 B3 BF B5 C4 B0 D6 B0 D6
The existing bit vector method combines these two bytes and treats them as one 16-bit large integer (0~65535), so each Chinese character corresponds to one large integer, for example:
陈 = B3C2 = 46018
晨 = B3BF = 46015
的 = B5C4 = 46532
爸 = B0D6 = 45270
Thus, to convert a Chinese character string into bit vectors, an array M as long as 65536 entries is needed.
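The decimal values follow from shifting the first (high) byte left by 8 bits and OR-ing in the second (low) byte. A small Python check, where the helper name `combine` is an illustrative assumption:

```python
def combine(high, low):
    """Combine two bytes into one 16-bit integer, high byte first."""
    return (high << 8) | low

pairs = [(0xB3, 0xC2), (0xB3, 0xBF), (0xB5, 0xC4), (0xB0, 0xD6)]
codes = [combine(high, low) for high, low in pairs]
print(codes)  # [46018, 46015, 46532, 45270]
```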
Because the method needs an integer array as long as the character set, and there are more than 10,000 Chinese characters, applying it to Chinese character matching requires space for at least 10,000 integers. If an integer occupies 4 bytes, at least 40K bytes are needed.
For Latin-alphabet languages this problem is not noticeable, because the character set is small. For matching Chinese character strings, however, with more than 10,000 characters and more than 40K bytes of required space, the problem becomes much more prominent.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the drawback that the existing bit vector method needs a large amount of space when applied to Chinese; to this end, a new bit vector method is proposed.
In a computer, a Chinese character is represented with two bytes, referred to here as the low byte and the high byte. This method uses two integer arrays of length 256, called the low-byte array and the high-byte array. The low byte of each Chinese character thus corresponds to one integer in the low-byte array, and the high byte corresponds to one integer in the high-byte array.
The new bit vector method of the present invention, for Chinese characters represented in a computer with two bytes, is as follows:
(1) the two bytes are treated as one high-byte integer and one low-byte integer respectively; the high-byte integers and low-byte integers of all Chinese characters form a high-byte array and a low-byte array respectively;
(2) all integers in the low-byte array and the high-byte array are cleared to 0;
(3) the whole Chinese pattern string is scanned from beginning to end, and each Chinese character in it is processed as follows. Let p be the position of the Chinese character in the pattern string; then
first, bit p of the integer in the low-byte array that corresponds to the low byte of this Chinese character is set to 1;
second, bit p of the integer in the high-byte array that corresponds to the high byte of this Chinese character is set to 1;
(4) matching is performed using the integers in the two arrays that correspond to the character currently being processed in the character string.
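Step (4) does not say which matcher consumes the two arrays. As one possible illustration, the sketch below plugs them into exact Shift-And matching, a standard bit-parallel technique chosen here only for the example; the patent targets fuzzy matching and names no specific matcher. The function names `build_tables` and `shift_and_search` are hypothetical, characters are given as (high, low) byte pairs, and the pattern is assumed to be non-empty.

```python
def build_tables(pattern):
    """pattern: list of (high, low) byte pairs, one per two-byte character."""
    HM = [0] * 256
    LM = [0] * 256
    for i, (high, low) in enumerate(pattern):
        HM[high] |= 1 << i
        LM[low] |= 1 << i
    return HM, LM

def shift_and_search(pattern, text):
    """Return 0-based start indices of exact occurrences of pattern in text."""
    HM, LM = build_tables(pattern)
    m = len(pattern)
    accept = 1 << (m - 1)            # bit that signals a full match
    state = 0
    hits = []
    for j, (high, low) in enumerate(text):
        char_mask = HM[high] & LM[low]   # bit i set only if pattern[i] is this character
        state = ((state << 1) | 1) & char_mask
        if state & accept:
            hits.append(j - m + 1)
    return hits

text = [(0xB3, 0xC2), (0xB3, 0xBF), (0xB5, 0xC4), (0xB0, 0xD6), (0xB0, 0xD6)]
print(shift_and_search([(0xB0, 0xD6), (0xB0, 0xD6)], text))  # [3]
```

Python integers are unbounded, so the sketch places no limit on pattern length; in a language with fixed-width integers the pattern would have to fit within one machine word.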
The bit vector method of the invention needs only two integer arrays of length 256. If an integer occupies 4 bytes, 2K bytes are needed in total, which is about 5% of the space needed by the original bit vector method.
Compared with the original bit vector method, the technical measure of processing the two bytes of each Chinese character separately significantly reduces the required space.
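The space figures can be reproduced with a few lines of arithmetic. The sketch assumes 4-byte integers and the round figure of 10,000 Chinese characters used in the text; the variable names are illustrative. It also shows, for comparison, the size of an array indexed directly by the full 16-bit code.

```python
INT_BYTES = 4                        # integer size assumed in the text

old_space = 10_000 * INT_BYTES       # one integer per Chinese character
direct_indexed = 65536 * INT_BYTES   # one integer per possible 16-bit code
new_space = 2 * 256 * INT_BYTES      # two 256-entry arrays

print(old_space, direct_indexed, new_space)   # 40000 262144 2048
print(f"{new_space / old_space:.1%}")         # 5.1%
```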
Description of drawings
Fig. 1 is a flow chart of the character string processing in the present invention;
Fig. 2 shows the integer array corresponding to the character string 陈晨的爸爸 ("father of Chen Chen") after processing with the original bit vector method;
Fig. 3 shows the two integer arrays corresponding to the character string 陈晨的爸爸 ("father of Chen Chen") after processing with the bit vector method of the present invention.
Embodiment
In the present invention, a Chinese character is handled as two small integers (0~255): each Chinese character corresponds to one high-byte integer and one low-byte integer. Taking 陈晨的爸爸 ("father of Chen Chen") as an example:
陈: high-byte integer = B3 = 179, low-byte integer = C2 = 194
晨: high-byte integer = B3 = 179, low-byte integer = BF = 191
的: high-byte integer = B5 = 181, low-byte integer = C4 = 196
爸: high-byte integer = B0 = 176, low-byte integer = D6 = 214
Thus, to convert a Chinese character string into bit vectors, only two arrays of length 256, HM and LM, are needed (see Fig. 3).
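The high-byte and low-byte integers listed above can be read directly from the GB2312 encoding of the string. The following Python sketch uses the standard `gb2312` codec; the helper name `split_bytes` is an illustrative assumption, and the input is assumed to contain only two-byte GB2312 characters.

```python
def split_bytes(text):
    """Return one (high, low) byte pair per character of a GB2312 string."""
    raw = text.encode("gb2312")          # two bytes per Chinese character
    return [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]

for ch, (high, low) in zip("陈晨的爸爸", split_bytes("陈晨的爸爸")):
    print(ch, f"high = {high:02X} = {high}", f"low = {low:02X} = {low}")
```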
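Given a Chinese pattern string p and two integer arrays LM and HM of length 256, the bit vector method of the invention can be implemented as follows:
NewPreprocess(p, LM, HM)
Begin
    For i = 0 to 255 Do
    Begin
        LM[i] = 0;
        HM[i] = 0;
    End
    For i = 1 to m Do
    Begin
        LM[low(pi)](i) = 1;
        HM[high(pi)](i) = 1;
    End
End
Here m is the number of characters in p, low(pi) and high(pi) denote the low byte and the high byte of the i-th character of p, and X(i) denotes bit i of the integer X. The arrays are indexed from 0 to 255 so that every possible byte value has an entry.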
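A direct Python rendering of this preprocessing step might look as follows. It is a sketch under stated assumptions: the function name `new_preprocess` is hypothetical, the pattern is passed as raw bytes with the high byte of each character first, and position i (counted from 1) is mapped to the i-th bit from the right.

```python
def new_preprocess(pattern_bytes):
    """pattern_bytes: two bytes per character, high byte first."""
    LM = [0] * 256                       # low-byte array, cleared to 0
    HM = [0] * 256                       # high-byte array, cleared to 0
    m = len(pattern_bytes) // 2          # number of two-byte characters
    for i in range(m):
        high = pattern_bytes[2 * i]
        low = pattern_bytes[2 * i + 1]
        LM[low] |= 1 << i                # set the bit for this position
        HM[high] |= 1 << i
    return LM, HM

LM, HM = new_preprocess(bytes.fromhex("B3C2B3BFB5C4B0D6B0D6"))
print(format(HM[0xB3], "05b"), format(LM[0xC2], "05b"))  # 00011 00001
```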
Fig. 2 gives the result of processing the character string 陈晨的爸爸 ("father of Chen Chen") with the original bit vector method.
Fig. 3 gives the result of processing the same character string with the bit vector method of the invention. As can be seen, the two integers corresponding to 陈 (B3C2) give "00011" & "00001" = "00001". That is, 陈 is "B3C2" in the computer; after processing with the bit vector method of the invention, the high byte B3 corresponds to "00011" and the low byte C2 corresponds to "00001", and a bitwise AND (&) of these two numbers gives "00001". "00001" is exactly the result obtained for 陈 with the prior-art bit vector method, so the two results agree.
Similarly,
晨 (B3BF) gives "00011" & "00010" = "00010".
的 (B5C4) gives "00100" & "00100" = "00100".
爸 (B0D6) gives "11000" & "11000" = "11000".
Comparison with Fig. 2 shows that the two methods produce exactly the same result.
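The agreement is not specific to this string: bit p of the AND is 1 only when the pattern character at position p has both the same high byte and the same low byte as the character being looked up, which is the same condition the single-array method records. The Python sketch below checks this for every possible byte pair; dictionaries stand in for the arrays to keep it short, and the function names `masks` and `agree` are illustrative assumptions.

```python
def masks(pattern):
    """pattern: list of (high, low) byte pairs. Returns (M, HM, LM) as dicts."""
    M, HM, LM = {}, {}, {}
    for i, (high, low) in enumerate(pattern):
        bit = 1 << i
        code = (high << 8) | low
        M[code] = M.get(code, 0) | bit       # original single-array method
        HM[high] = HM.get(high, 0) | bit     # high-byte array
        LM[low] = LM.get(low, 0) | bit       # low-byte array
    return M, HM, LM

def agree(pattern):
    """True if HM[high] & LM[low] equals the single-array value for all byte pairs."""
    M, HM, LM = masks(pattern)
    return all(
        (HM.get(h, 0) & LM.get(l, 0)) == M.get((h << 8) | l, 0)
        for h in range(256)
        for l in range(256)
    )

example = [(0xB3, 0xC2), (0xB3, 0xBF), (0xB5, 0xC4), (0xB0, 0xD6), (0xB0, 0xD6)]
print(agree(example))  # True
```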
Some of the correspondences in Fig. 2 and Fig. 3 are explained below, using the character string 陈晨的爸爸 ("father of Chen Chen").
In Fig. 2,
陈 = B3C2 (hexadecimal) = 46018 (decimal).
陈 is the first character of the string, so the first bit is 1 and the others are 0.
晨 = B3BF (hexadecimal) = 46015 (decimal) corresponds to "00010": because 晨 is the second character of the string, the second bit is 1 and the others are 0.
As another example, 爸 = B0D6 (hexadecimal) = 45270 (decimal) corresponds to "11000": because 爸 is the fourth and fifth character of the string, the fourth and fifth bits are 1 and the others are 0.
In Fig. 3,
C2 corresponds to "00001" in LM, which means that the low byte of the first character of the string is C2;
B0 corresponds to "11000" in HM, which means that the high byte of the fourth and fifth characters of the string is B0;
B3 corresponds to "00011" in HM, which means that the high byte of the first and second characters of the string is B3.
Claims (1)
1. A bit vector method for Chinese character string matching, characterized in that, for Chinese characters represented in a computer with two bytes:
(1) the two bytes are treated as one high-byte integer and one low-byte integer respectively; the high-byte integers and low-byte integers of all Chinese characters form a high-byte array and a low-byte array respectively;
(2) all integers in the low-byte array and the high-byte array are cleared to 0;
(3) the whole Chinese pattern string is scanned from beginning to end, and each Chinese character in it is processed as follows. Let p be the position of the Chinese character in the pattern string; then
first, bit p of the integer in the low-byte array that corresponds to the low byte of this Chinese character is set to 1;
second, bit p of the integer in the high-byte array that corresponds to the high byte of this Chinese character is set to 1;
(4) matching is performed using the integers in the two arrays that correspond to the character currently being processed in the character string.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA031133800A CN1542614A (en) | 2003-05-01 | 2003-05-01 | Bit vector method used for Chinese string matching |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1542614A true CN1542614A (en) | 2004-11-03 |
Family
ID=34320062
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA031133800A Pending CN1542614A (en) | 2003-05-01 | 2003-05-01 | Bit vector method used for Chinese string matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1542614A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472202A (en) * | 2019-08-12 | 2019-11-19 | 西安空间无线电技术研究所 | A kind of information insertion and extracting method based on Unicode coding |
CN110472202B (en) * | 2019-08-12 | 2023-08-01 | 西安空间无线电技术研究所 | Unicode-based information embedding and extracting method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1713173A (en) | Method and system for converting encoding character set | |
CN1828557A (en) | Process mapping realization method in embedded type operation system | |
CN1825306A (en) | XML data storage and access method based on relational database | |
CN101055593A (en) | Tibetan web page and its code identification method | |
CN1652109A (en) | Method and apparatus replication of binary large object data | |
CN1542614A (en) | Bit vector method used for Chinese string matching | |
CN1492359A (en) | Automatic state machine searching and matching method of multiple key words | |
CN100343851C (en) | Database compression and decompression method | |
CN1658513A (en) | Arithmetic coding decoding method implemented by table look-up | |
CN1694092A (en) | Method for global search of text containing four-byte character | |
CN1131768A (en) | Data processing system and data processing method | |
CN1235386C (en) | Efficient separating method of long distance call area number | |
Kärkkäinen et al. | Engineering external memory LCP array construction: Parallel, in-place and large alphabet | |
CN1067783C (en) | Transfering generation tech. based on Sc grammar | |
CN1275127C (en) | Chinese characters input method according to stroke sequence and keyboard thereof | |
CN1169569A (en) | Character pattern generating apparatus capable of easily generating characters of plurality of different fonts | |
CN1885316A (en) | Data information encoding method | |
CN101052109A (en) | Anti-flash word stock processing method and its application chip | |
CN1095573C (en) | Quick character and word identification method | |
CN1313942C (en) | Method, equipment and system for implementing data processing on operating system level | |
CN1253779C (en) | Chinese characters coding method of coordinates codes and its input keyboards | |
CN1673935A (en) | Jiaguwen (inscriptions on bones or tortoise shells of the Shang Dynasty) computer inputting method | |
CN1632729A (en) | A digital code Chinese character input method and its keyboard | |
CN1107255C (en) | Infinite ordered character set Chinese character whole set method and system | |
CN1838119A (en) | Feedback display method and system for network dictionary retrieve results |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |