CN1542614A - Bit vector method used for Chinese string matching - Google Patents

Publication number
CN1542614A
Authority
CN
China
Prior art keywords
integer
chinese character
byte
character
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA031133800A
Other languages
Chinese (zh)
Inventor
陈开渠
赵洁
彭志威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CNA031133800A
Publication of CN1542614A

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

A bit vector matching method for Chinese character strings, for Chinese characters that a computer represents with two bytes each. The two bytes of each Chinese character are mapped to a high-byte integer and a low-byte integer; the integers for all high bytes form a high-byte array and the integers for all low bytes form a low-byte array. All integers in both arrays are cleared to 0. The whole Chinese pattern string is then scanned: for the Chinese character at position p in the pattern string, the p-th bit of the integer that its low byte selects in the low-byte array is set to 1, and the p-th bit of the integer that its high byte selects in the high-byte array is set to 1. The string being processed is then matched using the integers in the two arrays. Handling the two bytes of each Chinese character separately greatly reduces the space required.

Description

A bit vector method for Chinese character string matching
Technical field
The present invention relates to a bit vector method for string matching.
Background art
Fuzzy string matching has important applications in fields such as intrusion detection, mobile short-message filtering, text editing, information retrieval, automatic indexing, computational biology, and information extraction, and has become an important topic in algorithm design. The problem it solves is: given a character string and a pattern string, find all parts of the character string that are similar to the pattern string.
The classical solution to fuzzy string matching is based on building a dynamic-programming matrix. Since P. Sellers published this method in 1980, many people have improved on it. One very effective improvement is the bit vector method.
In today's computers the integer word length is generally 32 or 64 bits, so 32 or 64 single-bit operations can be completed with one integer operation, speeding up the computation by up to 32 or 64 times. The bit vector method exploits exactly this: when the pattern string is shorter than the integer word length, it improves string matching performance.
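As a rough illustration of why one integer operation can stand in for many bit operations, the sketch below ANDs two 32-bit masks once bit by bit and once with a single `&`. The function names and the fixed 32-bit width are assumptions made for this example and do not come from the original text.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1  # Python ints are unbounded, so clip to 32 bits


def and_bit_by_bit(a: int, b: int) -> int:
    """AND two words one bit at a time: 32 separate bit operations."""
    result = 0
    for k in range(WORD_BITS):
        bit = ((a >> k) & 1) & ((b >> k) & 1)
        result |= bit << k
    return result


def and_one_operation(a: int, b: int) -> int:
    """AND two words with a single integer operation."""
    return a & b & WORD_MASK


print(format(and_bit_by_bit(0b11000, 0b01010), "05b"))     # 01000
print(format(and_one_operation(0b11000, 0b01010), "05b"))  # 01000
```

Both functions return the same word; the second does in one step what the first does in 32.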
Existing bit vector method
The bit vector method uses an integer array as long as the character set, so that every character has a corresponding integer. Applied to Chinese characters, it works as follows:
1) A Chinese character is represented with two bytes;
2) The two bytes are combined and treated as one 16-bit large integer, so each Chinese character corresponds to one large integer, and the integers for all Chinese characters form an array;
3) All integers in this array are cleared to 0;
4) The whole pattern string is scanned from beginning to end, and each character in it is handled as follows: let p be the character's position in the pattern string; the p-th bit of the integer corresponding to this character is set to 1;
5) Matching is performed using the integers in the array that correspond to the characters of the string being processed.
Chinese characters are all represented with two bytes. GB2312 specifies the two bytes for each Chinese character. For example, the 4 distinct Chinese characters in "陈晨的爸爸" ("Chen Chen's father") correspond to the following byte pairs (hexadecimal):
陈: B3 C2
晨: B3 BF
的: B5 C4
爸: B0 D6
So "陈晨的爸爸" is stored in the computer as:
B3 C2 B3 BF B5 C4 B0 D6 B0 D6
The existing bit vector method treats the two bytes together as one 16-bit large integer (0~65535), so each Chinese character corresponds to one large integer, for example:
"陈" = B3C2 = 46018
"晨" = B3BF = 46015
"的" = B5C4 = 46532
"爸" = B0D6 = 45270
Thus, to convert a Chinese character string into bit vectors, an array M with up to 65536 entries is needed.
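The existing method's preprocessing can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `preprocess_original`, the direct indexing by 16-bit code, and the convention that position p maps to bit p-1 (so position 1 is the rightmost bit) are assumptions made for this example. The pattern is given as the raw bytes of the five-character example string.

```python
def preprocess_original(pattern: bytes) -> list:
    """Build one table M indexed by the 16-bit code of each character."""
    if len(pattern) % 2 != 0:
        raise ValueError("expected two bytes per character")
    table = [0] * 65536                      # step 3: clear every integer
    for i in range(len(pattern) // 2):       # step 4: scan the pattern
        high = pattern[2 * i]
        low = pattern[2 * i + 1]
        code = (high << 8) | low             # step 2: combine the two bytes
        table[code] |= 1 << i                # position i+1 -> bit i
    return table


# Bytes of the example string: B3C2 B3BF B5C4 B0D6 B0D6
PATTERN = bytes.fromhex("B3C2B3BFB5C4B0D6B0D6")
M = preprocess_original(PATTERN)
print(format(M[0xB3C2], "05b"))  # 00001
print(format(M[0xB0D6], "05b"))  # 11000
```

Only four of the 65536 entries are non-zero for this pattern, which is the wasted space the rest of the text sets out to reduce.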
Because the method needs an integer array as long as the character set, and there are more than 10,000 Chinese characters, matching Chinese text requires space for more than 10,000 integers. With 4-byte integers, that is about 40 KB. (An array indexed directly by the 16-bit code, with 65536 entries, would take 256 KB.)
Outside China, Latin scripts use small character sets, so this problem is not significant there. For matching Chinese character strings, however, with more than 10,000 characters and more than 40 KB of space required, the problem becomes prominent.
Summary of the invention
The technical problem addressed by the present invention is the large amount of space that the existing bit vector method needs when applied to Chinese. To overcome this shortcoming, a new bit vector method is proposed.
In a computer, a Chinese character is represented with two bytes, referred to here as the low byte and the high byte. The method uses two integer arrays of length 256, called the low-byte array and the high-byte array. The low byte of each Chinese character thus corresponds to one integer in the low-byte array, and the high byte to one integer in the high-byte array.
The new bit vector method of the present invention handles Chinese characters represented with two bytes in a computer as follows:
(1) The two bytes are mapped to a high-byte integer and a low-byte integer respectively; the high-byte integers and low-byte integers for all Chinese characters form a high-byte array and a low-byte array respectively;
(2) All integers in the low-byte array and the high-byte array are cleared to 0;
(3) The whole Chinese pattern string is scanned from beginning to end, and each Chinese character in it is handled as follows. Let p be the position of the Chinese character in the pattern string; then
first, the p-th bit of the integer in the low-byte array corresponding to the character's low byte is set to 1;
second, the p-th bit of the integer in the high-byte array corresponding to the character's high byte is set to 1;
(4) Matching is performed using the integers in the two arrays that correspond to the characters of the string being processed.
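The text does not spell out how step (4) uses the two arrays. As one possible illustration, the sketch below plugs them into the well-known Shift-And algorithm for exact matching. This is a swapped-in technique chosen for brevity, not the fuzzy matching the background discusses. The names `build_byte_tables` and `shift_and_search` are hypothetical, the text is assumed to consist entirely of two-byte characters, and reported positions are zero-based character indices.

```python
def build_byte_tables(pattern: bytes):
    """Steps (1)-(3): fill the low-byte and high-byte arrays."""
    lm = [0] * 256
    hm = [0] * 256
    for i in range(len(pattern) // 2):
        hm[pattern[2 * i]] |= 1 << i        # high byte, position i+1
        lm[pattern[2 * i + 1]] |= 1 << i    # low byte, position i+1
    return lm, hm


def shift_and_search(text: bytes, pattern: bytes) -> list:
    """Step (4), illustrated with exact Shift-And matching."""
    m = len(pattern) // 2
    if m == 0:
        raise ValueError("pattern must contain at least one character")
    lm, hm = build_byte_tables(pattern)
    accept = 1 << (m - 1)                   # bit set when all m characters matched
    state = 0
    hits = []
    for j in range(len(text) // 2):
        # AND of the two integers gives the positions where this character occurs
        mask = hm[text[2 * j]] & lm[text[2 * j + 1]]
        state = ((state << 1) | 1) & mask
        if state & accept:
            hits.append(j - m + 1)
    return hits


TEXT = bytes.fromhex("B3C2B3BFB5C4B0D6B0D6")
print(shift_and_search(TEXT, bytes.fromhex("B0D6B0D6")))  # [3]
print(shift_and_search(TEXT, bytes.fromhex("B3BFB5C4")))  # [1]
```

The only change from a single-table Shift-And is the line that computes `mask` from two lookups instead of one.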
The present bit vector method needs only two integer arrays of length 256. With 4-byte integers they take 2 KB in total, about 5% of the space of the original bit vector method.
Compared with the original bit vector method, handling the two bytes of each Chinese character separately significantly reduces the space required.
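The space figures can be checked with a few lines of arithmetic. The constant names are chosen for this example, and the 10,000-character count is the rough figure used in the text.

```python
INT_BYTES = 4            # size of one integer, as assumed in the text
CHARSET_SIZE = 10_000    # approximate number of Chinese characters
BYTE_VALUES = 256        # possible values of one byte

original_bytes = CHARSET_SIZE * INT_BYTES   # one integer per character
new_bytes = 2 * BYTE_VALUES * INT_BYTES     # low-byte array + high-byte array
ratio = new_bytes / original_bytes

print(original_bytes)    # 40000
print(new_bytes)         # 2048
print(f"{ratio:.1%}")    # 5.1%
```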
Description of drawings
Fig. 1 is a flowchart of the character string processing in the present invention;
Fig. 2 shows the integer array for the string "陈晨的爸爸" after processing with the original bit vector method;
Fig. 3 shows the two integer arrays for the string "陈晨的爸爸" after processing with the bit vector method of the present invention.
Embodiment
In the present invention, a Chinese character is handled as two small integers (0~255): one high-byte integer and one low-byte integer. Taking "陈晨的爸爸" as an example:
"陈": high-byte integer = B3 = 179, low-byte integer = C2 = 194
"晨": high-byte integer = B3 = 179, low-byte integer = BF = 191
"的": high-byte integer = B5 = 181, low-byte integer = C4 = 196
"爸": high-byte integer = B0 = 176, low-byte integer = D6 = 214
Thus, converting a Chinese character string into bit vectors needs only two arrays of length 256, HM and LM (see Fig. 3).
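Splitting a two-byte code into its high-byte and low-byte integers is a shift and a mask. The helper name `split_character` is an assumption for this sketch; the four codes are the ones listed above.

```python
def split_character(code: int):
    """Return (high byte, low byte) of a 16-bit character code."""
    return (code >> 8) & 0xFF, code & 0xFF


for code in (0xB3C2, 0xB3BF, 0xB5C4, 0xB0D6):
    high, low = split_character(code)
    print(f"{code:04X}: high={high} low={low}")
# B3C2: high=179 low=194
# B3BF: high=179 low=191
# B5C4: high=181 low=196
# B0D6: high=176 low=214
```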
Given a Chinese pattern string p and two integer arrays LM and HM of length 256, the present bit vector method can be implemented as follows:
  NewPreprocess(p, LM, HM)
  Begin
    For i=0 to 255 Do
    Begin
      LM[i]=0;
      HM[i]=0;
    End
    For i=1 to m Do
    Begin
      LM[low(pi)](i)=1;
      HM[high(pi)](i)=1;
    End
  End
Here m is the length of p in characters, low(pi) and high(pi) denote the low byte and high byte of the i-th character of p, and X(i)=1 sets the i-th bit of the integer X to 1. The arrays are indexed 0 to 255 so that every byte value has an entry.
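A runnable counterpart of the preprocessing routine might look like the sketch below. It is an illustration under stated assumptions: the function name `new_preprocess` is hypothetical, the pattern is passed as raw bytes with the high byte first in each pair, and position i is stored in bit i-1 so that position 1 prints as the rightmost digit.

```python
def new_preprocess(pattern: bytes):
    """Fill the low-byte array LM and the high-byte array HM for a pattern."""
    if len(pattern) % 2 != 0:
        raise ValueError("expected two bytes per character")
    lm = [0] * 256                          # one integer per possible low byte
    hm = [0] * 256                          # one integer per possible high byte
    m = len(pattern) // 2                   # pattern length in characters
    for i in range(1, m + 1):
        high = pattern[2 * (i - 1)]
        low = pattern[2 * (i - 1) + 1]
        lm[low] |= 1 << (i - 1)             # set the i-th bit for the low byte
        hm[high] |= 1 << (i - 1)            # set the i-th bit for the high byte
    return lm, hm


# Bytes of the example string: B3C2 B3BF B5C4 B0D6 B0D6
LM, HM = new_preprocess(bytes.fromhex("B3C2B3BFB5C4B0D6B0D6"))
print(format(HM[0xB3], "05b"))  # 00011
print(format(LM[0xC2], "05b"))  # 00001
print(format(HM[0xB0], "05b"))  # 11000
```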
Fig. 2 gives the result of processing the string "陈晨的爸爸" with the original bit vector method.
Fig. 3 gives the result of processing the same string with the present bit vector method. As can be seen, the two integers for "陈" give "00011" & "00001" = "00001". That is, "陈" is stored in the computer as "B3C2"; after processing with the bit vector method of the present invention, the high byte B3 corresponds to "00011" and the low byte C2 corresponds to "00001", and the bitwise AND (&) of these two numbers is "00001". "00001" is exactly the result that the prior-art bit vector method produces for "陈", so the two results agree.
Similarly,
"晨" gives "00011" & "00010" = "00010".
"的" gives "00100" & "00100" = "00100".
"爸" gives "11000" & "11000" = "11000".
Comparing these with Fig. 2 shows that the two methods produce identical results.
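The agreement is not specific to this example. Bit p is set in both the high-byte integer and the low-byte integer only when the pattern character at position p has exactly that high byte and that low byte, which is the same condition under which the single-table method sets bit p. The sketch below checks this over all 65536 byte pairs; the function names are assumptions for this example.

```python
def build_single_table(pattern: bytes) -> list:
    """Original method: one table indexed by the 16-bit code."""
    table = [0] * 65536
    for i in range(len(pattern) // 2):
        code = (pattern[2 * i] << 8) | pattern[2 * i + 1]
        table[code] |= 1 << i
    return table


def build_byte_tables(pattern: bytes):
    """New method: one table per byte."""
    lm = [0] * 256
    hm = [0] * 256
    for i in range(len(pattern) // 2):
        hm[pattern[2 * i]] |= 1 << i
        lm[pattern[2 * i + 1]] |= 1 << i
    return lm, hm


def tables_agree(pattern: bytes) -> bool:
    """True if HM[h] & LM[l] equals M[h:l] for every byte pair."""
    m_table = build_single_table(pattern)
    lm, hm = build_byte_tables(pattern)
    return all(
        m_table[(h << 8) | l] == (hm[h] & lm[l])
        for h in range(256)
        for l in range(256)
    )


PATTERN = bytes.fromhex("B3C2B3BFB5C4B0D6B0D6")
print(tables_agree(PATTERN))  # True
```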
Some of the correspondences in Fig. 2 and Fig. 3 are explained below.
In Fig. 2,
"陈" = B3C2 (hexadecimal) = 46018 (decimal).
"陈" is the first character of the string "陈晨的爸爸", so the first bit (the rightmost in the strings shown) is 1 and the others are 0.
"晨" = B3BF (hexadecimal) = 46015 (decimal) corresponds to "00010": "晨" is the second character of the string "陈晨的爸爸", so the second bit is 1 and the others are 0.
Likewise, "爸" = B0D6 (hexadecimal) = 45270 (decimal) corresponds to "11000": "爸" is the fourth and fifth character of the string "陈晨的爸爸", so the fourth and fifth bits are 1 and the others are 0.
In Fig. 3,
C2 corresponds to "00001" in LM, meaning that the low byte of the first character of the string "陈晨的爸爸" is C2;
B0 corresponds to "11000" in HM, meaning that the high byte of the fourth and fifth characters of the string "陈晨的爸爸" is B0;
B3 corresponds to "00011" in HM, meaning that the high byte of the first and second characters of the string "陈晨的爸爸" is B3.

Claims (1)

1. A bit vector method for Chinese character string matching, wherein Chinese characters represented with two bytes in a computer are handled as follows:
(1) The two bytes are mapped to a high-byte integer and a low-byte integer respectively; the high-byte integers and low-byte integers for all Chinese characters form a high-byte array and a low-byte array respectively;
(2) All integers in the low-byte array and the high-byte array are cleared to 0;
(3) The whole Chinese pattern string is scanned from beginning to end, and each Chinese character in it is handled as follows. Let p be the position of the Chinese character in the pattern string; then
first, the p-th bit of the integer in the low-byte array corresponding to the character's low byte is set to 1;
second, the p-th bit of the integer in the high-byte array corresponding to the character's high byte is set to 1;
(4) Matching is performed using the integers in the two arrays that correspond to the characters of the string being processed.
CNA031133800A 2003-05-01 2003-05-01 Bit vector method used for Chinese string matching Pending CN1542614A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA031133800A CN1542614A (en) 2003-05-01 2003-05-01 Bit vector method used for Chinese string matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA031133800A CN1542614A (en) 2003-05-01 2003-05-01 Bit vector method used for Chinese string matching

Publications (1)

Publication Number Publication Date
CN1542614A (en) 2004-11-03

Family

ID=34320062

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA031133800A Pending CN1542614A (en) 2003-05-01 2003-05-01 Bit vector method used for Chinese string matching

Country Status (1)

Country Link
CN (1) CN1542614A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472202A (en) * 2019-08-12 2019-11-19 西安空间无线电技术研究所 A kind of information insertion and extracting method based on Unicode coding
CN110472202B (en) * 2019-08-12 2023-08-01 西安空间无线电技术研究所 Unicode-based information embedding and extracting method

Similar Documents

Publication Publication Date Title
CN1713173A (en) Method and system for converting encoding character set
CN1828557A (en) Process mapping realization method in embedded type operation system
CN1825306A (en) XML data storage and access method based on relational database
CN101055593A (en) Tibetan web page and its code identification method
CN1652109A (en) Method and apparatus replication of binary large object data
CN1542614A (en) Bit vector method used for Chinese string matching
CN1492359A (en) Automatic state machine searching and matching method of multiple key words
CN100343851C (en) Database compression and decompression method
CN1658513A (en) Arithmetic coding decoding method implemented by table look-up
CN1694092A (en) Method for global search of text containing four-byte character
CN1131768A (en) Data processing system and data processing method
CN1235386C (en) Efficient separating method of long distance call area number
Kärkkäinen et al. Engineering external memory LCP array construction: Parallel, in-place and large alphabet
CN1067783C (en) Transfering generation tech. based on Sc grammar
CN1275127C (en) Chinese characters input method according to stroke sequence and keyboard thereof
CN1169569A (en) Character pattern generating apparatus capable of easily generating characters of plurality of different fonts
CN1885316A (en) Data information encoding method
CN101052109A (en) Anti-flash word stock processing method and its application chip
CN1095573C (en) Quick character and word identification method
CN1313942C (en) Method, equipment and system for implementing data processing on operating system level
CN1253779C (en) Chinese characters coding method of coordinates codes and its input keyboards
CN1673935A (en) Jiaguwen (inscriptions on bones or tortoise shells of the Shang Dynasty) computer inputting method
CN1632729A (en) A digital code Chinese character input method and its keyboard
CN1107255C (en) Infinite ordered character set Chinese character whole set method and system
CN1838119A (en) Feedback display method and system for network dictionary retrieve results

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication