CN1694092A

CN1694092A - Method for full-text search of text containing four-byte characters

Info

Publication number: CN1694092A
Application number: CN 200510011824
Authority: CN
Inventors: 赵锋; 王宏源
Original assignee: 王宏源
Current assignee: Wang Fei
Priority date: 2005-05-31
Filing date: 2005-05-31
Publication date: 2005-11-09
Anticipated expiration: 2025-05-31
Also published as: CN1694092B

Abstract

The invention discloses a method for full-text search of text containing four-byte characters. When building the index, the characters in the text stream are examined one by one to judge whether the character to be indexed is a four-byte character. If it is, the single four-byte character is added to the inverted index as an index unit; if it is not, keywords are determined by the search engine's conventional word segmentation, and each keyword is added to the inverted index as an index unit. When searching, the characters in the query string are likewise examined one by one to judge whether the character to be queried is a four-byte character. If it is, the single four-byte character is taken as a query word; if it is not, keywords are determined by the search engine's conventional word segmentation and used as query words. All query words are then collected and sent to the search engine for the query. The invention does not affect index-building speed or search speed.

Description

Method for full-text search of text containing four-byte characters

Technical field

The invention belongs to the field of computer technology, and in particular relates to a method for full-text search of text containing four-byte characters.

Background technology

Every region of the world has its own local language, and regional differences directly give rise to different language environments. In developing an internationalized program, the handling of written language is the most important part. The characters commonly used by computers to process written language are generally double-byte. A double-byte character occupies two bytes (that is, 16 bits), called the high byte and the low byte. The Chinese character encoding specified in China is GB2312, which is a double-byte encoding. At present almost every application that can handle Chinese supports GB2312. GB2312 contains the level-1 and level-2 Chinese characters and nine zones of symbols; the high byte ranges from 0xa1 to 0xfe and the low byte likewise from 0xa1 to 0xfe, with Chinese characters encoded in the range 0xb0a1 to 0xf7fe.

As the number of characters to be handled grows and the range of use widens, double-byte character encoding alone is no longer sufficient for some applications. For example, once the total number of Chinese characters exceeds 20,000, they cannot be managed with double bytes alone. Multibyte and wide-character (Multibyte/Wide Char) encodings are therefore also in use. Put simply, a multibyte encoding is an external code, generally variable-length, used mainly for information storage and exchange; a wide-character encoding is an internal code, fixed-length, in which one character usually corresponds to four bytes, used mainly for information processing. Common multibyte encodings include UTF-8, the ISO8859 series, GB2312, GBK and EUC-JP.

GB18030 is the latest national standard for the Chinese coded character set and is backward compatible with the GBK and GB2312 standards. GB18030 is a variable-length encoding of one, two or four bytes. The one-byte part runs from 0x00 to 0x7F and is compatible with ASCII. In the two-byte part, the first byte runs from 0x81 to 0xFE and the trail byte from 0x40 to 0x7E and from 0x80 to 0xFE, which is essentially compatible with the GBK standard. In the four-byte part, the first byte runs from 0x81 to 0xFE and the second byte from 0x30 to 0x39, and the third and fourth bytes have the same ranges as the first and second bytes respectively. The four-byte part covers all Unicode 3.1 code positions from 0x0080 onward, except those already covered by the two-byte part. Unicode has one characteristic: it includes all character glyphs in the world, so every regional language can establish a mapping to Unicode.
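The byte ranges above are enough to tell how long a GB18030 character is from its first two bytes. The following is a minimal sketch of that check, not part of the patent text; the function name `gb18030_char_length` is a hypothetical helper introduced for illustration, and the sketch only validates the ranges quoted above.

```python
def gb18030_char_length(data: bytes, i: int = 0) -> int:
    """Return the byte length (1, 2 or 4) of the GB18030 character at data[i]."""
    first = data[i]
    if first <= 0x7F:
        return 1  # one-byte part, ASCII compatible
    if 0x81 <= first <= 0xFE and i + 1 < len(data):
        second = data[i + 1]
        if 0x30 <= second <= 0x39:
            # four-byte part: bytes 3 and 4 repeat the ranges of bytes 1 and 2
            if (i + 3 < len(data)
                    and 0x81 <= data[i + 2] <= 0xFE
                    and 0x30 <= data[i + 3] <= 0x39):
                return 4
        elif 0x40 <= second <= 0xFE and second != 0x7F:
            return 2  # two-byte part, GBK compatible
    raise ValueError(f"invalid GB18030 sequence at offset {i}")
```

Because a two-byte trail byte never falls in 0x30 to 0x39, the second byte alone distinguishes the two-byte form from the four-byte form.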

At present search engines generally use an inverted-file index structure and perform indexing and querying on keywords. An index is generated roughly as follows:

1. Suppose there are two articles, 1 and 2.

The content of article 1 is: Tommy lives in Beijing, I live in Beijing too.

The content of article 2 is: He once lived in Shanghai.

2. First obtain the keywords of the two articles:

(1) What we have is the article content, that is, a character string, so all the words in the string must first be found; this is word segmentation. English words are separated by spaces and are relatively easy to handle. Chinese words run together and need special segmentation, usually either the bigram method or a vocabulary-based method.

(2) Words in the articles such as "in", "once" and "too" usually carry no practical meaning, and in Chinese, function words such as "是" ("is") likewise usually have no concrete meaning. Such words, which do not represent concepts, can be filtered out or kept according to customer requirements.

(3) When looking up "He", a user usually also wants articles containing "he" and "HE" to be found, so the case of all words must be unified.

(4) When looking up "live", a user usually also wants articles containing "lives" and "lived" to be found, so "lives" and "lived" must be reduced to "live".

(5) Punctuation marks in the articles are usually filtered out as well.

3. After the above processing,

All keywords of article 1 are: [tommy] [live] [beijing] [i] [live] [beijing]

All keywords of article 2 are: [he] [live] [shanghai]

4. Build the inverted index.

An ordinary index maps "article number" to "all keywords in the article". An inverted index turns this relation around, mapping "keyword" to "the numbers of all articles that contain the keyword". After inversion, articles 1 and 2 become:

Keyword     Article number
beijing     1
he          2
i           1
live        1, 2
shanghai    2
tommy       1
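Steps 1 to 4 above can be sketched in a few lines. This is an illustrative sketch only: the stop-word set and the `STEMS` lookup are toy stand-ins chosen to match the two example articles, not a real stop list or stemmer, and the function names are hypothetical.

```python
import re
from collections import defaultdict

STOP_WORDS = {"in", "once", "too"}           # assumed stop words, from the example only
STEMS = {"lives": "live", "lived": "live"}   # toy stand-in for a real stemmer

def keywords(text):
    """Lowercase, drop punctuation and stop words, reduce inflected forms."""
    words = re.findall(r"[a-z]+", text.lower())
    return [STEMS.get(w, w) for w in words if w not in STOP_WORDS]

def build_inverted_index(articles):
    """Map each keyword to the sorted list of article numbers containing it."""
    index = defaultdict(set)
    for number, text in articles.items():
        for word in keywords(text):
            index[word].add(number)
    return {word: sorted(index[word]) for word in sorted(index)}

articles = {
    1: "Tommy lives in Beijing, I live in Beijing too.",
    2: "He once lived in Shanghai.",
}
inverted = build_inverted_index(articles)
```

With these two articles, `inverted` holds the same keyword-to-article mapping as the table above, with keywords in character order.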

Knowing only which articles a keyword appears in is usually not enough; the number of occurrences and the positions of the keyword within each article are also needed. Two kinds of position are commonly used: a) character position, which records which character of the article the word starts at; b) keyword position, which records which keyword of the article the word is.

After finally adding information such as "occurrence count" and "positions", the index structure becomes:

Keyword     Article number [occurrence count]     Positions
beijing     1[2]                                  3, 6
he          2[1]                                  1
i           1[1]                                  4
live        1[2], 2[1]                            2, 5, 2
shanghai    2[1]                                  3
tommy       1[1]                                  1

Take the "live" row as an example of this structure: "live" occurs twice in article 1 and once in article 2, and its positions are "2, 5, 2". Here "2, 5" are the two positions at which it occurs in article 1, and the remaining "2" means that "live" is the 2nd keyword in article 2.
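The occurrence counts and keyword positions can be derived in one pass over each article's keyword list. The sketch below is illustrative, with hypothetical names; it uses keyword positions (numbered from 1), as in the "live" example, rather than character positions.

```python
# Keyword lists of the two example articles, as produced in step 3.
docs = {
    1: ["tommy", "live", "beijing", "i", "live", "beijing"],
    2: ["he", "live", "shanghai"],
}

def build_positional_index(docs):
    """Map keyword -> {article number -> list of 1-based keyword positions}."""
    index = {}
    for number, words in docs.items():
        for position, word in enumerate(words, start=1):
            index.setdefault(word, {}).setdefault(number, []).append(position)
    return index

positional = build_positional_index(docs)

# The occurrence count is simply the length of each position list.
frequency = {
    word: {number: len(positions) for number, positions in posting.items()}
    for word, posting in positional.items()
}
```

Here `positional["live"]` records positions 2 and 5 in article 1 and position 2 in article 2, matching the row discussed above.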

Keywords are normally sorted in character order. In an implementation, the three columns above are stored as the term dictionary file (TermDictionary), the frequency file (frequencies) and the position file (positions) respectively. The term dictionary stores each keyword and also keeps pointers into the frequency file and the position file, through which the frequency and position information of the keyword can be found.

For Chinese, full-text indexing must solve the problem of language analysis. Western languages such as English take the word as the basic unit, and words are separated by spaces, which makes them relatively easy to handle. Asian languages such as Chinese, Japanese and Korean take the character as the basic unit; characters are written side by side with no spaces separating words, so the words in a sentence must be segmented out.

Generally, single characters (unigrams) are not used as index units, so as to avoid producing too many invalid query results.

At present two segmentation methods are commonly used: the bigram method and the vocabulary-based method. For example, consider segmenting "北京天安门" ("Tian An-men, Beijing").

Segmenting by the bigram method gives "北京 京天 天安 安门". At query time, whether the query is "北京" (Beijing) or "天安门" (Tian An-men), the query phrase is segmented by the same rule, into "北京" or into "天安" and "安门", and multiple keywords are combined by logical AND, so the query still maps correctly onto the corresponding index entries. This method is equally applicable to other Asian languages such as Korean and Japanese.
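The bigram rule can be sketched as a sliding window of two characters. The example below is a hypothetical illustration using the phrase 北京天安门 (rendered "Tian An-men, Beijing" in the translation); the function name `bigrams` is an assumption, not a name from the patent.

```python
def bigrams(text):
    """Split text into overlapping two-character units."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

indexed = set(bigrams("北京天安门"))   # units stored at indexing time
query_a = bigrams("北京")             # query for "Beijing"
query_b = bigrams("天安门")           # query for "Tian An-men"

# Every unit of either query is found among the indexed units,
# so an AND over the query units matches the indexed phrase.
matches_a = all(unit in indexed for unit in query_a)
matches_b = all(unit in indexed for unit in query_b)
```

Since the same rule is applied at indexing and at query time, no dictionary is needed for the query units to line up with the index.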

With vocabulary-based segmentation, depending on the complexity of the syntactic and semantic processing the search engine applies, the result may be, for example, "Tian An-men, Beijing" kept as one term or split into "Beijing" and "Tian An-men".

At present the language analysis of the larger search engines generally combines the two mechanisms above.

Existing search engines are constrained by the bigram method described above and cannot retrieve four-byte characters effectively.

Summary of the invention

The invention overcomes the above deficiency in retrieving four-byte characters and provides a retrieval method that can be combined with the retrieval of double-byte characters.

Technical content of the invention: a method for full-text search of text containing four-byte characters, comprising the following steps:

(1) When building the index, first examine the characters in the text stream one by one to judge whether the character to be indexed is a four-byte character.

(2) If it is a four-byte character, add the single four-byte character to the index table as an index unit and build its inverted index; if it is not a four-byte character, determine keywords by the search engine's conventional segmentation method and add each keyword to the inverted index as an index unit.

(3) When querying, first examine the characters in the text stream one by one to judge whether the character to be queried is a four-byte character.

(4) If it is a four-byte character, take the single four-byte character as a query word; if it is not a four-byte character, determine keywords by the search engine's conventional segmentation method and take each keyword obtained as a query word.

(5) Send the set of query words so obtained to the search engine for the query.

If the characters are double-byte characters, the search engine may determine keywords by a segmentation method that combines the bigram method with the vocabulary-based method.

If the characters are Western single-byte characters, the search engine may determine keywords by the natural word separation of Western text.

All the query words obtained may be joined by logical AND to form the query condition, which is then sent to the search engine.

Technical effect of the invention: the invention does not need to adopt four-byte characters as the basic unit of the index table, so it occupies very little index space and does not affect the speed of index building or the speed of retrieval.

Embodiment

The invention modifies the existing indexing and retrieval processes by adding special handling for four-byte characters. The details are as follows:

First, when building the index, suppose an index is to be built for the text stream T_1, T_2, ..., T_n (each T_i is one byte).

Read T_i, T_{i+1}, T_{i+2}, T_{i+3} and let W = T_i T_{i+1} T_{i+2} T_{i+3}.

If W is a four-byte character, add the index term W and its corresponding inverted-index entry to the index table, and set i = i + 4.

Otherwise let W = T_i T_{i+1}, build the index with the original segmentation method, and advance i to the corresponding position.

Repeat the above process until all of the text has been processed.
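The indexing loop above can be sketched as follows, under stated assumptions: the byte stream is valid GB18030, and `conventional_terms` is a hypothetical placeholder for the engine's original segmentation method (here, lowercase ASCII words plus bigrams of non-ASCII runs). All function names are introduced for illustration and do not come from the patent.

```python
import re
from collections import defaultdict

def is_four_byte(data, i):
    """True if a GB18030 four-byte character starts at data[i]."""
    return (i + 3 < len(data)
            and 0x81 <= data[i] <= 0xFE and 0x30 <= data[i + 1] <= 0x39
            and 0x81 <= data[i + 2] <= 0xFE and 0x30 <= data[i + 3] <= 0x39)

def conventional_terms(run):
    """Placeholder segmenter for a run of one- and two-byte characters."""
    terms = []
    for chunk in re.findall(r"[A-Za-z0-9]+|[^\x00-\x7f]+", run.decode("gb18030")):
        if chunk.isascii():
            terms.append(chunk.lower())
        elif len(chunk) == 1:
            terms.append(chunk)
        else:
            terms += [chunk[k:k + 2] for k in range(len(chunk) - 1)]
    return terms

def split_terms(data):
    """Emit each four-byte character as its own term; segment the rest as usual."""
    terms, i, run_start = [], 0, 0
    while i < len(data):
        if is_four_byte(data, i):
            terms += conventional_terms(data[run_start:i])
            terms.append(data[i:i + 4].decode("gb18030"))
            i += 4
            run_start = i
        else:
            i += 1 if data[i] <= 0x7F else 2   # skip a one- or two-byte character
    return terms + conventional_terms(data[run_start:])

def build_index(docs):
    """Build keyword -> set of document numbers from GB18030-encodable text."""
    index = defaultdict(set)
    for number, text in docs.items():
        for term in split_terms(text.encode("gb18030")):
            index[term].add(number)
    return index

# U+20000 is encoded with four bytes in GB18030.
index = build_index({1: "北京\U00020000天安门", 2: "北京天安门"})
```

In this sketch the four-byte character splits the surrounding text into separate runs, so document 1 yields no "京天" unit while document 2 does.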

When querying, the query string is preprocessed by segmenting it. The specific method is:

Suppose the query string is T_1, T_2, ..., T_m, and let the segmentation result be the set S = {W_1, W_2, ..., W_k}, where each W_i is a query term.

Read T_i, T_{i+1}, T_{i+2}, T_{i+3} and let W = T_i T_{i+1} T_{i+2} T_{i+3}.

If W is a four-byte character, let S = S ∪ {W} and set i = i + 4.

Otherwise let W = T_i T_{i+1}, process it with the original segmentation method, add the new terms to the set S, and advance i to the corresponding position.

Repeat the above process until all T_i have been processed.

Send W_1 AND W_2 AND ... AND W_k (where AND is the logical AND operator) to the search engine for retrieval; what is returned is the required result.
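The query side can be sketched in the same spirit. For brevity this sketch works on decoded characters and tests whether each one encodes to four bytes in GB18030, which is an equivalent check to reading four bytes from the stream; the bigram fallback and the hand-built `index` are assumptions for illustration only.

```python
def is_four_byte(ch):
    """True if the character occupies four bytes in GB18030."""
    return len(ch.encode("gb18030")) == 4

def bigram_terms(run):
    """Assumed conventional segmenter: overlapping two-character units."""
    if len(run) < 2:
        return [run] if run else []
    return [run[k:k + 2] for k in range(len(run) - 1)]

def query_terms(query):
    """Build the term set S: four-byte characters alone, the rest segmented."""
    terms, run = [], ""
    for ch in query:
        if is_four_byte(ch):
            terms += bigram_terms(run)
            terms.append(ch)
            run = ""
        else:
            run += ch
    return terms + bigram_terms(run)

def search(index, query):
    """Evaluate W_1 AND W_2 AND ... AND W_k against keyword -> document sets."""
    terms = query_terms(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))

# Hand-built index for two hypothetical documents.
index = {
    "北京": {1, 2}, "京天": {2}, "天安": {1, 2}, "安门": {1, 2},
    "\U00020000": {1},
}
hits = search(index, "\U00020000天安门")
```

Because the four-byte character is a term of its own on both sides, the query for it intersects cleanly with the bigram terms around it.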

The invention does not need to adopt four-byte characters as the basic unit of index terms, so it occupies very little index space and does not affect the speed of index building or the speed of retrieval.

Claims

1. A method for full-text search of text containing four-byte characters, comprising the following steps:

(2) If it is a four-byte character, add the single four-byte character to the inverted index as an index term; if it is not a four-byte character, determine keywords by the search engine's conventional segmentation method and add each keyword to the inverted index as an index unit.

(3) When retrieving, first examine the characters in the text stream one by one to judge whether the character to be queried is a four-byte character.

(5) Send the set of query words so obtained to the search engine for the query.

2. The method for full-text search of text containing four-byte characters according to claim 1, characterized in that: if the characters are double-byte characters, the search engine determines keywords by a segmentation method that combines the bigram method with the vocabulary-based method.

3. The method for full-text search of text containing four-byte characters according to claim 1, characterized in that: if the characters are Western single-byte characters, the search engine determines keywords by the natural word separation of Western text.

4. The method for full-text search of text containing four-byte characters according to claim 1, 2 or 3, characterized in that: all the query words obtained are joined by logical AND to form the query condition, which is sent to the search engine.