CN1694092A - Method for global search of text containing four-byte character - Google Patents


Info

Publication number
CN1694092A
Authority
CN
China
Prior art keywords
character
byte
byte character
index
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200510011824
Other languages
Chinese (zh)
Other versions
CN1694092B (en)
Inventor
赵锋
王宏源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wang Fei
Original Assignee
王宏源
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 王宏源
Priority to CN 200510011824
Publication of CN1694092A
Application granted
Publication of CN1694092B
Expired - Fee Related


Abstract

The invention discloses a method for full-text search of text containing four-byte characters. When building the index, the character stream is examined character by character to judge whether the character to be indexed is a four-byte character. If it is, the single four-byte character is added to the inverted index as an index unit; if it is not, keywords are determined by the search engine's conventional word segmentation and added to the inverted index as index units. When searching, the character stream of the query string is likewise examined character by character to judge whether the character to be queried is a four-byte character. If it is, the single four-byte character is used as a query word; if it is not, keywords are determined by the search engine's conventional word segmentation. All query words are then collected and sent to the search engine for the query. The invention does not affect index-building speed or search speed.

Description

Method for full-text search of text containing four-byte characters
Technical field
The invention belongs to the field of computer technology, and in particular relates to a method for full-text search of text containing four-byte characters.
Background art
Every region of the world has its own local language, and these regional differences lead directly to different language environments. In developing an internationalized program, the handling of written language is the most important task. The characters commonly used by computers to process written language are generally double-byte. A double-byte character occupies two bytes (16 bits), called the high byte and the low byte. The Chinese character encoding specified by China is GB2312, which is a double-byte encoding. Nearly all applications that can handle Chinese currently support GB2312. GB2312 contains the level-1 and level-2 Chinese characters and nine zones of symbols. The high byte ranges from 0xa1 to 0xfe and the low byte also ranges from 0xa1 to 0xfe; the Chinese characters are encoded in the range 0xb0a1 to 0xf7fe.
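The byte ranges just given can be checked directly. The following is a minimal sketch, not part of the patent; the function name `is_gb2312_hanzi` is illustrative, and Python's built-in `gb2312` codec is used only to obtain sample bytes.

```python
def is_gb2312_hanzi(pair: bytes) -> bool:
    """Return True if the two bytes fall in the GB2312 Chinese-character range."""
    if len(pair) != 2:
        return False
    high, low = pair
    # Both bytes of a GB2312 double-byte code lie in 0xA1..0xFE.
    if not (0xA1 <= high <= 0xFE and 0xA1 <= low <= 0xFE):
        return False
    # Chinese characters occupy 0xB0A1..0xF7FE, i.e. high bytes 0xB0..0xF7.
    return 0xB0 <= high <= 0xF7


zhong = "中".encode("gb2312")   # a Chinese character
comma = "，".encode("gb2312")   # a full-width symbol, outside the character range

print(is_gb2312_hanzi(zhong), is_gb2312_hanzi(comma))
```

Symbols share the same 0xA1 to 0xFE byte range as Chinese characters, so only the high byte distinguishes the two groups.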
As the number of characters to be handled grows and the scope of use widens, double-byte encoding alone is no longer sufficient for some applications. For example, if the total number of Chinese characters exceeds 20,000, they cannot all be managed with two bytes. Multibyte and wide-character (Multibytes/Wide Char) encodings are therefore also in use. Put simply, a multibyte encoding is an external code, generally of variable length, used mainly for information storage and exchange; a wide-character encoding is an internal code of fixed length, in which one character usually corresponds to four bytes, used mainly for information processing. Common multibyte encodings include UTF-8, the ISO8859 series, GB2312, GBK and EUC-JP.
GB18030 is the latest national standard for the Chinese coded character set and is backward compatible with the GBK and GB2312 standards. GB18030 is a variable-length encoding of one, two or four bytes. The one-byte part, 0x00 to 0x7F, is compatible with ASCII. In the two-byte part, the first byte ranges from 0x81 to 0xFE and the trail byte from 0x40 to 0x7E and from 0x80 to 0xFE, which is largely compatible with the GBK standard. In the four-byte part, the first byte ranges from 0x81 to 0xFE and the second byte from 0x30 to 0x39; the third and fourth bytes have the same ranges as the first and second bytes respectively. The four-byte part covers all Unicode 3.1 code points from 0x0080 onward, except those already covered by the two-byte part. Unicode has one notable property: it is intended to include the characters of all the world's writing systems, so each regional encoding can establish a mapping to Unicode.
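As an illustration of these ranges, the sketch below classifies the character that starts at a given offset in a GB18030 byte string as one, two or four bytes long. The function name `gb18030_char_length` is an assumption made for this example; Python's built-in `gb18030` codec is used only to produce sample input.

```python
def gb18030_char_length(data: bytes, i: int = 0) -> int:
    """Return the byte length (1, 2 or 4) of the GB18030 character at data[i].

    Returns 0 if the bytes at that offset match none of the three patterns.
    """
    b1 = data[i]
    if b1 <= 0x7F:                      # one-byte part, ASCII compatible
        return 1
    if not 0x81 <= b1 <= 0xFE or i + 1 >= len(data):
        return 0
    b2 = data[i + 1]
    if 0x40 <= b2 <= 0xFE and b2 != 0x7F:   # two-byte part
        return 2
    if 0x30 <= b2 <= 0x39 and i + 3 < len(data):
        b3, b4 = data[i + 2], data[i + 3]
        if 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39:   # four-byte part
            return 4
    return 0


for ch in ("A", "中", "\U00020000"):
    print(repr(ch), gb18030_char_length(ch.encode("gb18030")))
```

The second byte is what separates the two longer forms: a value in 0x30 to 0x39 can only begin a four-byte character, because the two-byte trail range starts at 0x40.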
Search engines today generally use an inverted-file index structure and perform indexing and querying based on keywords. An index is generated roughly as follows:
1. Suppose there are two articles, 1 and 2.
The content of article 1 is: Tommy lives in Beijing, I live in Beijing too.
The content of article 2 is: He once lived in Shanghai.
2. First obtain the keywords of the two articles:
(1) What we have is the article content, i.e. a character string, so all the words in the string must first be found, i.e. word segmentation must be performed. English words are separated by spaces and are relatively easy to handle. Chinese words run together and require special segmentation; the two methods usually used are bigram segmentation and dictionary-based segmentation.
(2) Words such as "in", "once" and "too" in the articles usually carry no practical meaning, and Chinese words such as "的" and "是" usually have no specific meaning either. Such words, which do not represent a concept, can be filtered out or kept depending on user requirements;
(3) A user who searches for "He" usually wants articles containing "he" and "HE" to be found as well, so all words need to be converted to a single case;
(4) A user who searches for "live" usually wants articles containing "lives" and "lived" to be found as well, so "lives" and "lived" need to be reduced to "live";
(5) Punctuation marks in the articles are usually filtered out as well.
3. After the processing above,
All keywords of article 1 are: [tommy] [live] [beijing] [i] [live] [beijing]
All keywords of article 2 are: [he] [live] [shanghai]
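The keyword-extraction steps for the two English articles can be sketched as follows. This is an illustration only: the stop-word set and the stem lookup table are toy assumptions sized for the two example sentences, not a real stemmer or the word lists of any particular search engine.

```python
import re

STOP_WORDS = {"in", "once", "too"}            # illustrative stop words
STEMS = {"lives": "live", "lived": "live"}    # toy lookup, not a real stemmer


def extract_keywords(text):
    """Segment, lowercase, drop stop words and punctuation, reduce word forms."""
    words = re.findall(r"[A-Za-z]+", text)    # splits on spaces, drops punctuation
    keywords = []
    for word in words:
        word = word.lower()                   # unify case
        if word in STOP_WORDS:
            continue
        keywords.append(STEMS.get(word, word))
    return keywords


print(extract_keywords("Tommy lives in Beijing, I live in Beijing too."))
print(extract_keywords("He once lived in Shanghai."))
```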
4. Build the inverted index.
An ordinary index maps "article number" to "all keywords in the article". An inverted index reverses this relation, mapping "keyword" to "the numbers of all articles containing that keyword". After inversion, articles 1 and 2 become:
Keyword     Article number
beijing     1
he          2
i           1
live        1,2
shanghai    2
tommy       1
Knowing only which articles a keyword appears in is usually not enough; the number of occurrences and the positions of the keyword in each article are also needed. There are usually two kinds of position: a) character position, recording at which character of the article the word occurs; b) keyword position, recording which keyword of the article the word is.
After adding the "occurrence count" and "positions of occurrence", the index structure becomes:
Keyword     Article number [occurrences]    Positions
beijing     1[2]                            3,6
he          2[1]                            1
i           1[1]                            4
live        1[2],2[1]                       2,5,2
shanghai    2[1]                            3
tommy       1[1]                            1
Take the "live" row as an example of this structure: "live" occurs twice in article 1 and once in article 2, and its positions are "2,5,2". Here "2,5" are its two positions in article 1, and the remaining "2" means that "live" is the 2nd keyword in article 2.
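The structure with occurrence counts and keyword positions can be reproduced with a few lines of code. This is a minimal in-memory sketch; the keyword lists are copied from the example above, and the function name `build_positional_index` is illustrative.

```python
docs = {
    1: ["tommy", "live", "beijing", "i", "live", "beijing"],
    2: ["he", "live", "shanghai"],
}


def build_positional_index(docs):
    """Map keyword -> {article number: [keyword positions, counted from 1]}."""
    index = {}
    for doc_id, keywords in docs.items():
        for pos, kw in enumerate(keywords, start=1):
            index.setdefault(kw, {}).setdefault(doc_id, []).append(pos)
    return dict(sorted(index.items()))        # keywords in character order


index = build_positional_index(docs)
for kw, postings in index.items():
    counts = ",".join(f"{d}[{len(p)}]" for d, p in postings.items())
    flat = ",".join(str(x) for p in postings.values() for x in p)
    print(kw, counts, flat)
```

The occurrence count is simply the length of each position list, so it does not need to be stored separately in this representation.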
Keywords are normally sorted in character order. In an implementation, the three columns above are stored respectively as a term dictionary file (TermDictionary), a frequency file (frequencies) and a position file (positions). The term dictionary stores not only each keyword but also pointers into the frequency file and the position file, through which the frequency and position information of the keyword can be found.
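The three-part layout can be imitated in memory with two flat lists and a dictionary of offsets. The sketch below is an assumption about how such pointers might work, using list offsets in place of file pointers; it does not reproduce the file format of any actual search engine.

```python
postings = {  # keyword -> {article number: [keyword positions]}
    "beijing": {1: [3, 6]}, "he": {2: [1]}, "i": {1: [4]},
    "live": {1: [2, 5], 2: [2]}, "shanghai": {2: [3]}, "tommy": {1: [1]},
}

term_dictionary = {}   # keyword -> (offset into frequencies, offset into positions)
frequencies = []       # flat list of (article number, occurrence count)
positions = []         # flat list of keyword positions

for term in sorted(postings):
    term_dictionary[term] = (len(frequencies), len(positions))
    for doc_id, pos_list in sorted(postings[term].items()):
        frequencies.append((doc_id, len(pos_list)))
        positions.extend(pos_list)


def lookup(term):
    """Follow the dictionary pointers and rebuild {article: positions} for a term."""
    freq_ptr, pos_ptr = term_dictionary[term]
    terms = sorted(term_dictionary)
    nxt = terms.index(term) + 1
    # The entries for this term end where the next term's entries begin.
    freq_end = term_dictionary[terms[nxt]][0] if nxt < len(terms) else len(frequencies)
    result = {}
    for doc_id, count in frequencies[freq_ptr:freq_end]:
        result[doc_id] = positions[pos_ptr:pos_ptr + count]
        pos_ptr += count
    return result


print(term_dictionary["live"], lookup("live"))
```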
For Chinese, full-text indexing must solve the problem of language analysis. Western languages such as English take the word as the basic unit, and words are separated by spaces, which is relatively easy to handle. Asian languages such as Chinese, Japanese and Korean take the character as the basic unit; characters are written one after another with no spaces between words, so the words in a sentence must be segmented.
Single characters (unigrams) are generally not used as index units, to avoid producing too many invalid query results.
Two segmentation methods are commonly used at present: bigram segmentation and dictionary-based segmentation. Consider, for example, segmenting "北京天安门" (Beijing Tiananmen).
Bigram segmentation yields "北京 京天 天安 安门". At query time, whether the query is "北京" or "天安门", the query phrase is segmented by the same rule into "北京", or into "天安" and "安门"; multiple keywords are combined with a logical AND, so the query maps correctly onto the corresponding index entries. This method is equally applicable to other Asian languages such as Korean and Japanese.
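Bigram segmentation and the logical AND at query time can be sketched as below, using the five-character string 北京天安门 (Beijing Tiananmen) as the indexed text. The function names and the single-document index are assumptions made for the illustration.

```python
def bigrams(text):
    """Split text into overlapping two-character units."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def search_and(index, query):
    """Return the articles that contain every bigram of the query."""
    result = None
    for term in bigrams(query):
        docs = index.get(term, set())
        result = set(docs) if result is None else result & docs
    return result or set()


# Index one article (number 1) whose text is the example string.
index = {}
for term in bigrams("北京天安门"):
    index.setdefault(term, set()).add(1)

print(bigrams("北京天安门"))
print(search_and(index, "天安门"))
```

Because the query is cut by the same rule as the indexed text, a three-character query is answered by intersecting the posting sets of its two bigrams.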
With dictionary-based segmentation, depending on the complexity of the syntactic and semantic processing used by the search engine, the result may be "北京 天安门" or "北京天安门", among others.
The language analysis of the larger search engines today generally combines the two mechanisms above.
Existing search engines, constrained by the bigram approach described above, cannot effectively retrieve four-byte characters.
Summary of the invention
The invention overcomes the above deficiency in retrieving four-byte characters and provides a retrieval method that can be combined with double-byte character processing.
Technical content of the invention: a method for full-text search of text containing four-byte characters, comprising the following steps:
(1) when building the index, first examine the character stream character by character to judge whether the character to be indexed is a four-byte character;
(2) if it is a four-byte character, add the single four-byte character to the index table as an index unit and build its inverted index; if it is not a four-byte character, determine keywords by the search engine's conventional word segmentation and add each keyword to the inverted index as an index unit;
(3) when querying, first examine the character stream character by character to judge whether the character to be queried is a four-byte character;
(4) if it is a four-byte character, use the single four-byte character as a query word; if it is not a four-byte character, determine keywords by the search engine's conventional word segmentation and use each resulting keyword as a query word;
(5) send the set of query words obtained above to the search engine for the query.
If the characters are double-byte characters, the search engine may determine keywords by a segmentation method that combines bigram and dictionary-based segmentation.
If the characters are Western single-byte characters, the search engine may determine keywords by the natural word segmentation of Western text.
All the query words obtained may be connected by logical AND to form the query condition, which is sent to the search engine.
Technical effect of the invention: the invention does not need to adopt four bytes as the basic unit of the whole index table, so it occupies very little index space and does not affect the speed of index construction or of retrieval.
Embodiment
The invention modifies the existing indexing and retrieval processes by adding special handling for four-byte characters. The details are as follows:
First, when building the index, suppose an index is to be built for the character stream T_1, T_2, ..., T_n (each T_i is one byte).
Read T_i, T_{i+1}, T_{i+2}, T_{i+3} and let W = T_i T_{i+1} T_{i+2} T_{i+3}.
If W is a four-byte character, add the index term W and its inverted index entry to the index table, and let i = i + 4;
otherwise let W = T_i T_{i+1}, build the index using the original segmentation method, and advance i to the corresponding position.
Repeat this process until all of the text has been processed.
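The indexing loop described above can be sketched as follows, assuming the text is GB18030-encoded. The four-byte test uses the byte ranges given in the background section. The function `conventional_terms` is a hypothetical stand-in for the engine's original segmentation method (plain bigrams here), and all names are illustrative rather than taken from the patent.

```python
def is_four_byte(w):
    """True if w matches the four-byte GB18030 pattern."""
    return (len(w) == 4
            and 0x81 <= w[0] <= 0xFE and 0x30 <= w[1] <= 0x39
            and 0x81 <= w[2] <= 0xFE and 0x30 <= w[3] <= 0x39)


def conventional_terms(run):
    """Hypothetical stand-in for the original segmenter: overlapping bigrams."""
    text = run.decode("gb18030")
    if len(text) < 2:
        return [text] if text else []
    return [text[k:k + 2] for k in range(len(text) - 1)]


def index_terms(stream):
    """Scan the byte stream T_1..T_n and return the list of index terms."""
    terms, run, i = [], bytearray(), 0
    while i < len(stream):
        w = stream[i:i + 4]
        if is_four_byte(w):
            terms.extend(conventional_terms(bytes(run)))  # flush ordinary text
            run.clear()
            terms.append(w.decode("gb18030"))   # single four-byte character
            i += 4
        else:
            # Not a four-byte character: consume one 1-byte or 2-byte character.
            step = 1 if stream[i] <= 0x7F else 2
            run += stream[i:i + step]
            i += step
    terms.extend(conventional_terms(bytes(run)))
    return terms


def add_document(index, doc_id, stream):
    """Add every index term of the document to the inverted index."""
    for term in index_terms(stream):
        index.setdefault(term, set()).add(doc_id)


index = {}
add_document(index, 1, "北京\U00020000天安门".encode("gb18030"))
print(sorted(index))
```

In this sketch the four-byte character U+20000 becomes an index term of its own, while the double-byte text on either side of it is segmented as usual.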
When querying, the query string is preprocessed by segmenting it, as follows:
Suppose the query string is T_1, T_2, ..., T_m, and let the segmentation result be the set S = {W_1, W_2, ..., W_k}, where each W_i is a search term.
Read T_i, T_{i+1}, T_{i+2}, T_{i+3} and let W = T_i T_{i+1} T_{i+2} T_{i+3}.
If W is a four-byte character, let S = S ∪ {W} and let i = i + 4;
otherwise let W = T_i T_{i+1}, process it with the original segmentation method, add the new words to the set S, and advance i to the corresponding position;
Repeat this process until all T_i have been processed;
Send W_1 AND W_2 AND ... AND W_k (where AND is the logical AND operator) to the search engine for retrieval; the result obtained is the required result.
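The query side can be sketched in the same way: the query bytes are split into the set S, and the posting sets of all terms are intersected. As before, GB18030 input is assumed, bigrams stand in for the original segmentation method, and the sample index contents are invented for the illustration.

```python
def split_query(q):
    """Split the query bytes T_1..T_m into the set S of search terms."""
    terms, run, i = set(), bytearray(), 0

    def flush():
        # Hypothetical stand-in for the original segmenter: overlapping bigrams.
        text = run.decode("gb18030")
        if len(text) == 1:
            terms.add(text)
        for k in range(len(text) - 1):
            terms.add(text[k:k + 2])
        run.clear()

    while i < len(q):
        w = q[i:i + 4]
        if (len(w) == 4 and 0x81 <= w[0] <= 0xFE and 0x30 <= w[1] <= 0x39
                and 0x81 <= w[2] <= 0xFE and 0x30 <= w[3] <= 0x39):
            flush()
            terms.add(w.decode("gb18030"))      # S = S ∪ {W}
            i += 4
        else:
            step = 1 if q[i] <= 0x7F else 2     # one 1-byte or 2-byte character
            run += q[i:i + step]
            i += step
    flush()
    return terms


def search(index, q):
    """Evaluate W_1 AND W_2 AND ... AND W_k against the inverted index."""
    result = None
    for term in split_query(q):
        docs = index.get(term, set())
        result = set(docs) if result is None else result & docs
    return result or set()


index = {"北京": {1, 2}, "\U00020000": {1}, "天安": {1, 3}, "安门": {1, 3}}
print(search(index, "北京\U00020000".encode("gb18030")))
```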
The invention does not need to adopt four bytes as the basic unit of index terms, so it occupies very little index space and does not affect the speed of index construction or of retrieval.

Claims (4)

1. A method for full-text search of text containing four-byte characters, comprising the steps of:
(1) when building the index, first examining the character stream character by character to judge whether the character to be indexed is a four-byte character;
(2) if it is a four-byte character, adding the single four-byte character to the inverted index as an index term; if it is not a four-byte character, determining keywords by the search engine's conventional word segmentation and adding each keyword to the inverted index as an index unit;
(3) when retrieving, first examining the character stream character by character to judge whether the character to be queried is a four-byte character;
(4) if it is a four-byte character, using the single four-byte character as a query word; if it is not a four-byte character, determining keywords by the search engine's conventional word segmentation and using each resulting keyword as a query word;
(5) sending the set of query words obtained above to the search engine for the query.
2. The method for full-text search of text containing four-byte characters as claimed in claim 1, characterized in that: if the characters are double-byte characters, the search engine determines keywords by a segmentation method that combines bigram and dictionary-based segmentation.
3. The method for full-text search of text containing four-byte characters as claimed in claim 1, characterized in that: if the characters are Western single-byte characters, the search engine determines keywords by the natural word segmentation of Western text.
4. The method for full-text search of text containing four-byte characters as claimed in claim 1, 2 or 3, characterized in that: all the query words obtained are connected by logical AND to form the query condition, which is sent to the search engine.
CN 200510011824 2005-05-31 2005-05-31 Method for global search of text containing four-byte character Expired - Fee Related CN1694092B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510011824 CN1694092B (en) 2005-05-31 2005-05-31 Method for global search of text containing four-byte character

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510011824 CN1694092B (en) 2005-05-31 2005-05-31 Method for global search of text containing four-byte character

Publications (2)

Publication Number Publication Date
CN1694092A true CN1694092A (en) 2005-11-09
CN1694092B CN1694092B (en) 2012-10-03

Family

ID=35353055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510011824 Expired - Fee Related CN1694092B (en) 2005-05-31 2005-05-31 Method for global search of text containing four-byte character

Country Status (1)

Country Link
CN (1) CN1694092B (en)

Also Published As

Publication number Publication date
CN1694092B (en) 2012-10-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: WANG FEI

Free format text: FORMER OWNER: WANG HONGYUAN

Effective date: 20090410

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20090410

Address after: Beijing City, Chaoyang District Street heading for the small village compound No. 12 room 901 post encoding: 100020

Applicant after: Wang Fei

Address before: Beijing City, Chaoyang District Street heading for the small village compound No. 12 room 901 post encoding: 100020

Applicant before: Wang Hongyuan

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121003

Termination date: 20200531
