CN103116607A

CN103116607A - Full-text retrieval method based on pinyin

Info

Publication number: CN103116607A
Application number: CN2013100181055A
Authority: CN
Inventors: 巩微; 银国辉; 梁小文
Original assignee: Communication University of China
Current assignee: Communication University of China
Priority date: 2013-01-18
Filing date: 2013-01-18
Publication date: 2013-05-22
Anticipated expiration: 2033-01-18
Also published as: CN103116607B

Abstract

The invention discloses a novel full-text retrieval method based on pinyin and belongs to the field of full-text retrieval. The difficulty of Chinese full-text retrieval lies in the complexity of grammar and semantics. Traditional full-text retrieval methods handle this difficulty by taking the word as the unit, so research on Chinese retrieval ultimately turns into research on words. Addressing the defects of existing Chinese full-text retrieval methods, the pinyin-based method not only avoids Chinese word segmentation but also improves retrieval speed compared with existing methods, thereby improving user experience. The method further exploits the tone rules of pinyin: tone filtering compresses the retrieval space and reduces system complexity, which indirectly improves the execution efficiency of the system.

Description

A full-text retrieval method based on pinyin

Technical field

The present invention relates to full-text retrieval technology based on pinyin.

Background technology

Current full-text retrieval first segments the document into Chinese words and then builds an index from the segmentation result. This model depends mainly on the Chinese word segmentation method, and the efficiency and accuracy of segmentation directly affect the performance of full-text retrieval. Analyzing and processing words is the primary task of Chinese information processing: the basic unit of a sentence is the word, so processing a sentence begins with dividing it into words, and Chinese word formation differs from that of foreign languages. In languages such as English, German, and French, words are separated by spaces, whereas in Chinese there is no separator between characters or between words. Processing a sentence therefore requires deciding how to split it into words; the same Chinese character can form a word with the preceding character or with the following one, which makes segmentation difficult. The basic vocabulary of any language is not fixed. In foreign languages, words themselves are easy to distinguish, and newly added words are easy to identify by various markers; the nature of Chinese characters makes this hard to achieve. The complexity of Chinese is determined by its own characteristics, which makes semantic analysis by computer difficult to realize.

Summary of the invention

The technical problem to be solved by the present invention is to provide a new full-text retrieval technology based on pinyin that avoids Chinese word segmentation and uses pinyin to perform full-text retrieval, greatly accelerating the retrieval speed of the system and improving user experience.

I. Pinyin-based retrieval process for a single document:

<1> User input

Receive the retrieval content entered by the user, such as keywords, sentences, or paragraphs, for example "Communication University of China" or "Communication University of China scholarly education".

<2> Preprocessing of the user input

Preprocessing comprises keyword extraction and the conversion of Chinese characters to pinyin.

A) Keyword extraction

This step covers extracting keywords from keyword, sentence, and paragraph input. If the user input consists of keywords, they are extracted by splitting on separators such as spaces or semicolons. If the user input is a sentence or a paragraph, it is preprocessed by the same rules that are used to build the index.

B) Conversion of Chinese characters to pinyin

The extracted keywords are stored in the keyword retrieval set. Each keyword is converted from Chinese characters to pinyin, so that a pinyin retrieval set with the same dimension as the keyword retrieval set is finally built.
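The preprocessing described above can be illustrated with a minimal Python sketch. The names `PINYIN_TABLE`, `extract_keywords`, `to_pinyin`, and `preprocess` are hypothetical, the six-character table stands in for a full character-to-pinyin table, and splitting on whitespace and semicolons is one reading of "separators such as spaces or semicolons".

```python
import re

# Hypothetical, deliberately tiny character-to-pinyin table (toneless pinyin).
PINYIN_TABLE = {
    "中": "zhong", "国": "guo", "传": "chuan",
    "媒": "mei", "大": "da", "学": "xue",
}


def extract_keywords(query):
    """Split keyword input on whitespace and ASCII or full-width semicolons."""
    return [k for k in re.split(r"[\s;；]+", query) if k]


def to_pinyin(keyword):
    """Convert one keyword to a list of toneless pinyin syllables.

    Characters missing from the table are skipped in this sketch.
    """
    return [PINYIN_TABLE[ch] for ch in keyword if ch in PINYIN_TABLE]


def preprocess(query):
    """Return the keyword set and a pinyin set of the same dimension."""
    keywords = extract_keywords(query)
    pinyin_terms = [to_pinyin(k) for k in keywords]
    return keywords, pinyin_terms
```

Calling `preprocess("中国 传媒;大学")` yields three keywords and three pinyin sequences, one per keyword, which mirrors the "same dimension" requirement.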

<3> Determining whether the keywords exist

The index file is searched in turn for each keyword in the pinyin retrieval set. If a keyword is found not to exist during the query, the whole query is terminated and its result is judged "no match". In other words, the existence results of all keywords in the set are intersected: if any keyword does not exist, the whole retrieval result is "no match"; otherwise it is marked "match".

The retrieval process for determining whether the keywords match is as follows:

A) Read the file header of the document into memory.

B) Based on the file header information, extract the blocks of the index file and hand each block to a different "processor".

C) Each "processor" matches the index summary in its block information against the keyword pinyin sequence sent by the control center. If any element of the pinyin sequence does not exist in the index summary, the "processor" returns "no match" to the control center. Otherwise the "processor" retrieves, in the order of the pinyin sequence, the detailed position information that corresponds to each summary entry, and finally checks that these positions correspond one to one with the positional relationship of the pinyin sequence. If the match succeeds, it returns the match information to the control center; otherwise it returns a match failure.
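Steps A) to C) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: each block index is modelled as a dictionary from a pinyin term to its ascending position list, the "positional relationship" is assumed to mean consecutive positions, and a `ThreadPoolExecutor` stands in for the "processors". The names `match_block` and `search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def match_block(block_index, pinyin_seq):
    """Return the start positions in one block where pinyin_seq occurs."""
    if not pinyin_seq:
        return []
    # Summary check: every syllable must appear in the block at all.
    if any(term not in block_index for term in pinyin_seq):
        return []
    # Position check (assumption): syllable i must sit at start + i.
    later = [set(block_index[term]) for term in pinyin_seq[1:]]
    return [
        start
        for start in block_index[pinyin_seq[0]]
        if all(start + i + 1 in positions for i, positions in enumerate(later))
    ]


def search(blocks, pinyin_seq):
    """Hand each block to a worker and collect the blocks that match."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda b: match_block(b, pinyin_seq), blocks))
    return {i: hits for i, hits in enumerate(results) if hits}
```

An empty result from `search` corresponds to "no match"; a non-empty one maps each matching block number to the positions where the sequence starts.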

II. Structure of the index file for a single document:

Full-text retrieval based on Chinese pinyin requires the computer to generate an index file automatically for each document; the physical size of this file is less than or equal to that of the original document. The role of this index file is as follows:

(1) It is physically independent of the original document but logically associated with it. The original document and the index file are separate and may be stored on different storage media, yet they are associated with each other and appear as a single unit to the search engine.

(2) The index file is a data structure defined together with the retrieval algorithm, under the premise that the important information of the document is not changed. To reduce the file size and improve retrieval speed, the file is stored in binary form.

The index file consists of a file header (total number of blocks, block entry address, block length) and a file body (pinyin term summary entry address, pinyin term summary length, and the physical length list corresponding to each pinyin term).
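As a rough illustration of the binary file header, the sketch below packs and unpacks the three header fields with Python's `struct` module. The field widths (a 2-byte block count, then a 4-byte entry address and a 4-byte length per block), the little-endian byte order, and the names `pack_header` and `unpack_header` are all assumptions; the source does not specify the header encoding.

```python
import struct


def pack_header(blocks):
    """Pack a header from a list of (entry_address, length) pairs."""
    header = struct.pack("<H", len(blocks))  # total number of blocks
    for address, length in blocks:
        header += struct.pack("<II", address, length)
    return header


def unpack_header(data):
    """Read the (entry_address, length) pairs back from a packed header."""
    (count,) = struct.unpack_from("<H", data, 0)
    blocks = []
    offset = 2
    for _ in range(count):
        address, length = struct.unpack_from("<II", data, offset)
        blocks.append((address, length))
        offset += 8
    return blocks
```

Reading only this header is enough to locate every block, which is what allows the blocks to be handed to separate "processors".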

The beneficial effects of the invention are as follows. Compared with existing Chinese full-text retrieval methods, the pinyin-based full-text retrieval technology proposed by the present invention avoids segmenting Chinese vocabulary and uses the "logic" of natural language itself to judge the relevance and semantic validity between characters and between words. At the same time, an index ordered by pinyin lends itself to binary search and other fast lookup algorithms, which makes it quick to locate elements, greatly accelerates the retrieval speed of the system, and improves user experience. The pinyin-based index also exploits the tone rules of pinyin: filtering out tones compresses the index space and reduces system complexity, which indirectly improves the execution efficiency of the system.

Description of drawings

Fig. 1: Logic diagram of the conversion from Chinese characters to pinyin.

Fig. 2: Structure of the index file.

Fig. 3: Processing of the "block information".

Embodiment

The pinyin-based full-text retrieval scheme proposed by the present invention is further described below with reference to the accompanying drawings and an embodiment, so that its features and advantages are explained more clearly.

Principle of the full-text retrieval based on Chinese pinyin of the present invention:

I. Building the index

(1) Document segmentation

After the computer has read a given document through the I/O port, the document must be segmented. The segmentation principles are as follows:

1. The block size does not exceed 64 KB

The index file is stored in binary form. To compress the index file as much as possible, every pinyin term and every length is represented with 2 bytes. A uniform byte count makes encoding, decoding, and pointer movement fast. Too few bytes would force frequent blocking, while too many would leave many bytes empty and waste space. The block size is computed as follows:

2¹⁶ = 65536. One byte is 8 bits and two bytes are 16 bits, so two bytes can represent 65536 values, that is, address 65536 B. With 1 KB = 1024 B, 65536 B / 1024 = 64 KB.

2. Segmentation must not break the integrity of a sentence

Because full-text retrieval based on pinyin relies on the natural distance between Chinese characters, dividing the document into blocks must not break a sentence, otherwise errors result. The correct procedure is to set the pointer's step to 64 KB when cutting the document; the pointer then moves forward, continuously checking for sentence separators such as "，", "。" and "？", until it finds a separator that carries no substantive meaning. The content between the start pointer and the resulting end pointer becomes a new "block" that waits for the next step. (In normal Chinese usage the end pointer generally moves no more than 200 bytes: a Chinese character occupies two bytes, so 100 characters occupy 200 bytes, and normal Chinese writing is certain to contain a separator within 100 characters.) If the pointer has to move further than this length, it reverts to the end pointer, that is, the start pointer plus 64 KB. Take the text "... this is Communication University of China, welcome to visit and study here ..." (the university's name is 中国传媒大学). The correct split is at the separator "，": the text before the comma is one "block" and the text after it is another. Any other split breaks the sentence. Because positions are counted separately in each block, splitting the sentence inside the university's name as "... 中国传" + "媒大学，..." would put "传" and "媒大学" in different "blocks" with unrelated physical positions, so words such as "传媒" (media) and "传媒大学" would be destroyed.
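The segmentation rule can be sketched as below, with several simplifications that are assumptions rather than part of the source: lengths are counted in characters instead of bytes (32768 characters at two bytes each is 64 KB, and 100 characters is 200 bytes), the separator set is illustrative, and the pointer is read as scanning back from the 64 KB mark toward the start of the block so that no block exceeds the limit. The name `split_blocks` is hypothetical.

```python
# Illustrative separator set; the source lists "，", "。", "？" and "etc."
SEPARATORS = "，。？！；,.?!;"


def split_blocks(text, max_chars=32768, max_scan=100):
    """Split text into blocks of at most max_chars, cutting after a separator.

    If no separator lies within max_scan characters of the limit, the block
    is cut at the limit itself (the fallback described in the text).
    """
    blocks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            lowest = max(start, end - max_scan)
            # Look back from the limit for the nearest separator.
            for i in range(end - 1, lowest - 1, -1):
                if text[i] in SEPARATORS:
                    end = i + 1  # keep the separator with the preceding block
                    break
        blocks.append(text[start:end])
        start = end
    return blocks
```

With a small limit, `split_blocks("这里是中国传媒大学，欢迎来这里参观学习。", max_chars=12, max_scan=6)` cuts after the comma, so "传媒大学" stays inside one block.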

(2) Conversion of Chinese characters to pinyin

Filter condition: the stop-word list (a table of characters that carry no substantive meaning).

The whole text is traversed once. Every Chinese character that does not meet the filter condition (that is, is not in the stop-word list) is "translated" into toneless pinyin, and the pinyin term and its physical position value are recorded. For example, the characters 中国传媒大学计算机 (from the computer school of Communication University of China) are converted to: zhong:1; guo:2; chuan:3; mei:4; da:5; xue:6; ji:7; suan:8; ji:9; Chinese characters or other characters that meet the filter condition are filtered out directly and are not converted.

The detailed conversion process is as follows:

1. Move the pointer and read a character.

2. Check the filter rule.

3. If the current character falls within the filter range, jump directly to step 1.

4. Obtain the hexadecimal value of the character and look up its pinyin by binary search in the Chinese character encoding table.

5. Query the stack to determine whether this pinyin already exists. If it does, find the entry and append the position to its queue; if it does not, push the pinyin onto the stack.

6. Repeat steps 1 to 5.
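Steps 1 to 4 can be sketched in Python as follows. The encoding table here is a hypothetical nine-entry list keyed by Unicode code point (the source does not say which character encoding its table uses), the stop set is illustrative, and the names `lookup` and `build_raw_index` are assumptions. The sketch emits one entry per character and leaves the grouping of step 5 to the merge stage.

```python
from bisect import bisect_left

# Hypothetical encoding table: (code point, toneless pinyin), sorted by code.
CODE_TABLE = sorted(
    (ord(ch), py)
    for ch, py in [
        ("中", "zhong"), ("国", "guo"), ("传", "chuan"), ("媒", "mei"),
        ("大", "da"), ("学", "xue"), ("计", "ji"), ("算", "suan"), ("机", "ji"),
    ]
)
CODES = [code for code, _ in CODE_TABLE]

# Illustrative filter condition (stop characters and punctuation).
STOP_CHARS = set("的了，。")


def lookup(ch):
    """Binary-search the encoding table; return the pinyin or None."""
    i = bisect_left(CODES, ord(ch))
    if i < len(CODES) and CODES[i] == ord(ch):
        return CODE_TABLE[i][1]
    return None


def build_raw_index(text):
    """Return (pinyin, position) pairs; positions are 1-based character offsets."""
    entries = []
    for position, ch in enumerate(text, start=1):
        if ch in STOP_CHARS:
            continue  # filtered characters are not converted
        pinyin = lookup(ch)
        if pinyin is not None:
            entries.append((pinyin, position))
    return entries
```

Filtered characters still advance the position counter, so the recorded values remain physical positions in the original text.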

(3) Merging pinyin terms

This step merges identical pinyin terms and stores their physical position values in ascending order. For example, the original index da:1; wu:27; li:34; da:3; li:41; becomes, after merging, da:1,3; wu:27; li:34,41;
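A minimal sketch of the merge, assuming the raw index is a list of (pinyin, position) pairs; the function name `merge_terms` is hypothetical.

```python
def merge_terms(entries):
    """Group positions by pinyin term and sort each position list ascending."""
    merged = {}
    for term, position in entries:
        merged.setdefault(term, []).append(position)
    for positions in merged.values():
        positions.sort()
    return merged
```

Applied to the example above, `merge_terms([("da", 1), ("wu", 27), ("li", 34), ("da", 3), ("li", 41)])` groups the two occurrences of "da" and of "li" under a single term each.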

(4) Sorting index entries

This step sorts the entries by pinyin term. After sorting, the result above becomes: da:1,3; li:34,41; wu:27;

(5) Computing the length of each entry and outputting the index file
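Sorting and length calculation can be sketched together, along with the binary search that a sorted index makes possible. Treating the "length" of an entry as the number of positions it holds is an assumption, and `sort_index` and `find_term` are hypothetical names.

```python
from bisect import bisect_left


def sort_index(merged):
    """Return (term, positions, length) tuples sorted by pinyin term."""
    return [(term, merged[term], len(merged[term])) for term in sorted(merged)]


def find_term(sorted_index, term):
    """Binary-search the sorted index; return the position list or None."""
    terms = [t for t, _, _ in sorted_index]
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return sorted_index[i][1]
    return None
```

Because the terms are in alphabetical order, a lookup needs only a logarithmic number of comparisons instead of a scan over every term.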

The above description of the embodiment is relatively specific, but it should not be understood as limiting the invention to this embodiment; the scope of protection of this patent is defined by the claims.

Claims

1. A new full-text retrieval method based on pinyin, characterized in that Chinese characters are converted into pinyin, thereby avoiding Chinese word segmentation.

Pinyin is used to perform full-text retrieval, which greatly accelerates the retrieval speed of the system and improves user experience.

2. According to the method of claim 1, the new pinyin-based full-text retrieval method proposed by the present invention is described as follows: