CN103116607A - Full-text retrieval method based on pinyin - Google Patents

Publication number
CN103116607A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100181055A
Other languages
Chinese (zh)
Other versions
CN103116607B (en)
Inventor
巩微
银国辉
梁小文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China
Priority to CN201310018105.5A
Publication of CN103116607A
Application granted
Publication of CN103116607B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a novel full-text retrieval method based on pinyin and belongs to the field of full-text retrieval. The difficulty of Chinese full-text retrieval lies in the complexity of Chinese grammar and semantics. Traditional full-text retrieval methods handle this difficulty by taking the word as the basic unit, so research on Chinese retrieval eventually turns into research on word segmentation. To address the shortcomings of existing Chinese full-text retrieval methods, the pinyin-based method avoids Chinese word segmentation altogether and retrieves faster than existing methods, which in turn improves the user experience. The method also exploits the tonal regularity of pinyin: filtering out tones compresses the retrieval space and reduces system complexity, which indirectly improves execution efficiency.

Description

A full-text retrieval method based on pinyin
Technical field
The present invention relates to full-text retrieval technology based on pinyin.
Background
Current full-text retrieval first performs Chinese word segmentation on the document and then builds an index from the segmented words. This approach depends mainly on the word segmentation method, and the efficiency and accuracy of segmentation directly affect full-text retrieval performance. Analyzing and processing words is the first task in Chinese information processing: the basic unit of a sentence is the word, so processing a sentence starts with dividing it into words, and Chinese word formation differs from that of foreign languages. In languages such as English, German, and French, words are separated by spaces, whereas Chinese has no separator between characters or between words. Processing a sentence therefore requires deciding how to divide it into words, and the same Chinese character can form a word with the preceding character or with the following one, which makes segmentation difficult. The basic vocabulary of any language changes over time. In foreign languages, words are easy to distinguish and newly added words can be identified from various markers, but the characteristics of Chinese characters make this hard to do. The complexity of Chinese follows from its own characteristics, which makes semantic analysis by computer difficult to achieve.
Summary of the invention
The technical problem to be solved by the present invention is to provide a new full-text retrieval technique based on pinyin that avoids Chinese word segmentation and performs full-text retrieval using pinyin, which greatly speeds up retrieval and improves the user experience.
I. Retrieval process for a single document based on pinyin:
<1> User input
The system receives the search content entered by the user, such as keywords, sentences, or paragraphs, for example "Communication University of China" or "Communication University of China's scholarly education".
<2> Preprocessing of the user input
Preprocessing comprises keyword extraction and conversion of Chinese characters to pinyin.
A) Keyword extraction
This step extracts keywords from keyword, sentence, or paragraph input. If the user entered keywords, they are split on separators such as spaces or semicolons. If the user entered a sentence or paragraph, it is preprocessed by the same rules used to build the index.
B) Conversion of Chinese characters to pinyin
The extracted keywords are stored in a search-term set, and each keyword is converted from Chinese characters to pinyin. The result is a pinyin search-term set with the same dimension as the keyword search-term set.
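The preprocessing step can be illustrated with a minimal Python sketch. It is not the patent's implementation: the names `PINYIN`, `extract_keywords`, `to_pinyin_terms`, and `preprocess` are hypothetical, the character table holds only a handful of sample characters, and the separator set (ASCII and full-width spaces and semicolons) is an assumption.

```python
import re

# Hypothetical sample table; a real system would load a complete
# character-to-pinyin table (toneless pinyin, as in the description).
PINYIN = {"中": "zhong", "国": "guo", "传": "chuan",
          "媒": "mei", "大": "da", "学": "xue"}

def extract_keywords(query):
    """Split keyword input on spaces and semicolons (ASCII or full-width)."""
    return [k for k in re.split(r"[ \u3000;；]+", query) if k]

def to_pinyin_terms(keyword):
    """Convert each known character of one keyword to toneless pinyin."""
    return [PINYIN[ch] for ch in keyword if ch in PINYIN]

def preprocess(query):
    """Return the keyword set and a pinyin term set of the same dimension."""
    keywords = extract_keywords(query)
    return keywords, [to_pinyin_terms(k) for k in keywords]
```

Each keyword maps to one pinyin sequence, so the two returned lists always have the same length, which mirrors the "same dimension" requirement in step B.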
<3> Determining whether the keywords exist
The index file is searched for each term in the pinyin search-term set in turn. If any term is not found during the query, the whole query stops and its result is "no match". In other words, the result is the intersection of the existence checks for all keywords in the set: if any keyword is absent, the overall result is "no match"; otherwise it is "match".
The procedure for determining whether a keyword matches is as follows:
A) Read the document's file header into memory.
B) Using the file header information, extract the blocks of the index file and hand each block to a separate "processor".
C) Each "processor" matches the index summary in its block information against the keyword pinyin sequence sent by the control center. If any element of the pinyin sequence is absent from the index summary, the "processor" returns "no match" to the control center. Otherwise, the "processor" retrieves, for each element of the pinyin sequence in order, the detailed position information referenced by the summary, and finally checks that the positions correspond one to one with the order of the pinyin sequence. If they do, it returns the match information to the control center; otherwise it returns a match failure.
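The matching logic of steps A) to C) can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: each block's index is modeled as a Python dict from pinyin term to an ascending list of positions, the "one to one" position check is interpreted as requiring consecutive positions, and the function names `match_block` and `match_document` are hypothetical. Parallel dispatch to separate "processors" is omitted.

```python
def match_block(index, pinyin_seq):
    """Return True if the pinyin sequence occurs at consecutive positions."""
    if not pinyin_seq:
        return False
    # Summary check: every term must appear in this block's index at all.
    if any(term not in index for term in pinyin_seq):
        return False
    # Detail check: some start position p must be followed by p+1, p+2, ...
    later = [set(index[term]) for term in pinyin_seq[1:]]
    return any(
        all(start + offset + 1 in positions
            for offset, positions in enumerate(later))
        for start in index[pinyin_seq[0]]
    )

def match_document(blocks, keyword_seqs):
    """A document matches only if every keyword matches in some block."""
    return all(
        any(match_block(block, seq) for block in blocks)
        for seq in keyword_seqs
    )
```

The early exit in the summary check corresponds to the "no match" reply sent to the control center before any position lists are read.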
II. Structure of the index file for a single document:
Pinyin-based full-text retrieval requires the computer to automatically generate an index file for each document; the physical size of this file is less than or equal to that of the original document. The index file has the following properties:
(1) It is physically independent of the original document but logically associated with it. The original document and the index file are separate and can be kept on different storage media, yet they are associated with each other and appear as a single unit to the search engine.
(2) The index file is a data structure defined together with the retrieval algorithm, without changing the important information of the document. To reduce file size and speed up retrieval, the file is stored in binary form.
The index file consists of a file header (total number of blocks, block entry addresses, block lengths) and a file body (pinyin-term summary entry address, pinyin-term summary length, and the corresponding physical length list for each pinyin term).
The beneficial effects of the invention are as follows. Compared with existing Chinese full-text retrieval methods, the pinyin-based full-text retrieval technique proposed here avoids segmenting Chinese vocabulary and instead uses the "logic" of natural language itself to judge the association between words and their semantic validity. In addition, an index ordered by pinyin lends itself to binary search and other fast lookup algorithms, which locate elements quickly, greatly speed up retrieval, and improve the user experience. The pinyin-based index also exploits the tonal regularity of pinyin: filtering out tones compresses the index space, reduces system complexity, and indirectly improves execution efficiency.
Description of drawings
Fig. 1: Logic diagram of Chinese character to pinyin conversion.
Fig. 2: Index file structure.
Fig. 3: Processing of "block information".
Embodiment
The pinyin-based full-text retrieval scheme proposed by the present invention is further described below with reference to the accompanying drawings and an embodiment, so that its features and advantages are set out more clearly.
The full-text retrieval principle of the present invention, based on pinyin, is as follows:
I. Building the index
(1) Document segmentation
After the computer reads a document through the I/O port, the document must be divided into blocks. The segmentation principles are as follows:
1. Block size does not exceed 64 KB
The index file is stored in binary form. To compress the index file as much as possible, each pinyin-term item and each length is represented with 2 bytes. A uniform byte count simplifies encoding, decoding, and fast pointer movement. Too few bytes would force frequent block splitting, while too many would leave many bytes empty and waste space. The block size is calculated as follows:
2^16 = 65536. One byte is 8 bits and two bytes are 16 bits, so two bytes can represent 65536 B; 1 KB = 1024 B, so 65536 B / 1024 = 64 KB.
2. Segmentation must not break a sentence
Because pinyin-based full-text retrieval relies on the natural distance between Chinese characters, dividing the document into blocks must not break a sentence, or errors will result. The correct procedure is to set the pointer's step to 64 KB when cutting the document and then move the pointer forward, checking continuously for a sentence separator such as "，", "。", or "？". When such a separator (a punctuation mark with no meaning of its own) is detected, the content between the start pointer and the computed end pointer becomes a new "block" and waits for the next step. Under normal Chinese usage the end pointer generally moves no more than 200 bytes: a Chinese character occupies two bytes, so 100 characters occupy 200 bytes, and a correctly written Chinese sentence will contain a separator within 100 characters. If the pointer would have to move further than this, the end pointer reverts to the start pointer plus 64 KB. Take the sentence "…here is Communication University of China (中国传媒大学), welcome to visit and study here…". The correct split is at the separator "，": the text before the comma is one block and the text after the comma is another. Any other split would break the sentence. Because positions are counted separately within each block, if the sentence were split as "…here is 中国传" + "媒大学, welcome to visit and study here…", then "传" and "媒大学" would belong to different blocks and their physical positions would be unrelated, so words such as "传媒" (media) and "传媒大学" (media university) would be destroyed.
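The two segmentation principles can be sketched together in Python. Several things here are assumptions rather than statements from the patent: the function name `split_blocks` is hypothetical, the limits are counted in characters instead of bytes (32768 characters corresponds to 64 KB only for a fixed two-byte encoding), and the scan runs backward from the limit so that no block exceeds it, since the translated text leaves the scan direction unclear. The sample sentence in the usage note is illustrative wording.

```python
MAX_BLOCK_BYTES = 2 ** 16                # 65536 B = 64 KB, the 2-byte limit
MAX_BLOCK_CHARS = MAX_BLOCK_BYTES // 2   # assuming 2 bytes per character
SEPARATORS = set("，。？！；,.?!;")

def split_blocks(text, max_chars=MAX_BLOCK_CHARS, max_scan=100):
    """Cut text into blocks that end on a sentence separator when possible."""
    blocks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look back at most max_scan characters for a separator.
            lowest = max(start + 1, end - max_scan)
            cut = end
            while cut > lowest and text[cut - 1] not in SEPARATORS:
                cut -= 1
            if text[cut - 1] in SEPARATORS:
                end = cut  # block ends right after the separator
            # Otherwise fall back to the hard limit (start + max_chars).
        blocks.append(text[start:end])
        start = end
    return blocks
```

For example, `split_blocks("这里是中国传媒大学，欢迎大家来这里参观学习。", max_chars=12, max_scan=5)` cuts after the comma instead of in the middle of the second clause, so "传媒大学" stays inside one block.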
(2) Conversion of Chinese characters to pinyin
Filter condition: the stop-word list (a table of words that carry no substantive meaning).
The whole text is traversed once. Each Chinese character that does not meet the filter condition (that is, is not in the stop-word list) is "translated" into toneless pinyin, and the pinyin term and its physical position value are recorded. For example, the name of the computer school of Communication University of China becomes, after conversion: zhong:1; guo:2; chuan:3; mei:4; da:5; xue:6; ji:7; suan:8; ji:9; Chinese characters or other characters that meet the filter condition are filtered out directly, that is, they are not converted.
The detailed conversion process is as follows:
1. Move the pointer and read a character.
2. Check the filter rules.
3. If the current character falls within the filter range, jump directly to step 1.
4. Obtain the hexadecimal value of the character and look up its pinyin by binary search in the Chinese character encoding table.
5. Query the stack to determine whether this pinyin already exists. If it does, locate it and add the entry to its queue; if it does not, push it onto the stack.
6. Repeat steps 1 to 5.
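Steps 1 to 4 and 6 can be sketched in Python as below. The sketch makes several assumptions: the lookup table is a tiny hypothetical sample keyed by Unicode code point (the patent speaks only of the character's hexadecimal value and an encoding table), `STOP_CHARS` is an invented stop list, positions are 1-based and keep counting across filtered characters, and the per-term grouping of step 5 is left out for brevity.

```python
from bisect import bisect_left

# Hypothetical sample data; a real table would cover the full character set.
_SAMPLE = {"中": "zhong", "国": "guo", "传": "chuan", "媒": "mei", "大": "da",
           "学": "xue", "计": "ji", "算": "suan", "机": "ji"}
TABLE = sorted((ord(ch), py) for ch, py in _SAMPLE.items())
CODES = [code for code, _ in TABLE]
STOP_CHARS = {"的", "了"}

def lookup_pinyin(ch):
    """Binary-search the sorted code table; return None if ch is unknown."""
    i = bisect_left(CODES, ord(ch))
    if i < len(CODES) and CODES[i] == ord(ch):
        return TABLE[i][1]
    return None

def convert(text):
    """Return (pinyin, position) records for every unfiltered character."""
    records = []
    for position, ch in enumerate(text, start=1):
        if ch in STOP_CHARS:          # step 3: filtered, move on
            continue
        pinyin = lookup_pinyin(ch)    # step 4: binary search in the table
        if pinyin is not None:
            records.append((pinyin, position))
    return records
```

Calling `convert("中国传媒大学计算机")` yields the records zhong:1 through ji:9 in the same order as the example in the text.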
(3) Merging pinyin terms
This step merges identical pinyin terms and stores the physical position values in ascending order. For example, the primary index da:1; wu:27; li:34; da:3; li:41; becomes da:1,3; wu:27; li:34,41; after merging.
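The merge step is small enough to show directly. The sketch below assumes the primary index is a list of (term, position) pairs; the function name `merge_terms` is hypothetical.

```python
def merge_terms(records):
    """Group positions by pinyin term and sort each group ascending."""
    merged = {}
    for term, position in records:
        merged.setdefault(term, []).append(position)
    for positions in merged.values():
        positions.sort()
    return merged
```

Applied to the example above, `merge_terms([("da", 1), ("wu", 27), ("li", 34), ("da", 3), ("li", 41)])` gives da:1,3; wu:27; li:34,41, with terms still in first-seen order because sorting by term is a separate step.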
(4) Sorting index entries
This step sorts the entries by pinyin term. Sorting the result above gives: da:1,3; li:34,41; wu:27;
(5) Calculate the length of each entry and output the index file.
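Sorting the entries is what makes binary search over pinyin terms possible at query time. The following sketch shows sorting, a per-entry length calculation, and a lookup. It assumes that each stored position takes 2 bytes, in line with the 2-byte fields described for the index file; the names `sort_entries`, `entry_lengths`, and `find_term` are hypothetical, and writing the binary file itself is not shown.

```python
from bisect import bisect_left

POSITION_BYTES = 2  # assumption: every stored position uses 2 bytes

def sort_entries(merged):
    """Return (term, positions) pairs ordered by pinyin term."""
    return sorted(merged.items())

def entry_lengths(entries):
    """Byte length of each entry's position list under the 2-byte assumption."""
    return [(term, len(positions) * POSITION_BYTES)
            for term, positions in entries]

def find_term(entries, term):
    """Binary-search the sorted entries; return the positions or None."""
    terms = [t for t, _ in entries]
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return entries[i][1]
    return None
```

With the merged example from step (3), the sorted order is da, li, wu, and `find_term` locates "li" without scanning the whole list.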
The above description of the embodiment is fairly specific, but it should not be understood as limiting the present invention to this embodiment; the scope of protection of this patent is defined by the claims.

Claims (2)

1. A new text retrieval method based on pinyin, characterized in that Chinese characters are converted into pinyin, thereby avoiding Chinese word segmentation.
Pinyin is used to perform full-text retrieval, which greatly speeds up retrieval and improves the user experience.
2. The method of claim 1, wherein the new pinyin-based text retrieval method proposed by the present invention is described as follows:
I. Building the index
(1) Document segmentation
After the computer reads a document through the I/O port, the document must be divided into blocks. The segmentation principles are as follows:
1. Block size does not exceed 64 KB
The index file is stored in binary form. To compress the index file as much as possible, each pinyin-term item and each length is represented with 2 bytes. A uniform byte count simplifies encoding, decoding, and fast pointer movement. Too few bytes would force frequent block splitting, while too many would leave many bytes empty and waste space. The block size is calculated as follows:
2^16 = 65536. One byte is 8 bits and two bytes are 16 bits, so two bytes can represent 65536 B; 1 KB = 1024 B, so 65536 B / 1024 = 64 KB.
2. Segmentation must not break a sentence
Because pinyin-based full-text retrieval relies on the natural distance between Chinese characters, dividing the document into blocks must not break a sentence, or errors will result. The correct procedure is to set the pointer's step to 64 KB when cutting the document and then move the pointer forward, checking continuously for a sentence separator such as "，", "。", or "？". When such a separator (a punctuation mark with no meaning of its own) is detected, the content between the start pointer and the computed end pointer becomes a new "block" and waits for the next step. Under normal Chinese usage the end pointer generally moves no more than 200 bytes: a Chinese character occupies two bytes, so 100 characters occupy 200 bytes, and a correctly written Chinese sentence will contain a separator within 100 characters. If the pointer would have to move further than this, the end pointer reverts to the start pointer plus 64 KB. Take the sentence "…here is Communication University of China (中国传媒大学), welcome to visit and study here…". The correct split is at the separator "，": the text before the comma is one block and the text after the comma is another. Any other split would break the sentence. Because positions are counted separately within each block, if the sentence were split as "…here is 中国传" + "媒大学, welcome to visit and study here…", then "传" and "媒大学" would belong to different blocks and their physical positions would be unrelated, so words such as "传媒" (media) and "传媒大学" (media university) would be destroyed.
(2) Conversion of Chinese characters to pinyin
Filter condition: the stop-word list (a table of words that carry no substantive meaning).
The whole text is traversed once. Each Chinese character that does not meet the filter condition (that is, is not in the stop-word list) is "translated" into toneless pinyin, and the pinyin term and its physical position value are recorded. For example, the name of the computer school of Communication University of China becomes, after conversion: zhong:1; guo:2; chuan:3; mei:4; da:5; xue:6; ji:7; suan:8; ji:9; Chinese characters or other characters that meet the filter condition are filtered out directly, that is, they are not converted.
The detailed conversion process is as follows:
1. Move the pointer and read a character.
2. Check the filter rules.
3. If the current character falls within the filter range, jump directly to step 1.
4. Obtain the hexadecimal value of the character and look up its pinyin by binary search in the Chinese character encoding table.
5. Query the stack to determine whether this pinyin already exists. If it does, locate it and add the entry to its queue; if it does not, push it onto the stack.
6. Repeat steps 1 to 5.
(3) Merging pinyin terms
This step merges identical pinyin terms and stores the physical position values in ascending order. For example, the primary index da:1; wu:27; li:34; da:3; li:41; becomes da:1,3; wu:27; li:34,41; after merging.
(4) Sorting index entries
This step sorts the entries by pinyin term. Sorting the result above gives: da:1,3; li:34,41; wu:27;
(5) Calculate the length of each entry and output the index file.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310018105.5A CN103116607B (en) 2013-01-18 2013-01-18 A kind of text retrieval system based on the Chinese phonetic alphabet newly

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310018105.5A CN103116607B (en) 2013-01-18 2013-01-18 A kind of text retrieval system based on the Chinese phonetic alphabet newly

Publications (2)

Publication Number Publication Date
CN103116607A true CN103116607A (en) 2013-05-22
CN103116607B CN103116607B (en) 2016-04-13

Family

ID=48414981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310018105.5A Expired - Fee Related CN103116607B (en) 2013-01-18 2013-01-18 A kind of text retrieval system based on the Chinese phonetic alphabet newly

Country Status (1)

Country Link
CN (1) CN103116607B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291858A (en) * 2017-06-09 2017-10-24 成都索贝数码科技股份有限公司 Data indexing method based on character string suffix
CN107729351A (en) * 2017-08-29 2018-02-23 天翼爱音乐文化科技有限公司 Multilayer inquiry correcting method and system based on music searching engine
CN107870919A (en) * 2016-09-23 2018-04-03 伊姆西Ip控股有限责任公司 The method and apparatus for managing index

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009245181A (en) * 2008-03-31 2009-10-22 Nippon Telegr & Teleph Corp <Ntt> Distributed full-text retrieval system, distributed full-text retrieving method, distributed full-text retrieval program and recording medium with the program recorded
CN101930435A (en) * 2009-10-27 2010-12-29 深圳市北科瑞声科技有限公司 Method and system for retrieving organization names
CN102609455A (en) * 2012-01-12 2012-07-25 北京中科大洋科技发展股份有限公司 Method for Chinese homophone searching

Also Published As

Publication number Publication date
CN103116607B (en) 2016-04-13

Similar Documents

Publication Publication Date Title
CN105426539B (en) A kind of lucene Chinese word cutting method based on dictionary
CN106326303B (en) A kind of spoken semantic analysis system and method
TWI480746B (en) Enabling faster full-text searching using a structured data store
CN102930031B (en) By the method and system extracting bilingual parallel text in webpage
CN107818815B (en) Electronic medical record retrieval method and system
JP2016522524A (en) Method and apparatus for detecting synonymous expressions and searching related contents
CN104281702B (en) Data retrieval method and device based on electric power critical word participle
CN109033307A (en) Word polyarch vector based on CRP cluster indicates and Word sense disambiguation method
CN105608232B (en) A kind of bug knowledge modeling method based on graphic data base
CN103365992B (en) Method for realizing dictionary search of Trie tree based on one-dimensional linear space
CN104199965A (en) Semantic information retrieval method
CN105912570B (en) Resume critical field abstracting method based on hidden Markov model
CN105224518A (en) The lookup method of the computing method of text similarity and system, Similar Text and system
CN102339294B (en) Searching method and system for preprocessing keywords
CN104331446A (en) Memory map-based mass data preprocessing method
CN102867049B (en) Chinese PINYIN quick word segmentation method based on word search tree
WO2012159558A1 (en) Natural language processing method, device and system based on semantic recognition
CN102929902A (en) Character splitting method and device based on Chinese retrieval
CN112256861A (en) Rumor detection method based on search engine return result and electronic device
CN106383814A (en) Word segmentation method of English social media short text
CN101872363B (en) Method for extracting keywords
CN103116607B (en) A kind of text retrieval system based on the Chinese phonetic alphabet newly
CN104572619A (en) Application of intelligent robot interaction system in field of investing and financing
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
Watrin et al. An N-gram frequency database reference to handle MWE extraction in NLP applications

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160413

Termination date: 20170118
