CN103116607A - Full-text retrieval method based on pinyin - Google Patents
- Publication number
- CN103116607A
- Authority
- CN
- China
- Prior art keywords
- chinese
- phonetic
- full
- pointer
- pinyin
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a novel full-text retrieval method based on pinyin and belongs to the field of full-text retrieval. The difficulty of Chinese full-text retrieval lies in the complexity of Chinese grammar and semantics. Traditional full-text retrieval methods handle this difficulty by taking the word as the unit, so research on Chinese retrieval ultimately turns into research on words. Addressing the defects of existing Chinese full-text retrieval methods, the pinyin-based method avoids Chinese word segmentation altogether and improves the retrieval speed of the system compared with existing methods, thereby improving the user experience. The method further exploits the tone rules of pinyin: filtering out tones compresses the retrieval space and reduces the complexity of the system, which indirectly improves its execution efficiency.
Description
Technical field
The present invention relates to full-text retrieval technology based on pinyin.
Background
Current full-text retrieval first performs Chinese word segmentation on a document and then builds an index from the resulting words. This retrieval model depends mainly on the word segmentation method, and the efficiency and accuracy of segmentation directly affect the performance of full-text retrieval. The analysis and processing of words is the primary concern of Chinese information processing: the basic unit of a sentence is the word, so processing a sentence begins with dividing it into words, and word formation in Chinese differs from that in foreign languages. In languages such as English, German, and French, words are separated by spaces, whereas in Chinese there is no separator between characters or between words. When processing a sentence one must therefore decide how to divide it into words. The same Chinese character may form a word with the preceding character or with the following one, which makes segmentation difficult. The basic vocabulary of any language is not fixed. In foreign languages, words are easy to distinguish and newly added words can be identified by various markers, but the nature of Chinese characters makes this hard to achieve. The complexity of Chinese is determined by its own characteristics, which makes semantic analysis by computer difficult to realize.
Summary of the invention
The technical problem to be solved by the present invention is to provide a new full-text retrieval technology based on pinyin that avoids Chinese word segmentation and performs full-text retrieval using pinyin, which greatly accelerates the retrieval speed of the system and improves the user experience.
I. Retrieval process for a single document based on pinyin:
<1> User input
Receive the retrieval content entered by the user, such as keywords, sentences, or paragraphs, for example "Communication University of China" or "Communication University of China academic education".
<2> Preprocessing of the user input
Preprocessing comprises keyword extraction and the conversion of Chinese characters to pinyin.
a) Keyword extraction
This step extracts keywords from keywords, sentences, or paragraphs. If the user input consists of keywords, they are extracted by splitting on separators such as spaces or semicolons. If the user input is a sentence or a paragraph, it is preprocessed by the same rules used to build the index.
b) Conversion of Chinese characters to pinyin
The extracted keywords are stored in a search-term set, and each keyword is converted from Chinese characters to pinyin. The result is a pinyin search-term set with the same dimension as the keyword search-term set.
<3> Determine whether the keywords exist
The index file is searched for each term in the pinyin search-term set in turn. If a term is found not to exist during the query, the whole query process terminates and the result of this query is judged "no match". In other words, the existence results of all keywords in the set are intersected: if any keyword does not exist, the whole retrieval result is judged "no match"; otherwise it is marked "match".
The retrieval procedure that determines whether the keywords match is as follows:
a) Read the document's file header into memory.
b) According to the file header information, extract the blocks of the index file and hand them to different "processors".
c) Each "processor" matches the index summary in its block information against the keyword pinyin sequence sent by the control center. If some element of the pinyin sequence does not exist in the index summary, the "processor" returns a "no match" result to the control center. Otherwise the "processor" extracts, in pinyin-sequence order, the detailed position information corresponding to each summary entry, and finally checks it one by one against the positional relationship of the pinyin sequence. If the match succeeds, the match information is returned to the control center; otherwise a match failure is returned.
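The matching performed by one "processor" in step c) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `BlockIndex` and `match_block` are hypothetical, the block index is assumed to be already decoded into a dictionary from pinyin term to ascending positions, and a match is assumed to mean that the syllables of the query occur at consecutive positions.

```python
from typing import Dict, List

# Hypothetical in-memory form of one block's index:
# pinyin term -> ascending list of positions inside the block.
BlockIndex = Dict[str, List[int]]


def match_block(index: BlockIndex, pinyin_seq: List[str]) -> List[int]:
    """Return the start positions at which pinyin_seq occurs consecutively."""
    if not pinyin_seq:
        return []
    # Summary check: if any syllable is absent from the block, report "no match".
    if any(term not in index for term in pinyin_seq):
        return []
    # Detail check: the positions must line up one by one with the query order.
    position_sets = [set(index[term]) for term in pinyin_seq]
    hits = []
    for start in index[pinyin_seq[0]]:
        if all(start + offset in position_sets[offset]
               for offset in range(1, len(pinyin_seq))):
            hits.append(start)
    return hits
```

An empty result corresponds to the "no match" answer returned to the control center; a non-empty result carries the match positions.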
II. Structure of the index file for a single document:
The pinyin-based full-text retrieval method requires the computer to generate an index file automatically for each document. The physical size of this file is less than or equal to that of the original document. The index file has the following properties:
(1) It is physically independent of the original document but logically associated with it. The original document and the index file are separate and can be stored on different storage media, yet they are associated with each other and appear as a unit to the search engine.
(2) The index file is a data structure defined together with the retrieval algorithm, on the premise that the important information of the document is not changed. To compress the file and improve retrieval speed, the file is stored in binary form.
The index file consists of a file header (total number of blocks, block entry addresses, block lengths) and a file body (pinyin term summary entry address, pinyin term summary length, and the physical length list corresponding to each pinyin term).
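The source does not give the byte-level layout of the index file, so the following is only a sketch of how fixed 2-byte fields could be written and read in binary form. The functions `pack_positions` and `unpack_positions`, the little-endian byte order, and the "count followed by values" layout are all assumptions made for illustration.

```python
import struct
from typing import List, Tuple


def pack_positions(positions: List[int]) -> bytes:
    """Encode a list as a 2-byte count followed by 2-byte unsigned values."""
    # "<H" is a little-endian unsigned 16-bit integer (assumed byte order).
    return struct.pack(f"<H{len(positions)}H", len(positions), *positions)


def unpack_positions(buf: bytes, offset: int = 0) -> Tuple[List[int], int]:
    """Decode one list starting at offset; return it with the next offset."""
    (count,) = struct.unpack_from("<H", buf, offset)
    values = struct.unpack_from(f"<{count}H", buf, offset + 2)
    return list(values), offset + 2 + 2 * count
```

Because every field has the same width, a reader can step through the data by simple pointer arithmetic. A 2-byte field holds values up to 65535.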
The beneficial effects of the invention are as follows. Compared with existing Chinese full-text retrieval methods, the pinyin-based full-text retrieval technology proposed by the invention avoids segmenting Chinese into words and uses the inherent "logic" of natural language to judge the relevance and semantic validity between characters and between words. In addition, an index ordered by pinyin lends itself to binary search and other fast lookup algorithms, which makes it easy to locate elements quickly, greatly accelerates the retrieval speed of the system, and improves the user experience. The pinyin-based index also exploits the tone rules of pinyin: filtering out tones compresses the index space and reduces the complexity of the system, which indirectly improves its execution efficiency.
Description of drawings
Fig. 1: Logic diagram of the conversion of Chinese characters to pinyin.
Fig. 2: Index file structure.
Fig. 3: Processing of "block information".
Embodiment
The pinyin-based full-text retrieval scheme proposed by the invention is described further below with reference to the accompanying drawings and an embodiment, so that its characteristics and advantages become clearer.
The principle of full-text retrieval based on pinyin in the present invention is as follows:
I. Building the index
(1) Document segmentation
For any given document, after the computer reads it through the I/O port, the document must be divided into blocks. The segmentation principles are as follows:
1. The block size does not exceed 64 KB
The index file is stored in binary form. To compress the index file as much as possible, each pinyin term and each length is represented with 2 bytes. A uniform byte count makes encoding, decoding, and pointer movement fast. Too few bytes would force frequent blocking, while too many would leave many bytes empty and waste space. The block size is calculated as follows:
$2^{16} = 65536$ B. One byte is 8 bits and 2 bytes are 16 bits, so two bytes can address 65536 B. Since 1 KB = 1024 B, 65536 B / 1024 = 64 KB.
2. Segmentation must not break the integrity of a sentence
Because full-text retrieval based on pinyin relies on the natural distance between Chinese characters, dividing the document into blocks must not break a sentence, otherwise errors result. The correct procedure is to set the pointer's step to 64 KB when cutting the document and then move the pointer forward, continuously checking for sentence separators such as "，", "。", and "？". Once a separator (a symbol that carries no meaning of its own) is detected, the content between the start pointer and the computed end pointer becomes a new "block" that waits for the next step. Under normal Chinese writing habits the end pointer generally moves no more than 200 bytes (a Chinese character occupies two bytes, so 100 characters occupy 200 bytes, and a correctly written Chinese sentence will certainly contain a separator within 100 characters). If the pointer would move farther than this, the end pointer reverts to the start pointer plus 64 KB. Take the text "... this is Communication University of China, you are welcome to visit and study here ...". The correct cut is at the separator "，": the text before the comma is one "block" and the text after it is another. Any other cut breaks the integrity of the sentence. Positions are counted separately within each block. Suppose the sentence above were instead cut inside the name of the university, so that one block ends with "传" (chuan) and the next begins with "媒大学" (mei da xue). The two parts then belong to different "blocks" and their physical positions are unrelated, so terms such as "传媒" (media) and "传媒大学" (media university) are destroyed.
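The cutting rule above can be sketched as follows. This is a simplified illustration under stated assumptions: it works on characters of a Python string rather than on bytes, the function name `split_blocks`, its parameters, and the `SEPARATORS` set are hypothetical, and the defaults of 32768 characters and 100 characters stand in for 64 KB and 200 bytes at two bytes per character.

```python
from typing import List

# Assumed separator set; the source names "，", "。", "？" and "etc.".
SEPARATORS = set("，。？！；,.?!;")


def split_blocks(text: str, block_chars: int = 32768, max_scan: int = 100) -> List[str]:
    """Cut text into blocks that end on a sentence separator where possible."""
    blocks = []
    start = 0
    while start < len(text):
        # Tentative cut: advance the end pointer by one full block.
        end = min(start + block_chars, len(text))
        # Move forward until the block ends with a separator,
        # looking at most max_scan characters past the tentative cut.
        scan = end
        limit = min(end + max_scan, len(text))
        while scan < limit and text[scan - 1] not in SEPARATORS:
            scan += 1
        if scan == len(text) or text[scan - 1] in SEPARATORS:
            end = scan
        # Otherwise no separator was found in range: keep the tentative cut.
        blocks.append(text[start:end])
        start = end
    return blocks
```

With a small block size, a sentence such as "这里是中国传媒大学，欢迎来这里参观学习。" is cut after the comma rather than inside the university's name. The Chinese wording of this sample sentence is itself an assumption about the text behind the translated example.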
(2) Conversion of Chinese characters to pinyin
Filter condition: the stop-word list (a list of words that carry no real meaning).
The whole text is traversed once. Each Chinese character that does not meet the filter condition (that is, is not in the stop-word list) is "translated" into pinyin without tones, and the pinyin term and its physical position are recorded. For example, "Computer School of Communication University of China" becomes, after conversion: zhong:1; guo:2; chuan:3; mei:4; da:5; xue:6; ji:7; suan:8; ji:9; Chinese characters or other characters that meet the filter condition are filtered out directly, that is, they are not converted.
The conversion procedure is as follows:
1. Move the pointer and obtain a character.
2. Check the filter rules.
3. If the current character is within the filter range, jump directly to step 1.
4. Obtain the hexadecimal code of the character and perform a binary search in the Chinese character code table to obtain its pinyin.
5. Query the stack to determine whether this pinyin already exists. If it exists, find it and append the position to its queue; if it is not found, push it onto the stack.
6. Repeat steps 1 to 5.
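The conversion steps can be sketched as below. The ten-entry code table, the stop-word set, and the function names are hypothetical stand-ins for the full Chinese character code table and stop-word list, which the source does not reproduce. The sketch emits raw (pinyin, position) pairs and leaves grouping to the merge step.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

# Toy code table sorted by code point; a real table would cover the whole
# character set. Tones are already dropped, as the method requires.
_TABLE = sorted(
    {
        "中": "zhong", "国": "guo", "传": "chuan", "媒": "mei", "大": "da",
        "学": "xue", "计": "ji", "算": "suan", "机": "ji", "的": "de",
    }.items(),
    key=lambda item: ord(item[0]),
)
_CODES = [ord(ch) for ch, _ in _TABLE]
STOP_WORDS = {"的"}  # assumed stop-word list


def lookup_pinyin(ch: str) -> Optional[str]:
    """Binary-search the code table for the character's pinyin."""
    i = bisect_left(_CODES, ord(ch))
    if i < len(_CODES) and _CODES[i] == ord(ch):
        return _TABLE[i][1]
    return None


def to_pinyin_pairs(text: str) -> List[Tuple[str, int]]:
    """Return (pinyin, position) pairs, skipping filtered characters."""
    pairs = []
    for pos, ch in enumerate(text, start=1):  # positions start at 1
        if ch in STOP_WORDS:
            continue  # filtered characters are not converted
        pinyin = lookup_pinyin(ch)
        if pinyin is not None:
            pairs.append((pinyin, pos))
    return pairs
```

A filtered character still occupies its position, so the positions of the remaining characters are unchanged.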
(3) Merging pinyin terms
This step merges identical pinyin terms and stores their physical positions in ascending order. If the original index is da:1; wu:27; li:34; da:3; li:41; then after merging it is da:1,3; wu:27; li:34,41;
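A minimal sketch of the merge step, assuming the raw index is held as a list of (pinyin, position) pairs; the function name `merge_terms` is hypothetical.

```python
from typing import Dict, List, Tuple


def merge_terms(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group positions by pinyin term and sort each group in ascending order."""
    merged: Dict[str, List[int]] = {}
    for term, pos in pairs:
        merged.setdefault(term, []).append(pos)
    for positions in merged.values():
        positions.sort()
    return merged
```

Applied to the example above, the pairs da:1, wu:27, li:34, da:3, li:41 are grouped into da:1,3; wu:27; li:34,41.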
(4) Sorting index entries
This step sorts the entries by pinyin term. After sorting, the result above becomes: da:1,3; li:34,41; wu:27;
(5) Calculate the length of each entry and output the index file.
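Sorting the entries is what makes binary search over pinyin terms possible at query time. The sketch below assumes the merged index is a dictionary; the names `sort_index` and `find_term` and the two-parallel-lists representation are illustrative choices, not taken from the source.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple


def sort_index(merged: Dict[str, List[int]]) -> Tuple[List[str], List[List[int]]]:
    """Return the terms in sorted order with their position lists alongside."""
    terms = sorted(merged)
    return terms, [merged[term] for term in terms]


def find_term(terms: List[str], postings: List[List[int]], term: str) -> Optional[List[int]]:
    """Binary-search the sorted term list; return the positions or None."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return None
```

A lookup that returns `None` corresponds to a pinyin term that is absent from the index summary.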
The above description of the embodiment is fairly specific, but it should not be understood as limiting the invention to this embodiment. The scope of protection of this patent is defined by the claims.
Claims (2)
1. A new text retrieval method based on pinyin, characterized in that Chinese characters are converted into pinyin, thereby avoiding Chinese word segmentation.
Full-text retrieval is performed using pinyin, which greatly accelerates the retrieval speed of the system and improves the user experience.
2. The method of claim 1, wherein the new text retrieval method based on pinyin proposed by the invention is described as follows:
I. Building the index
(1) Document segmentation
For any given document, after the computer reads it through the I/O port, the document must be divided into blocks. The segmentation principles are as follows:
1. The block size does not exceed 64 KB
The index file is stored in binary form. To compress the index file as much as possible, each pinyin term and each length is represented with 2 bytes. A uniform byte count makes encoding, decoding, and pointer movement fast. Too few bytes would force frequent blocking, while too many would leave many bytes empty and waste space. The block size is calculated as follows:
$2^{16} = 65536$ B. One byte is 8 bits and 2 bytes are 16 bits, so two bytes can address 65536 B. Since 1 KB = 1024 B, 65536 B / 1024 = 64 KB.
2. Segmentation must not break the integrity of a sentence
Because full-text retrieval based on pinyin relies on the natural distance between Chinese characters, dividing the document into blocks must not break a sentence, otherwise errors result. The correct procedure is to set the pointer's step to 64 KB when cutting the document and then move the pointer forward, continuously checking for sentence separators such as "，", "。", and "？". Once a separator (a symbol that carries no meaning of its own) is detected, the content between the start pointer and the computed end pointer becomes a new "block" that waits for the next step. Under normal Chinese writing habits the end pointer generally moves no more than 200 bytes (a Chinese character occupies two bytes, so 100 characters occupy 200 bytes, and a correctly written Chinese sentence will certainly contain a separator within 100 characters). If the pointer would move farther than this, the end pointer reverts to the start pointer plus 64 KB. Take the text "... this is Communication University of China, you are welcome to visit and study here ...". The correct cut is at the separator "，": the text before the comma is one "block" and the text after it is another. Any other cut breaks the integrity of the sentence. Positions are counted separately within each block. Suppose the sentence above were instead cut inside the name of the university, so that one block ends with "传" (chuan) and the next begins with "媒大学" (mei da xue). The two parts then belong to different "blocks" and their physical positions are unrelated, so terms such as "传媒" (media) and "传媒大学" (media university) are destroyed.
(2) Conversion of Chinese characters to pinyin
Filter condition: the stop-word list (a list of words that carry no real meaning).
The whole text is traversed once. Each Chinese character that does not meet the filter condition (that is, is not in the stop-word list) is "translated" into pinyin without tones, and the pinyin term and its physical position are recorded. For example, "Computer School of Communication University of China" becomes, after conversion: zhong:1; guo:2; chuan:3; mei:4; da:5; xue:6; ji:7; suan:8; ji:9; Chinese characters or other characters that meet the filter condition are filtered out directly, that is, they are not converted.
The conversion procedure is as follows:
1. Move the pointer and obtain a character.
2. Check the filter rules.
3. If the current character is within the filter range, jump directly to step 1.
4. Obtain the hexadecimal code of the character and perform a binary search in the Chinese character code table to obtain its pinyin.
5. Query the stack to determine whether this pinyin already exists. If it exists, find it and append the position to its queue; if it is not found, push it onto the stack.
6. Repeat steps 1 to 5.
(3) Merging pinyin terms
This step merges identical pinyin terms and stores their physical positions in ascending order. If the original index is da:1; wu:27; li:34; da:3; li:41; then after merging it is da:1,3; wu:27; li:34,41;
(4) Sorting index entries
This step sorts the entries by pinyin term. After sorting, the result above becomes: da:1,3; li:34,41; wu:27;
(5) Calculate the length of each entry and output the index file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310018105.5A CN103116607B (en) | 2013-01-18 | 2013-01-18 | A kind of text retrieval system based on the Chinese phonetic alphabet newly |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103116607A true CN103116607A (en) | 2013-05-22 |
CN103116607B CN103116607B (en) | 2016-04-13 |
Family
ID=48414981
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310018105.5A Expired - Fee Related CN103116607B (en) | 2013-01-18 | 2013-01-18 | A kind of text retrieval system based on the Chinese phonetic alphabet newly |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103116607B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291858A (en) * | 2017-06-09 | 2017-10-24 | 成都索贝数码科技股份有限公司 | Data indexing method based on character string suffix |
CN107729351A (en) * | 2017-08-29 | 2018-02-23 | 天翼爱音乐文化科技有限公司 | Multilayer inquiry correcting method and system based on music searching engine |
CN107870919A (en) * | 2016-09-23 | 2018-04-03 | 伊姆西Ip控股有限责任公司 | The method and apparatus for managing index |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009245181A (en) * | 2008-03-31 | 2009-10-22 | Nippon Telegr & Teleph Corp <Ntt> | Distributed full-text retrieval system, distributed full-text retrieving method, distributed full-text retrieval program and recording medium with the program recorded |
CN101930435A (en) * | 2009-10-27 | 2010-12-29 | 深圳市北科瑞声科技有限公司 | Method and system for retrieving organization names |
CN102609455A (en) * | 2012-01-12 | 2012-07-25 | 北京中科大洋科技发展股份有限公司 | Method for Chinese homophone searching |
Also Published As
Publication number | Publication date |
---|---|
CN103116607B (en) | 2016-04-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105426539B (en) | A kind of lucene Chinese word cutting method based on dictionary | |
CN106326303B (en) | A kind of spoken semantic analysis system and method | |
TWI480746B (en) | Enabling faster full-text searching using a structured data store | |
CN102930031B (en) | By the method and system extracting bilingual parallel text in webpage | |
CN107818815B (en) | Electronic medical record retrieval method and system | |
JP2016522524A (en) | Method and apparatus for detecting synonymous expressions and searching related contents | |
CN104281702B (en) | Data retrieval method and device based on electric power critical word participle | |
CN109033307A (en) | Word polyarch vector based on CRP cluster indicates and Word sense disambiguation method | |
CN105608232B (en) | A kind of bug knowledge modeling method based on graphic data base | |
CN103365992B (en) | Method for realizing dictionary search of Trie tree based on one-dimensional linear space | |
CN104199965A (en) | Semantic information retrieval method | |
CN105912570B (en) | Resume critical field abstracting method based on hidden Markov model | |
CN105224518A (en) | The lookup method of the computing method of text similarity and system, Similar Text and system | |
CN102339294B (en) | Searching method and system for preprocessing keywords | |
CN104331446A (en) | Memory map-based mass data preprocessing method | |
CN102867049B (en) | Chinese PINYIN quick word segmentation method based on word search tree | |
WO2012159558A1 (en) | Natural language processing method, device and system based on semantic recognition | |
CN102929902A (en) | Character splitting method and device based on Chinese retrieval | |
CN112256861A (en) | Rumor detection method based on search engine return result and electronic device | |
CN106383814A (en) | Word segmentation method of English social media short text | |
CN101872363B (en) | Method for extracting keywords | |
CN103116607B (en) | A kind of text retrieval system based on the Chinese phonetic alphabet newly | |
CN104572619A (en) | Application of intelligent robot interaction system in field of investing and financing | |
CN102722526B (en) | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method | |
Watrin et al. | An N-gram frequency database reference to handle MWE extraction in NLP applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160413 Termination date: 20170118 |