CN1588364A - Prime number replacing character string search technology - Google Patents

Publication number
CN1588364A
Authority
CN
China
Prior art keywords
prime number
character string
character
value
radical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA200410067258XA
Other languages
Chinese (zh)
Inventor
徐文新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CNA200410067258XA priority Critical patent/CN1588364A/en
Publication of CN1588364A publication Critical patent/CN1588364A/en
Priority to PCT/CN2005/001493 priority patent/WO2006058476A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Several prime numbers N1, N2, ... replace several base characters P1, P2, ...; the product N1*N2... of those primes, called the F value, replaces the character string H made up of those characters. Character strings H1, H2, H3, H4, ... thus have values F1, F2, F3, F4, ..., from which a character string information database is built. When Fn is exactly divisible by N1*N2..., the corresponding string Hn contains the base characters P1, P2, ... that correspond to N1, N2, ..., so fuzzy search of character strings is carried out on numeric data. For long integer data the same search can also be performed with the modulo operation. Similarly, when the Chinese character radicals P1, P2, ... are assigned primes N1, N2, ..., the Chinese characters H1, H2, H3, H4, ... have corresponding radical products F1, F2, F3, F4, ... and can be searched by their radical combinations.

Description

Prime number replacing character string search technology
Technical field
The present invention is a database retrieval technique that represents each base character with a prime number and each character string with the product of those primes. The product is divided by a single prime or by the product of several primes; if the division is exact, or equivalently the modulo result is 0, the corresponding string is taken to contain the character or characters that those primes represent. The main purpose is to "retrieve any Chinese character by a combination of radicals at any level"; the technique can also be used to improve fuzzy search over character strings in general databases.
Background technology
Rare, hard-to-enter Chinese characters usually cannot be typed with common shape-based codes such as Wubi. They have to be entered through a full-pinyin input method or by internal code, which is inefficient, and no effective workaround has appeared so far.
On the other hand, fuzzy search over character strings in a database is carried out by character-by-character comparison. To decide whether the string bdopfqew contains the character f, the computer compares f against the string one position at a time, starting from b, which is inefficient. Even indexing the string field does not effectively improve the efficiency of fuzzy search.
Developing an effective lookup method for Chinese characters, and at the same time improving fuzzy string search in general databases, would bring cultural, economic and other benefits.
Summary of the invention
The present invention is aimed mainly at retrieving hard-to-enter Chinese characters, so the explanation below concentrates on Chinese character retrieval; fuzzy search over other database strings can be implemented by analogy.
For example, the character "Ji" is within the GBK range but usually cannot be entered with shape-based codes. The user must look up its pronunciation, ji, in a dictionary, type ji in a full-pinyin input method, and page through 28 pages of candidates before "Ji" appears, which costs time and effort. Chinese characters are built from radicals, so a Chinese radical database should be compiled that finds a character from its component radicals. "Ji" is made up of the two radicals Lv and "Fishnet"; "Fishnet" in turn is made up of Si, factory and "Yan"; "Yan" is made up of "inflammation" and Dao; and "inflammation" is made up of fire and fire. Because different users decompose a character differently, an ideal radical database must hold the radicals of a character at every level: "Ji" must be resolved into Lv, "Fishnet", Si, factory, "Yan", "inflammation", Dao, fire and fire. Only then can the goal of "retrieving any Chinese character by radicals at any level" be met.
On this basis, to simplify programming and raise retrieval speed, the invention assigns a prime number N to each of the several hundred basic radicals P of Chinese characters. A higher-level radical S composed of basic radicals P1, P2, ... is given the value M=N1*N2... A whole character H composed of radicals P1, P2, P3, ... is given the value F, the product of the corresponding primes N1, N2, N3, ...; F is therefore also the product of the values M1, M2, ... of the higher-level radicals S1, S2, ... that make up the character.
The primes N of the basic radicals P, the products M of the higher-level radicals S, and the values F of whole characters H together form a character-structure information database. If a value F1 in the database is exactly divisible by N1, the corresponding character H1 must contain radical P1; a value Fn that is not divisible by N1 belongs to a character Hn that does not contain P1.
Likewise, if a value F2 in the database is exactly divisible by N1*N2..., the corresponding character H2 must contain all of the radicals P1, P2, ...; a value Fn that is not divisible by N1*N2... belongs to a character Hn that does not contain all of P1, P2, ...
In this way, "retrieving any Chinese character by radicals at any level" can be done quickly and conveniently.
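The multi-level scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prime assignments are hypothetical (they are not the patent's table), and the radical names are the romanized placeholders used in the "Ji" example, not real glyphs.

```python
from math import prod

# Hypothetical primes for basic radicals P (assumed values, for illustration).
BASIC = {"fire": 2, "Dao": 3, "factory": 5, "Si": 7, "Lv": 11}

# Higher-level radicals S and the whole character, each listed by its parts.
COMPOUND = {
    "inflammation": ["fire", "fire"],
    "Yan": ["inflammation", "Dao"],
    "Fishnet": ["Si", "factory", "Yan"],
    "Ji": ["Lv", "Fishnet"],
}

def value(name):
    """Return N for a basic radical, else the product of its parts (M or F)."""
    if name in BASIC:
        return BASIC[name]
    return prod(value(part) for part in COMPOUND[name])

def contains(character, *radicals):
    """True if the character's value is exactly divisible by the query product."""
    divisor = prod(value(r) for r in radicals)
    return value(character) % divisor == 0

print(value("Ji"))                    # F = 11 * 7 * 5 * (2 * 2 * 3)
print(contains("Ji", "Yan"))          # query with a mid-level radical
print(contains("Ji", "fire", "Lv"))   # query with two basic radicals
print(contains("Fishnet", "Lv"))      # "Fishnet" has no Lv component
```

Because every higher-level value M is itself a product of basic primes, the same divisibility test works whether the user names a basic radical or a larger component.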
One, basic rules for assigning the radical value N:
1. Normally one prime N represents one basic radical, for example 7 represents "wood" and 11 represents "Ren".
One prime N may also represent several radicals of similar shape, because some radicals in Chinese characters are hard to tell apart: whether a component in the middle of a character is "day" (日) or "say" (曰) is not obvious at a glance, so "day" and "say" are given the same prime, 47. Moreover, because this is a query technique in which results are looked up and then listed for the user to choose from, a single prime can stand for several rarely used radicals at once, for example 3581 for radicals such as "Marginwidth".
2. For the 21,000 Chinese characters in the GBK range, 600 to 700 primes N below 5000 can represent the basic radicals P. The M values of about 2,000 higher-level radicals are the products of the primes of their basic radicals, N1*N2... The F values of the remaining roughly 20,000 characters are the products of the primes of their basic radicals, N1*N2..., and equally the products of the values of their radicals at each level, M1*M2...
3. When primes are assigned, the basic radicals that make up complex characters should get the smaller primes first. This keeps the F values of complex characters from growing so large that they overflow or force the use of a wider, slower data type. For example, basic radicals such as fire, big, wood, Rui, separate and dog, which make up characters such as Cuan, Kui, [character image A20041006725800064], Jiangxi, Shu and Ba, should as far as possible be given primes N below 199 so that the F values of these characters do not become too large.
Two, basic rules for choosing the data type of the character value F:
1. For the F values of the 21,000 characters in the GBK range, double precision is a suitable data type, provided every F value stays below 1.00000000000000E+15. Larger values are truncated when stored, so the stored number is no longer the product N1*N2... and that F value becomes invalid.
2. To speed up retrieval, F can be stored as a long integer, but it is hard to keep the F values of the 21,000 GBK characters within 2147483647. One option is to skip the deep, fine-grained decomposition of characters and work with large radicals only, but then the goal of "retrieving any Chinese character by radicals at any level" is not met. Another option is to let one prime N represent several basic radicals, at the cost that queries also return unintended characters.
3. To go beyond the GBK range, so that all of the more than 50,000 Chinese characters can be "retrieved by radicals at any level", together with lookup of similar-shaped characters and cross-lookup between traditional and simplified forms, a wider data type may be necessary. The currency data type has a maximum of 922 337 203 685 477.5807; F can be divided by 10000 before it is stored as currency and multiplied by 10000 again at query time, which makes storing and computing F straightforward. Currency arithmetic is, however, much slower than double precision, so whether to use it depends on the available computing power and the application. The decimal data type is more precise than double-precision floating point and has a maximum of 79 228 162 514 264 337 593 543 950 335, which is very convenient for F, but its performance remains to be evaluated.
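The truncation risk with double precision can be checked with a short Python sketch. It assumes that Python's float is an IEEE 754 double (true on common platforms) and relies on Python integers being arbitrary precision; the sample F values are made up for illustration.

```python
# Every integer up to 2**53 is exactly representable as an IEEE 754 double;
# the 1E+15 ceiling quoted above sits just below that limit.
EXACT_LIMIT = 2 ** 53

def fits_double(f_value):
    """True if the integer F value survives a round trip through a double."""
    return int(float(f_value)) == f_value

small = 2 * 3 * 5 * 7 * 11 * 13   # a modest product of primes
large = 3 ** 34                   # an odd product above 2**53 (illustrative)

print(10 ** 15 < EXACT_LIMIT)     # the 1E+15 rule stays inside the exact range
print(fits_double(small))         # stored exactly
print(fits_double(large))         # rounded when stored as a double
print(int(float(large)) % 3)      # the stored value is no longer divisible by 3
```

Once the stored number has been rounded, the divisibility test gives wrong answers, which is why such F values are described as invalid.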
Three, suggestions on database operations:
1. Indexing a string field in a database does not effectively improve fuzzy search. With the present invention a string is stored as a prime product, so the field can be indexed and sorted to raise efficiency. If the query radicals are P1, P2, ..., the divisor is N1*N2...; the index makes it possible to skip every Fn smaller than N1*N2... and to test only the Fn values greater than or equal to it. To take advantage of this, commonly used radicals can be given larger primes, so that the divisors N1*N2... of frequent queries are larger on average and fewer values need to be tested.
2. A modulo result of 0 means the same as exact division, but the modulo operation supports only long integers.
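The index suggestion in item 1 can be sketched as follows. A sorted Python list stands in for the database index (an assumption made for illustration), and the F values are arbitrary sample numbers.

```python
from bisect import bisect_left

def search(sorted_f, divisor):
    """Return the F values divisible by divisor, skipping every F < divisor.

    sorted_f must be in ascending order; it plays the role of the index.
    """
    start = bisect_left(sorted_f, divisor)   # first position with F >= divisor
    return [f for f in sorted_f[start:] if f % divisor == 0]

f_values = sorted([15, 30, 105, 165, 14, 22, 6])

print(search(f_values, 15))   # values below 15 are never tested
print(search(f_values, 35))   # a larger divisor skips more of the list
```

A value smaller than the divisor can never be a multiple of it, so skipping that part of the sorted list does not lose any results.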
Embodiment
Embodiment 1:
Basic radicals and strokes of Chinese characters are represented by 400 to 600 primes N, for example 2 for "big" (大), 3 for "fourth" (丁), 5 for "mouth" (口), 7 for "wood" (木) and 11 for "Ren" (亻). Every whole character is then assigned the product F of the primes of its basic radicals. "Can" (可) is made up of "fourth" and "mouth", and 3*5=15, so its F value is 15. In the same way, the F value of "very" (奇) is 2*3*5=30, that of "Ke" (柯) is 7*3*5=105, and that of "how" (何) is 11*3*5=165. A database of all Chinese characters and their F values can be built in this way.
Once every character has been assigned a value in this way, the F value of any character that contains a given radical is exactly divisible by that radical's prime N. "Ke" contains "mouth", so its F value is divisible by 5, the prime for "mouth". Conversely, every character in the database whose F value is divisible by 5 contains "mouth". Every character whose F value is divisible by 15 (3*5) contains both "fourth" and "mouth", and this is how all the other characters that contain "can" are found.
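The arithmetic of this embodiment can be reproduced with a small Python sketch. The glyphs are an assumed reading of the machine-translated radical and character names, and the character 林 is an addition for contrast that does not appear in the source.

```python
from math import prod

# Primes from Embodiment 1; the glyphs are an assumed reading of the
# translated names ("greatly", "fourth", "mouth", "wood", "Ren").
RADICAL_PRIME = {"大": 2, "丁": 3, "口": 5, "木": 7, "亻": 11}

# Each character with its basic radicals (assumed decomposition).
CHARACTERS = {
    "可": ["丁", "口"],
    "奇": ["大", "丁", "口"],
    "柯": ["木", "丁", "口"],
    "何": ["亻", "丁", "口"],
    "林": ["木", "木"],          # not in the source, added for contrast
}

# The "database": character -> F value.
F_TABLE = {ch: prod(RADICAL_PRIME[r] for r in parts)
           for ch, parts in CHARACTERS.items()}

def find(*radicals):
    """Return the characters whose F value is divisible by the query product."""
    divisor = prod(RADICAL_PRIME[r] for r in radicals)
    return [ch for ch, f in F_TABLE.items() if f % divisor == 0]

print(F_TABLE)
print(find("丁", "口"))   # divisor 15: every character containing 可
print(find("木", "口"))   # divisor 35
```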
This approach is particularly suited to finding hard-to-enter characters of unknown pronunciation from a few of their common radicals, for example finding [character image A20041006725800081] from several of the radicals "wood", "narrow-necked earthen jar", "Mi", "Yichang", "Qe" and "an ancient type of spoon".
Test conditions: Windows XP operating system, 800 MHz Celeron, 256 MB of memory, Gigabyte 810 motherboard.
Test result: an information database of the 21,000 Chinese characters in the GBK range was built with Access, with F stored as double precision and the program written in VB6.0. In preliminary tests, querying characters by any radical or combination of radicals took less than 1 second on average.
Embodiment 2:
The English letters a-z and A-Z are represented by prime numbers such as 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, so that an English word can be represented by the product F of the primes of its letters. For example, the value of able is 2*3*37*11=2442. A database of English words and their F values can be built in this way. The F value of every word in the database whose suffix contains able must be divisible by 2442, so all such words can be found by testing divisibility by 2442.
This method also returns unintended words such as bale, so it is not an exact query. It does narrow the search range effectively; a second pass using ordinary character comparison then gives the exact result.
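The two-pass search of this embodiment can be sketched in Python. The sketch assumes that primes are assigned to a-z and then A-Z in ascending order, which matches the able example (a=2, b=3, e=11, l=37); the word list is made up for illustration.

```python
from math import prod
from string import ascii_lowercase, ascii_uppercase

def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

LETTERS = ascii_lowercase + ascii_uppercase
LETTER_PRIME = dict(zip(LETTERS, first_primes(len(LETTERS))))

def f_value(word):
    """Product of the primes of the word's letters."""
    return prod(LETTER_PRIME[c] for c in word)

def search(words, fragment):
    """First pass: divisibility filter. Second pass: plain substring check."""
    divisor = f_value(fragment)
    candidates = [w for w in words if f_value(w) % divisor == 0]
    exact = [w for w in candidates if fragment in w]
    return exact, candidates

words = ["table", "bale", "cable", "blame", "tree"]
exact, candidates = search(words, "able")
print(f_value("able"))   # 2 * 3 * 37 * 11
print(candidates)        # includes anagram-style matches such as "bale"
print(exact)             # after the character-comparison pass
```

The product ignores letter order, so the first pass can only narrow the candidates; the second pass is what makes the result exact.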

Claims (8)

1. A character string fuzzy search technique, characterized in that:
Several prime numbers N1, N2, ... represent several base characters P1, P2, ...; the product N1*N2... of several primes, called the F value, represents the character string H formed from the corresponding base characters P1, P2, ...; character strings H1, H2, H3, H4, ... are thus represented by the products F1, F2, F3, F4, ... of the primes of their respective base characters. If Fn is exactly divisible by N1*N2..., the corresponding character string Hn contains the base characters P1, P2, ... corresponding to N1, N2, ..., whereby fuzzy search of character strings is realized. For the long integer data type, retrieval can also be realized by testing whether the modulo result is 0.
2. The method according to claim 1, characterized in that: on this basis, the writing of program code for fuzzy search of character strings with multiple keywords is simplified.
3. The method according to claim 1, characterized in that: the character string values F in the database are sorted and indexed, and F values smaller than N1*N2... are excluded from the division or modulo operation, to raise efficiency.
4. The method according to claim 1, characterized in that: for databases of Chinese terms and proper nouns in which the Chinese character is the base character, one prime represents several Chinese characters when the database is built, to avoid the data overflow that large primes would cause; exact division or modulo is used as a preliminary search to narrow the range, and ordinary character-by-character string comparison is then used for further retrieval.
5. The method according to claims 1 and 4, characterized in that: in other languages, for databases of phrases and proper nouns in which the word is the base character, one prime represents several words when the database is built, to avoid the data overflow that large primes would cause; the range is narrowed in a preliminary search, and ordinary character-by-character string comparison is then used for further retrieval.
6. The method according to claim 1, characterized in that: the basic radicals P1, P2, ... of Chinese characters are assigned primes N1, N2, ..., and the Chinese characters H1, H2, H3, H4, ... are represented by the products F1, F2, F3, F4, ... of the primes of their respective radicals. If Fn is exactly divisible by N1*N2..., the corresponding Chinese character Hn contains the basic radicals P1, P2, ... corresponding to N1, N2, ..., so that any Chinese character can be retrieved quickly and effectively by a combination of radicals at any level.
7. The method according to claims 4 and 6, characterized in that: the basic rule is that one prime represents one basic radical, but one prime may also represent several similar radicals.
8. The method according to claim 1, characterized in that: in other languages, primes N are assigned per syllable; the F value of a word is the product N1*N2... of the primes of its syllables P1, P2, ...; a vocabulary database is built on this basis and used for fuzzy search.
CNA200410067258XA 2004-10-19 2004-10-19 Prime number replacing character string search technology Pending CN1588364A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CNA200410067258XA CN1588364A (en) 2004-10-19 2004-10-19 Prime number replacing character string search technology
PCT/CN2005/001493 WO2006058476A1 (en) 2004-10-19 2005-09-19 Prime number replacing character string search technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA200410067258XA CN1588364A (en) 2004-10-19 2004-10-19 Prime number replacing character string search technology

Publications (1)

Publication Number Publication Date
CN1588364A true CN1588364A (en) 2005-03-02

Family

ID=34604142

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA200410067258XA Pending CN1588364A (en) 2004-10-19 2004-10-19 Prime number replacing character string search technology

Country Status (2)

Country Link
CN (1) CN1588364A (en)
WO (1) WO2006058476A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573012A (en) * 2017-11-24 2018-09-25 北京金山云网络技术有限公司 A kind of data processing method, device, equipment and storage medium
CN109858969A (en) * 2019-01-31 2019-06-07 上海玄霆娱乐信息科技有限公司 A kind of favourable price based on business configuration determines method and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102016215809A1 (en) * 2016-08-23 2018-03-01 Siemens Aktiengesellschaft Monitoring a display of a cab of a means of transport

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002157244A (en) * 2000-11-20 2002-05-31 Ricoh Co Ltd Device and method for analyzing japanese morpheme and storage medium
US7120248B2 (en) * 2001-03-26 2006-10-10 Hewlett-Packard Development Company, L.P. Multiple prime number generation using a parallel prime number search algorithm
CN1461136A (en) * 2003-06-10 2003-12-10 姚庆生 Method for quickly-searching internal telephone book and its telephone set

Also Published As

Publication number Publication date
WO2006058476A1 (en) 2006-06-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20050302