CN100483409C - Word data searching method - Google Patents

Word data searching method

Publication number: CN100483409C
Authority: CN
Grant status: Grant
Prior art keywords: index, data, character, character data, corresponding
Application number: CN 200510131812
Other languages: Chinese (zh)
Other versions: CN1776688A (en)
Inventors: 林剑峰, 陈亮
Original assignee: 北京金山软件有限公司
Priority date: 2005-12-15
Filing date: 2005-12-15
Publication date: 2009-04-29

Abstract

The present invention relates to a method for retrieving character data. The method comprises: 1) extracting the first two letters of each character-data item in the source database and using each two-letter combination as an index entry to build an index; 2) recording the storage address of the character data corresponding to each index entry; 3) loading the index into memory; 4) loading into memory, according to the first two letters of the input keyword, the character data corresponding to that two-letter index entry; and 5) traversing the character data in memory and outputting the data that match the keyword. The invention quickly returns character data similar to the keyword, reduces memory usage, and achieves a higher matching speed.

Description

A method for retrieving character data

Technical Field

The present invention relates to data retrieval technology, and in particular to a method for retrieving character data.

Background

When querying dictionary-type data, users are often unsure of the exact spelling of the keyword, or need to find data items that share some of the same characters. In both cases, a set of similar data items can be looked up in the source data from the uncertain string that the user enters and offered for reference; keywords containing wildcards are also supported. The wildcard "?" or "*" is added to a keyword to find all character data identical or similar to what is expected. The wildcard "?" matches a single character in a file name, and the wildcard "*" matches zero or more characters. For example, the pattern data?.dat matches files such as data1.dat or dataN.dat. Replacing the "?" with "*" (the pattern data*.dat) widens the range of files retrieved, so that files such as data12222.dat or data12XF.dat also appear in the query results.
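The wildcard behaviour described above can be illustrated with a short sketch. It uses Python's standard fnmatch module, which is an assumption made here for illustration and not part of the patent; the file list, including data.dat, is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen to mirror the examples in the text.
names = ["data1.dat", "dataN.dat", "data12222.dat", "data12XF.dat", "data.dat"]

# "?" matches exactly one character.
single = [n for n in names if fnmatchcase(n, "data?.dat")]

# "*" matches zero or more characters, so the result set is wider.
multi = [n for n in names if fnmatchcase(n, "data*.dat")]

print(single)  # ['data1.dat', 'dataN.dat']
print(multi)   # all five names, including 'data.dat'
```

Because "*" may match an empty run of characters, data.dat is matched by data*.dat but not by data?.dat.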

As is known to those skilled in the art, a keyword is looked up by comparing it against every item in the source data one by one. The source data are usually a word index file stored on disk. Because disk operations are slow and the one-by-one comparison is cumbersome, the prior art cannot provide a higher retrieval speed. One improvement in the prior art first reads the data from the disk file into memory and then performs the character matching in memory. However, index files differ in size between retrieval applications, and reading a large index file into memory in its entirety occupies a considerable amount of memory. Moreover, the improvement only changes the environment in which matching takes place and still uses the cumbersome character matching algorithm of the prior art, so it does not noticeably increase the speed of character lookup.

Summary of the Invention

The object of the present invention is to provide a method for looking up character data that quickly returns character data similar to a keyword.

To solve the above technical problem, the object of the invention is achieved by the following technical solution:

1) extracting, from the character data in the source database, one-letter index entries and two-letter index entries (the first two letters of an item) and building an index from each, wherein a one-letter index entry corresponds to the character data that begin with that letter and whose first two characters are a combination of a letter and a non-letter character;

2) recording the storage address of the character data corresponding to each index entry;

3) loading the tuple data of the index into memory, wherein the tuple data comprise: the secondary index entry; the offset P, in the index file, of the first character-data item in the source data corresponding to the secondary index entry; and the difference Pc between that offset and the offset, in the index file, of the first character-data item corresponding to the next secondary index entry;

4) checking whether the index contains an index entry that matches the first two characters of the input keyword; if so, loading into memory the character data corresponding to that two-letter index entry; otherwise, loading into memory the character data corresponding to the index entry that matches the first character of the input keyword; and

5) traversing the character data in memory and outputting the data that match the keyword.

In the above method, step 3) further stores the index in a hash table. The character data in the source database are arranged in character order; step 2) then specifically records the address of the first character-data item corresponding to each index entry and additionally records the number of character-data items corresponding to that index entry. The non-letter characters referred to in the invention include the space character, digits, separators and the like.

As the above technical solution shows, the invention is a method for retrieving character data in which a secondary index is built when the index file is designed: the first two letters of every character-data item are used as a secondary index entry, and the storage location of the words corresponding to each secondary index entry is recorded. On each lookup, only the character data corresponding to the first two characters of the search keyword are loaded into memory, and matching then traverses only that set of character data. The invention therefore has the advantage of fast word lookup.

Further, the invention manages the secondary index information with a hash table: each time the program runs, the secondary index entries and the storage locations of the character data under them are read into the hash table for later lookups. Each lookup of a source pattern only needs to analyze the keyword entered by the user, extract the secondary index key, and look it up in the hash table to find the offset of the corresponding index entry; the block of words starting at that offset is then read into memory for searching and matching. This further reduces memory usage and also improves matching speed.

Brief Description of the Drawings

FIG. 1 is a structural diagram of the secondary index;

FIG. 2 is a flowchart of an embodiment of the invention;

FIG. 3 is a flowchart of matching the target pattern against the source pattern.

Detailed Description

The present invention relates to a method for retrieving character data. It involves the following concepts.

Source data: all of the data being searched. In an electronic dictionary, for example, the source data are the set of all words or phrases, which usually exist as a disk file called the word index.

Source pattern: the string that the user enters in order to search is called the source pattern (Source Pattern); it is also the search keyword, and it may contain special strings. The strings in the data set that are compared against it are called the target pattern (Target Pattern).

Target pattern: the set of words similar to the string entered by the user is called the target pattern. It can be understood as a small subset of the source data, selected according to the user's input.

Specific embodiments of the invention are described below. The core of the preferred embodiment is as follows. A secondary index, stored as a secondary index file, is built over the source data: the first two letters of each character-data item in the source database are extracted and used as index entries, and one-letter index entries are also created, each corresponding to the character data that begin with that letter and whose first two characters are a combination of a letter and a non-letter character. In memory, a hash table holds the secondary index file, that is, the correspondence between each index entry and its character data. The index is then searched for an entry matching the first two characters of the input keyword: if one exists, the character data corresponding to that two-letter index entry are loaded into memory; otherwise the character data corresponding to the index entry matching the first character of the input keyword are loaded. Finally, the character data in memory are traversed and the data matching the keyword are output.

As stated above, one core element of the invention is building the secondary index. FIG. 1 is a structural diagram of the secondary index of the invention; the construction of the secondary index is described with reference to this figure.

The secondary index of the invention uses the first two letters of every character-data item in the source data as a secondary index entry. One-letter index entries are also created; a one-letter index entry corresponds to the character data that begin with that letter and whose first two characters are a combination of a letter and a non-letter character. The following information is recorded: the secondary index entry; the offset (P), in the index file, of the first character-data item in the source data corresponding to the secondary index entry; and the difference (Pc) between that offset and the offset, in the index file, of the first character-data item corresponding to the next secondary index entry.

The secondary index entry, the offset (P) and the offset difference (Pc) together form a tuple, which is prepended to the index file. The structure of a tuple is (secondary index entry, P, Pc).

As shown in the figure, the tuple for the secondary index entry ab is ab:323:745. The first word beginning with ab is abacus, whose offset in the index file is 323; the first word beginning with ac is academic, whose offset in the index file is 1068. The difference is 745, which means that 745 words begin with ab. As FIG. 1 further shows, the tuple for the secondary index entry a is a:0:323. The source data corresponding to this entry are the phrases or character combinations whose first letter is a. In the figure, the first such item is "a case in point", whose offset in the index file is 0, and the last item under index entry a is "A.M.", whose offset is 322, so index entry a corresponds to 323 character-data items.
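The tuple construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: offsets are taken to be positions in an in-memory list, the word list is assumed to be sorted so that items sharing a key are adjacent, and the names index_key and build_secondary_index are hypothetical.

```python
def index_key(word):
    """Two-letter key if the first two characters are letters, else the first letter."""
    head = word[:2].lower()
    if len(head) == 2 and all("a" <= c <= "z" for c in head):
        return head
    return head[:1]


def build_secondary_index(words):
    """Return (key, P, Pc) tuples for a list sorted so that equal keys are adjacent."""
    starts = []  # (key, P) for the first word of each key, in file order
    for offset, word in enumerate(words):
        key = index_key(word)
        if not starts or starts[-1][0] != key:
            starts.append((key, offset))
    tuples = []
    for i, (key, p) in enumerate(starts):
        # Pc is the distance to the first word of the next key (or to the end).
        next_p = starts[i + 1][1] if i + 1 < len(starts) else len(words)
        tuples.append((key, p, next_p - p))
    return tuples


# Hypothetical miniature word list standing in for the index file.
words = ["a case in point", "a.m.", "abacus", "able", "academic", "acid"]
for entry in build_secondary_index(words):
    print(entry)
# ('a', 0, 2)
# ('ab', 2, 2)
# ('ac', 4, 2)
```

As in the ab:323:745 example, each Pc equals the number of words filed under the key, because it is the gap between consecutive P values.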

FIG. 2 is a flowchart of an embodiment of the invention; the implementation is described with reference to this figure.

Step 21: On the basis of the existing index file of the source data, create the secondary index, record each secondary index entry with its offset and related values in a tuple, and prepend the tuples of all secondary index entries to the index file. The secondary index is built as described above.

Step 22: After the program starts, load all of the tuple data at the front of the index file into memory and store them in a hash table.

A hash table is a data structure whose basic idea is to design, from the characteristics of the data to be searched, a function that takes the record key as its argument; the value the function produces for a key is interpreted as the address to look up. Specifically, a hash table is accessed directly by key value: a record is reached by mapping its key to a position in the table, which speeds up lookup. The mapping function is called the hash function, for example F(a) = b, where F() is the hash function, a is the key and b is the return value. Compared with lookup in tree structures, list structures and similar data structures, retrieval in a hash table is more efficient.

In this embodiment the secondary index is kept in a hash table, and the data are read into the hash table each time the program starts. Each lookup of a source pattern only needs to analyze the word entered by the user, extract the secondary index key, and look it up in the hash table to find the offset of the corresponding index entry; the block of words starting at that offset is then read into memory for searching and matching. This reduces memory usage and also improves matching speed.

Step 23: Obtain the keyword entered by the user.

Step 24: Take the first two characters of the keyword string as the key for the hash table lookup; if the string is a single character, use that character as the key.

Step 25: Using the key, obtain through the hash function the offset (P) of the secondary index entry and the offset difference (Pc) between it and the next secondary index entry.

If the two characters obtained in step 24 are a two-letter combination or a single letter, look up the matching secondary index entry in the hash table. For example, if the keyword is abacus, return the offset in the index file of the character data corresponding to index entry ab, together with the offset difference.

If the two characters obtained in step 24 are a combination of a letter and a non-letter character, the lookup finds no matching secondary index entry in the hash table, and the offset (P) and offset difference (Pc) of the secondary index entry formed by the first letter of the keyword are returned. For example, if the keyword is A.M., the first two characters are "A."; no secondary index entry exists for this combination, so the offset and offset difference of the character data corresponding to index entry a are returned.

If the first two characters of the keyword contain the wildcard "?" or "*": these wildcards are normally combined with ordinary characters (such as a to z) and searched for as one string, where "?" stands for a single character and "*" stands for a string of characters. A wildcard in the first position normally has no practical meaning and is ignored. If the wildcard is in the second position, then by the definition of the wildcard the offsets and offset differences of all secondary index entries whose first letter is the first letter of the keyword are returned. For example, if the keyword is a?acus, the combination a? matches the secondary index entries a and ab through az, so the offsets and offset differences of all secondary index entries beginning with a are returned. The wildcard "*" is handled in the same way.

Step 26: Find in the index file the word at the returned offset and read the Pc words starting from that word into memory as one data set, called the target pattern.

Step 27: Traverse every word in the target pattern, compare each with the source pattern entered by the user, and return and display all matching data.
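Steps 22 to 26 can be sketched as follows. This is an illustrative sketch only: a Python dict stands in for the hash table, list slicing stands in for reading a block of words from the index file, and WORDS, TUPLES, candidate_keys and load_target_pattern are hypothetical names and data introduced here.

```python
import string

# Hypothetical word list and (key, P, Pc) tuples; offsets are list positions.
WORDS = ["a case in point", "a.m.", "abacus", "able", "academic", "acid", "bad"]
TUPLES = [("a", 0, 2), ("ab", 2, 2), ("ac", 4, 2), ("ba", 6, 1)]

# Step 22: load the tuples into a hash table (a dict in Python).
INDEX = {key: (p, pc) for key, p, pc in TUPLES}


def candidate_keys(keyword):
    """Steps 24-25: map the keyword's first two characters to index keys."""
    head = keyword[:2].lower()
    if not head or head[0] not in string.ascii_lowercase:
        return []  # empty keyword, leading wildcard or leading non-letter
    first = head[0]
    if len(head) == 1:
        return [first] if first in INDEX else []
    if head[1] in "?*":
        # Wildcard in second position: every entry that starts with the letter.
        return sorted(k for k in INDEX if k[0] == first)
    if head in INDEX:
        return [head]
    # No two-letter entry (e.g. letter + non-letter): use the one-letter entry.
    return [first] if first in INDEX else []


def load_target_pattern(keyword):
    """Step 26: collect the Pc words starting at offset P for each key."""
    block = []
    for key in candidate_keys(keyword):
        p, pc = INDEX[key]
        block.extend(WORDS[p:p + pc])
    return block


print(load_target_pattern("abacus"))  # ['abacus', 'able']
print(load_target_pattern("A.M."))    # ['a case in point', 'a.m.']
print(load_target_pattern("a?id"))    # the six words filed under a, ab and ac
```

Only the selected block is passed on to the comparison in step 27, which is where the memory saving described in the text comes from.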

The comparison between the source pattern and the target pattern in step 27 falls into the following three cases:

1) The source pattern entered by the user contains no wildcard. The character at each position of the source pattern is compared in order with the character at the same position of the target pattern; if all are identical, the match succeeds.

2) The source pattern contains "?". First check whether the source pattern and the target pattern have the same length; if they do, continue, otherwise return. If "?" is at the head of the source pattern, the query generally has no practical meaning and can be ignored. If "?" is at the tail of the source pattern (for example abacu?), the characters before "?" are compared one by one, in order, with the characters at the same positions of the target pattern, and the match succeeds if all are identical. If "?" is in the middle of the source pattern, the characters before "?" are compared in forward order and the characters after "?" in reverse order; the match succeeds if all are equal.

3) The source pattern contains "*". If "*" is in the middle of the source pattern (for example a*s), the letters at all positions before "*" are compared in order with the letters at the corresponding positions of the target pattern; if all are identical, the letters after "*" are then compared with the target pattern in reverse order, and the match succeeds if the corresponding positions are all identical. If "*" is at the tail of the source pattern (for example ab*), it suffices to compare the letters before "*" one by one with the letters at the same positions of the target pattern; the match succeeds if all are identical.
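The three cases above can be sketched as one comparison function. The sketch rests on several assumptions that the text does not state: the source pattern contains at most one wildcard, the no-wildcard case is read as exact equality, and for "*" the candidate must be long enough to hold the prefix and suffix without overlap. The name matches is hypothetical. Comparing the slice after the wildcard is equivalent to the reverse-order comparison described in the text.

```python
def matches(source, target):
    """Compare a source pattern (at most one wildcard) with one candidate word."""
    if "?" in source:
        # "?" stands for exactly one character, so the lengths must agree.
        if len(source) != len(target):
            return False
        i = source.index("?")
        if i == 0:
            return False  # a leading wildcard is ignored by the method
        return source[:i] == target[:i] and source[i + 1:] == target[i + 1:]
    if "*" in source:
        i = source.index("*")
        if i == 0:
            return False  # a leading wildcard is ignored by the method
        prefix, suffix = source[:i], source[i + 1:]
        # Assumption: prefix and suffix may not overlap inside the candidate.
        if len(target) < len(prefix) + len(suffix):
            return False
        return target.startswith(prefix) and target.endswith(suffix)
    # Assumption: without a wildcard, the strings must be identical.
    return source == target


# Hypothetical target pattern, as it might be loaded for the key "ab".
candidates = ["abacus", "able", "abs", "academic"]
print([w for w in candidates if matches("ab*s", w)])  # ['abacus', 'abs']
```

Checking the prefix first lets most candidates be rejected without looking at the characters after the wildcard.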

Following the above principles, and with reference to FIG. 3, the operation in step 27 of traversing the target pattern after it has been loaded into memory and comparing each item with the keyword entered by the user (the source pattern) is described further.

Step 31: Analyze the source pattern.

Step 32: Determine whether the source pattern contains a wildcard. If it contains none, go to step 33; if the wildcard is "?", go to step 35; if the wildcard is "*", go to step 310.

Step 33: Determine whether each character of the source pattern is identical to the character at the same position of the target pattern. If so, go to step 34; otherwise end this matching operation.

Step 34: Return the target pattern.

Step 35: Compare the lengths of the source pattern and the target pattern. If they are the same, go to step 36; otherwise end this matching operation.

Step 36: Determine the position of "?". If "?" is in the first position of the source pattern, end this matching operation; if "?" is in the last position, go to step 37; otherwise go to step 38.

Step 37: Compare, in forward order, each character before "?" in the source pattern with the character at the same position of the target pattern. If all are identical, go to step 34; otherwise end this matching operation.

Step 38: Compare, in forward order, each character before "?" in the source pattern with the character at the same position of the target pattern. If all are identical, go to step 39; otherwise end this matching operation.

Step 39: Compare, in reverse order, each character after "?" in the source pattern with the character at the corresponding position of the target pattern. If all are identical, go to step 34; otherwise end this matching operation.

Step 310: Determine the position of "*". If "*" is in the first position of the source pattern, end this matching operation; if "*" is in the last position, go to step 33; otherwise go to step 311.

Step 311: Compare, in forward order, each character before "*" in the source pattern with the character at the same position of the target pattern. If all are identical, go to step 312; otherwise end this matching operation.

Step 312: Compare, in reverse order, each character after "*" in the source pattern with the character at the corresponding position of the target pattern. If all are identical, go to step 34; otherwise end this matching operation.

The above is only one specific embodiment of matching the source pattern against the target pattern according to the matching principles of step 27. The execution order of some of the steps may be adjusted, which is not described further here.

The character data retrieval method provided by the invention has been described in detail above. Specific examples have been used to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (6)

  1. A character data retrieval method, characterized by: 1) extracting, from the character data in the source database, one-letter index entries and two-letter index entries (the first two letters of an item) and building an index from each, wherein a one-letter index entry corresponds to the character data that begin with that letter and whose first two characters are a combination of a letter and a non-letter character; 2) recording the storage address of the character data corresponding to each index entry; 3) loading the tuple data of the index into memory, wherein the tuple data comprise: the secondary index entry; the offset P, in the index file, of the first character-data item in the source data corresponding to the secondary index entry; and the difference Pc between that offset and the offset, in the index file, of the first character-data item corresponding to the next secondary index entry; 4) checking whether the index contains an index entry that matches the first two characters of the input keyword; if so, loading into memory the character data corresponding to that two-letter index entry; otherwise, loading into memory the character data corresponding to the index entry that matches the first character of the input keyword; and 5) traversing the character data in memory and outputting the data that match the keyword.
  2. The character data retrieval method of claim 1, characterized in that step 3) further stores the index in a hash table.
  3. The character data retrieval method of claim 1, characterized in that the character data in the source database are arranged in character order.
  4. The character data retrieval method of claim 3, characterized in that step 2) specifically records the address of the first character-data item corresponding to each determined index entry.
  5. The character data retrieval method of claim 4, characterized in that step 2) further comprises recording the number of character-data items corresponding to each determined index entry.
  6. The character data retrieval method of claim 1, characterized in that the non-letter characters include the space character, digits and separators.
CN 200510131812 2005-12-15 2005-12-15 Word data searching method CN100483409C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510131812 CN100483409C (en) 2005-12-15 2005-12-15 Word data searching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510131812 CN100483409C (en) 2005-12-15 2005-12-15 Word data searching method

Publications (2)

Publication Number Publication Date
CN1776688A (en) 2006-05-24
CN100483409C (en) 2009-04-29

Family

ID=36766180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510131812 CN100483409C (en) 2005-12-15 2005-12-15 Word data searching method

Country Status (1)

Country Link
CN (1) CN100483409C (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221481B (en) 2007-01-12 2011-08-24 华硕电脑股份有限公司 Handhold communication equipment and its prediction type input method
CN102402297B (en) * 2010-09-16 2015-04-22 北大方正集团有限公司 Component organization and input method of abbreviated character notation of seven-stringed plucked instrument
CN103514404A (en) * 2012-06-29 2014-01-15 网秦无限(北京)科技有限公司 Safety detection method and safety detection device
CN103970605A (en) * 2013-02-06 2014-08-06 珠海世纪鼎利通信科技股份有限公司 Low-performance terminal based data analysis method and device
CN103488709B (en) * 2013-09-09 2017-06-16 东软集团股份有限公司 An index that establish methods and systems, and retrieval system
CN105589862A (en) * 2014-10-21 2016-05-18 杭州华为企业通信技术有限公司 License plate data index structure building method, retrieval method and device
CN104778197A (en) * 2014-12-30 2015-07-15 北京锐安科技有限公司 Data searching method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1134567A (en) 1995-11-29 1996-10-30 陈肇雄 Morphological parsing algorithm for English-Chinese translation system
US6507846B1 (en) 1999-11-09 2003-01-14 Joint Technology Corporation Indexing databases for efficient relational querying
CN1617135A (en) 2003-11-10 2005-05-18 摩托罗拉公司 Method and system for providing two-way bilingual dictionary

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QHFY英汉机器翻译系统的词典设计. 陈圣信,包培文.中文信息学报,第6卷第3期. 1992
机器可读词典的快速查找技术. 张永奎.中文信息学报,第8卷第2期. 1993
英文词汇的近形检索和通配查找. 王海源,杨黎波.上海师范大学学报(自然科学版),第28卷第3期. 1999

Also Published As

Publication number Publication date Type
CN1776688A (en) 2006-05-24 application

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100083 HAIDIAN, BEIJING TO: 100085 HAIDIAN, BEIJING

C41 Transfer of patent application or patent right or utility model
ASS Succession or assignment of patent right

Owner name: BEIJING KINGSOFT OFFICE SOFTWARE CO., LTD.

Free format text: FORMER OWNER: BEIJING JINSHAN SOFTWARE CO., LTD.

Effective date: 20140313

C56 Change in the name or address of the patentee