WO2011066757A1 - Wubi input system and method - Google Patents

Wubi input system and method

Info

Publication number
WO2011066757A1
Authority
WO
WIPO (PCT)
Prior art keywords
code
input
digit code
words
core
Prior art date
Application number
PCT/CN2010/076479
Other languages
English (en)
French (fr)
Inventor
张靖
邓欣
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to SG2012039806A (patent SG181142A1, en)
Priority to RU2012126667/08A (patent RU2510524C2, ru)
Priority to BR112012013166A (patent BR112012013166A2, pt)
Publication of WO2011066757A1 (patent WO2011066757A1, zh)
Priority to US13/480,323 (patent US20120242516A1, en)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233 Character input methods
    • G06F3/0237 Character input methods using prediction or retrieval techniques

Definitions

  • the present invention relates to input methods, and more particularly to a Wubi input system and method.
  • the Wubi Zixing input method (Wubi for short) is a Chinese character input method invented by Professor Wang Yongmin that encodes characters according to their shape. It is one of the most commonly used Chinese input methods in China and some Southeast Asian countries.
  • the basic principle of Wubi: Chinese characters are all composed of strokes or radicals.
  • a root can be a radical of a Chinese character, part of a radical, or even a single stroke. Once extracted, the roots are classified according to certain rules and then distributed on the keyboard according to scientific principles, as the basic units for inputting Chinese characters.
  • with its low duplicate-code rate, the Wubi input method can quickly find the entry the user expects and, for users familiar with it, greatly improves input speed. However, it requires the user to decompose characters and words skillfully, and three to four Wubi codes are generally needed to determine the desired word quickly. An unskilled user can only obtain a large number of candidate entries from one or two codes and must filter them to find the desired one, which reduces input speed.
  • a Wubi input system comprising: a cache lexicon, storing entry information and index information for common words with one-digit and two-digit codes; a core lexicon, storing entry information and index information for all Wubi codes; and a word-retrieval module which, when a one-digit or two-digit code is input, retrieves words from the cache lexicon according to the index information in the cache lexicon and displays them, and, when a three-digit or four-digit code is input, retrieves words from the core lexicon according to the index information in the core lexicon and displays them.
  • the cache lexicon includes: a cache code index area, storing index information for common words; and a cache entry storage area, storing entry information for common words, the common words being indexed by the first two codes of their Wubi code and stored in descending order of word frequency.
  • the core lexicon comprises: a core code index area, storing index information for the entry information of all Wubi codes; and a core entry storage area, storing the entry information of all Wubi codes, all entries being stored in order with the first three codes of their Wubi code as the index, and entries with the same first three codes being stored in descending order of word frequency.
  • the word-retrieval module comprises: an index calculation module, which obtains index information from the input Wubi code; and a candidate-word output module, which obtains and displays entries according to the index information.
  • the system further includes a judging module, configured to determine, according to the input one-digit or two-digit code, whether the cache lexicon contains the entry the user expects.
  • a Wubi input method is also provided.
  • a Wubi input method includes the following steps: receiving Wubi code input; when the input Wubi code is a one-digit or two-digit code, retrieving words from a cache lexicon that stores entry information and index information for common words with one-digit and two-digit codes; and when the input Wubi code is a three-digit or four-digit code, retrieving words from a core lexicon that stores entry information and index information for all Wubi codes.
  • after the step of retrieving words from the cache lexicon, the method further comprises: determining whether the cache lexicon contains the entry the user expects, and if it does not, retrieving words from the core lexicon.
  • the step of retrieving words from the cache lexicon is: indexing the words in the cache lexicon by the first two codes of their Wubi code and storing them in descending order of word frequency; converting the input Wubi code into index information; and then reading and displaying the entries in order according to the index information.
  • the step of retrieving words from the core lexicon is specifically: storing the words in the core lexicon in order with the first three codes of their Wubi code as the index, and storing entries with the same first three codes in descending order of word frequency; if the input Wubi code is a three-digit code, converting it into index information and displaying the entries obtained from that index information sorted in descending order of word frequency; if the input Wubi code is a four-digit code, filtering out, from the entries obtained for the first three codes, every entry whose fourth code does not match the fourth code input by the user, thereby obtaining all entries corresponding to the four-code input, and displaying them sorted in descending order of word frequency.
  • the step of retrieving words from the core lexicon further comprises: if the input Wubi code is a one-digit or two-digit code, converting it into index information, and then reading and displaying the entries obtained from that index information in the order in which they are stored in the core lexicon.
  • with the cache lexicon added, the cache lexicon can be searched first according to the user's input, so that common entries are displayed when the user inputs a one-digit or two-digit code; the hit rate for the user's expected entry increases without searching through a large number of entries, and Wubi input becomes faster.
  • because one-digit and two-digit codes are handled first and served from the cache lexicon, common entries are displayed as soon as the user inputs one or two codes, which raises the hit rate for the expected entry and speeds up Wubi input.
  • FIG. 1 is a schematic structural diagram of the Wubi input system of Embodiment 1;
  • FIG. 3 is a schematic structural diagram of the Wubi input system of Embodiment 2;
  • FIG. 1 is a schematic structural diagram of the Wubi input system of this embodiment.
  • the Wubi input system includes a word-retrieval module 100, a core lexicon 200, and a cache lexicon 300.
  • the core lexicon 200 stores entry information and index information for all Wubi codes.
  • the cache lexicon 300 stores entry information and index information for common words with one-digit and two-digit codes.
  • when a one-digit or two-digit code is input, the word-retrieval module 100 retrieves words from the cache lexicon 300 according to the index information in the cache lexicon 300; when a three-digit or four-digit code is input, the word-retrieval module 100 retrieves words from the core lexicon 200 according to the index information in the core lexicon 200.
  • the word-retrieval module 100 includes an index calculation module 110 and a candidate-word output module 120.
  • the index calculation module 110 converts the Wubi code input by the user into index information: a one-digit or two-digit code is converted into index information for retrieving words from the cache lexicon 300, and a three-digit or four-digit code into index information for retrieving words from the core lexicon 200.
  • the candidate-word output module 120 obtains entries according to the index information and displays them.
  • the core lexicon 200 includes a core code index area 210 and a core entry storage area 220.
  • the core code index area 210 stores index information for the entry information of all Wubi codes;
  • the core entry storage area 220 stores the entry information of all Wubi codes; all entries are stored in order with the first three codes of their Wubi code as the index, and entries with the same first three codes are stored in descending order of word frequency.
  • the cache lexicon 300 includes a cache code index area 310 and a cache entry storage area 320.
  • the cache code index area 310 stores index information for common words;
  • the cache entry storage area 320 stores entry information for common words; the common words are indexed by the first two codes of their Wubi code and stored in descending order of word frequency.
  • the core code index area 210 and the cache code index area 310 are each a contiguous array region; each array element occupies 4 bytes and records the starting position, in the core entry storage area 220 or the cache entry storage area 320, of the entries corresponding to a Wubi code.
  • the index information is the entry starting position stored in the array.
  • the index information stored in the core code index area 210 is the starting position at which entries are stored in the core entry storage area 220; the index information stored in the cache code index area 310 is the starting position at which entries are stored in the cache entry storage area 320.
  • the core entry storage area 220 and the cache entry storage area 320 hold the actual entry information, including each entry's Wubi code, Unicode text, word frequency, and other additional information.
  • the entry's Wubi code is compared with the user's input to determine whether it matches.
  • the Unicode text is used to display the entry.
  • the word frequency, which indicates how often the entry is used, can be predefined from statistics or updated in real time during use; entries with a higher word frequency are therefore very likely to be what the user expects.
  • Unicode is a text-encoding standard; as described here, each character is represented by two bytes, making it a fixed-length 2-byte multilingual character-set encoding. It is likewise prior art.
  • the corresponding Wubi input method includes the following steps:
  • S10: receive Wubi code input. Roots are distributed over the 25 keys a to y on the keyboard according to the established rules of the Wubi input method, and entering keyboard letters yields the entries formed by combining those roots.
  • the processing method of this embodiment accepts any combination of one to four letters from a to y input by the user.
  • S20: determine how many codes the Wubi input contains. If it is a one-digit or two-digit code, go to step S30; if it is a three-digit or four-digit code, go to step S50.
  • S30: retrieve words from the cache lexicon 300 and display them. This step handles one-digit and two-digit code input. Because the core lexicon 200 contains a large number of entries, the duplicate-code rate is high for one-digit or two-digit input. The cache lexicon 300 is therefore built to hold the more commonly used entries, and these entries are indexed by inputs of no more than two codes.
  • strCode denotes the code input by the user, with length 1 to 4, and Index denotes the resulting array subscript:
  • Index = (strCode[0] - 'a') * (25 + 1) + 1; if (code length >= 2) Index += (strCode[1] - 'a') + 1.
  • with this formula the array subscript in the cache code index area 310 can be obtained from the Wubi code, and from it the starting position in the cache entry storage area 320 of the entries corresponding to that code. The entries in the cache entry storage area 320 are indexed by two codes and sorted by word frequency.
  • the word-retrieval module 100 retrieves words from the cache lexicon 300 as follows:
  • the starting position of the entries is obtained from the array subscript corresponding to the one-digit or two-digit code, and the words are retrieved and displayed in the order in which the entries are stored.
  • for example, the entries stored under "aa" in the cache lexicon 300 are, in descending order of word frequency, the following ten: "式" (aa), "工作" (aawt), "工具" (aahw), "工程" (aatk), "工业" (aaog), "工艺" (aaan), "工资" (aauq), "工厂" (aadg), "工人" (aaww), and "工" (aaa). Retrieval then starts at the position where "式" is stored and reads the words from the cache lexicon 300 in sequence.
  • if three or more codes are input, the word-retrieval module 100 does not retrieve words from the cache lexicon 300.
  • S50: retrieve words from the core lexicon 200 and display them. This step handles three-digit and four-digit code input. When the user inputs a three-digit or four-digit code, the duplicate-code rate of the entries is already very low, so the core lexicon 200 can be indexed directly.
  • the subscript of each array element corresponds one-to-one with a Wubi code.
  • the correspondence between a Wubi code and an array subscript of the core code index area 210 can be established as follows:
  • strCode denotes the code input by the user, with length 1 to 4, and Index denotes the resulting array subscript:
  • Index = (strCode[0] - 'a') * (25² + 25 + 1) + 1; if (code length >= 2) Index += (strCode[1] - 'a') * (25 + 1) + 1; if (code length >= 3) Index += (strCode[2] - 'a') + 1.
  • the above ordering is a typical lexicographic order.
  • with this correspondence the array subscript in the core code index area 210 can be obtained from the Wubi code, and from it the starting position in the core entry storage area 220 of the entries corresponding to that code. (This is prior art.)
  • the word-retrieval module 100 retrieves words from the core lexicon 200 as follows: for three-code input, the entries sharing those first three codes are taken out in descending order of word frequency and displayed; for four-code input, entries whose fourth code does not match the fourth code input by the user are filtered out.
  • with the cache lexicon 300 added, the duplicate-code rate of one-digit and two-digit input is also reduced to a certain level, which raises the entry hit rate.
  • the probability of obtaining the expected entry from a two-digit code is high, or in other words the probability of having to retrieve from the core lexicon is low, so fast retrieval is achieved in most cases.
  • a judging module 400 is added to the foregoing embodiment, as shown in FIG. 3. After the user inputs a one-digit or two-digit code, it determines whether the cache lexicon 300 contains the entry the user expects: if the user keeps paging after reaching the last page of the cache lexicon 300, the cache lexicon 300 does not contain the expected entry.
  • step S40 is inserted between steps S30 and S50 to determine whether the cache lexicon 300 contains the entry the user expects. If not, go to step S50; if so, the entry is output according to the user's command, and retrieval ends.
  • when the cache lexicon 300 does not contain the entry the user expects, the entry is most likely a rare one, and the user can choose either to keep paging or to extend the input to a three-digit or four-digit code.
  • step S50 also handles one-digit and two-digit code input: because the entries are sorted first by their first-three-code index, the starting position of the entries is obtained from the array subscript corresponding to the one-digit or two-digit code, and the words are then retrieved and displayed in storage order. For example, for the input "aa", words are retrieved in the order "aaa", "aab", through "aay".
  • whichever the user chooses, the cache lexicon 300 does not contain the expected entry, so the core lexicon 200 must be indexed. If the entry is found, it is output according to the user's command, and retrieval ends.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Input From Keyboards Or The Like (AREA)
  • Telephone Function (AREA)

Description

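Wubi Input System and Method
Technical Field
The present invention relates to input methods, and in particular to a Wubi input system and method.
Background Art
The Wubi Zixing (five-stroke character-form) input method, Wubi for short, is a Chinese character input method invented by Professor Wang Yongmin that encodes characters according to their shape. It is currently one of the most widely used Chinese character input methods in China and some Southeast Asian countries.
The basic principle of Wubi is that Chinese characters are all composed of strokes or radicals. To input these characters, they are broken down into the most commonly used basic units, called roots. A root may be a radical of a character, part of a radical, or even a single stroke. Once the roots have been extracted, they are classified according to certain rules and then assigned to keys on the keyboard according to scientific principles, serving as the basic units for inputting Chinese characters. Wubi has 130 basic roots; together with variants of some of them there are about 200 in total, distributed over the 25 keys other than Z. To input a character, the user presses the keys corresponding to its roots in the order in which the character is written, forming a code; the system then uses the code formed by the input roots to retrieve the desired character from the lexicon of the Wubi input method.
Because of its low duplicate-code rate, Wubi can find the entry the user expects quickly, and for users familiar with it, it greatly increases input speed. However, it requires the user to decompose characters and words skillfully, and three to four Wubi codes are generally needed to determine the desired word quickly. An unskilled user can only obtain a large number of candidate entries from one or two codes and must sift through them to find the desired entry, which reduces input speed.
Technical Problem
In view of this, and to address the problem that the conventional Wubi input method has a high duplicate-code rate for one-code or two-code input, which slows input, it is necessary to provide a Wubi input system and method that can increase the user's input speed.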
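Technical Solution
A Wubi input system comprises: a cache lexicon, which stores entry information and index information for common words with one-digit and two-digit codes; a core lexicon, which stores entry information and index information for all Wubi codes; and a word-retrieval module which, when a one-digit or two-digit code is input, retrieves words from the cache lexicon according to the index information in the cache lexicon and displays them, and, when a three-digit or four-digit code is input, retrieves words from the core lexicon according to the index information in the core lexicon and displays them.
Preferably, the cache lexicon comprises: a cache code index area, which stores index information for common words; and a cache entry storage area, which stores entry information for common words, the common words being indexed by the first two codes of their Wubi code and stored in descending order of word frequency.
Preferably, the core lexicon comprises: a core code index area, which stores index information for the entry information of all Wubi codes; and a core entry storage area, which stores the entry information of all Wubi codes, all entries being stored in order with the first three codes of their Wubi code as the index, and entries with the same first three codes being stored in descending order of word frequency.
Preferably, the word-retrieval module comprises: an index calculation module, which obtains index information from the input Wubi code; and a candidate-word output module, which obtains and displays entries according to the index information.
Preferably, the system further comprises a judging module for determining, based on the input one-digit or two-digit code, whether the cache lexicon contains the entry the user expects.
A Wubi input method is also provided.
A Wubi input method comprises the following steps: receiving Wubi code input; when the input Wubi code is a one-digit or two-digit code, retrieving words from a cache lexicon that stores entry information and index information for common words with one-digit and two-digit codes; and when the input Wubi code is a three-digit or four-digit code, retrieving words from a core lexicon that stores entry information and index information for all Wubi codes.
Preferably, after the step of retrieving words from the cache lexicon, the method further comprises: determining whether the cache lexicon contains the entry the user expects, and if it does not, retrieving words from the core lexicon.
Preferably, the step of retrieving words from the cache lexicon is specifically: indexing the words in the cache lexicon by the first two codes of their Wubi code and storing them in descending order of word frequency; converting the input Wubi code into index information; and then reading and displaying the entries in order according to the index information.
Preferably, the step of retrieving words from the core lexicon is specifically: storing the words in the core lexicon in order with the first three codes of their Wubi code as the index, and storing entries with the same first three codes in descending order of word frequency; if the input Wubi code is a three-digit code, converting it into index information and displaying the entries obtained from that index information sorted in descending order of word frequency; if the input Wubi code is a four-digit code, filtering out, from the entries obtained for the first three codes, every entry whose fourth code does not match the fourth code input by the user, thereby obtaining all entries corresponding to the four-code input, and displaying them sorted in descending order of word frequency.
Preferably, the step of retrieving words from the core lexicon further comprises: if the input Wubi code is a one-digit or two-digit code, converting it into index information, and then reading and displaying the entries obtained from that index information in the order in which they are stored in the core lexicon.
Advantageous Effects
With the cache lexicon added, the cache lexicon can be searched first according to the user's input, so that common entries are displayed when the user inputs a one-digit or two-digit code. The hit rate for the user's expected entry increases without searching through a large number of entries, and Wubi input becomes faster.
Because one-digit and two-digit codes are handled first and served from the cache lexicon, common entries are displayed as soon as the user inputs one or two codes, which raises the hit rate for the expected entry without a search through a large number of entries and speeds up Wubi input.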
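Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the Wubi input system of Embodiment 1;
FIG. 2 is a flowchart of the Wubi input method of Embodiment 1;
FIG. 3 is a schematic structural diagram of the Wubi input system of Embodiment 2;
FIG. 4 is a flowchart of the Wubi input method of Embodiment 2.
Embodiments of the Invention
Embodiment 1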
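FIG. 1 is a schematic structural diagram of the Wubi input system of this embodiment. The system comprises a word-retrieval module 100, a core lexicon 200, and a cache lexicon 300. The core lexicon 200 stores entry information and index information for all Wubi codes; the cache lexicon 300 stores entry information and index information for common words with one-digit and two-digit codes. When a one-digit or two-digit code is input, the word-retrieval module 100 retrieves words from the cache lexicon 300 according to the index information in the cache lexicon 300; when a three-digit or four-digit code is input, it retrieves words from the core lexicon 200 according to the index information in the core lexicon 200.
The word-retrieval module 100 comprises an index calculation module 110 and a candidate-word output module 120. The index calculation module 110 converts the Wubi code input by the user into index information: a one-digit or two-digit code is converted into index information for retrieving words from the cache lexicon 300, and a three-digit or four-digit code into index information for retrieving words from the core lexicon 200. The candidate-word output module 120 obtains entries according to the index information and displays them.
The core lexicon 200 comprises a core code index area 210 and a core entry storage area 220. The core code index area 210 stores index information for the entry information of all Wubi codes. The core entry storage area 220 stores the entry information of all Wubi codes; all entries are stored in order with the first three codes of their Wubi code as the index, and entries with the same first three codes are stored in descending order of word frequency.
The cache lexicon 300 comprises a cache code index area 310 and a cache entry storage area 320. The cache code index area 310 stores index information for common words. The cache entry storage area 320 stores entry information for common words; the common words are indexed by the first two codes of their Wubi code and stored in descending order of word frequency.
In this embodiment, the core code index area 210 and the cache code index area 310 are each a contiguous array region. Each array element occupies 4 bytes and records the starting position, in the core entry storage area 220 or the cache entry storage area 320, of the entries corresponding to a Wubi code.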
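The contiguous array of 4-byte elements can be sketched in Python with the standard `struct` module. This is only an illustration under assumptions: the little-endian unsigned layout, the helper names `pack_index` and `start_of`, and the sample start positions are not taken from the source, which states only that each element occupies 4 bytes and records a starting position. The slot count of 650 is the size the description gives for the cache code index area.

```python
import struct

SLOTS = 25 + 25**2             # 650 subscripts, "a" through "yy" (cache index area)
ELEMENT = struct.Struct("<I")  # assumed layout: one unsigned 32-bit value = 4 bytes


def pack_index(starts):
    """Serialize start positions into one contiguous block of 4-byte elements."""
    return b"".join(ELEMENT.pack(pos) for pos in starts)


def start_of(index_area: bytes, subscript: int) -> int:
    """Read the start position stored at a 1-based subscript."""
    return ELEMENT.unpack_from(index_area, (subscript - 1) * ELEMENT.size)[0]


# Hypothetical data: pretend every code owns three consecutive storage slots.
area = pack_index(range(0, SLOTS * 3, 3))
```

Because every element has the same size, looking up a start position is a constant-time offset calculation rather than a search.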
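Index information thus means the entry starting position stored in the array. Accordingly, the index information stored in the core code index area 210 is the starting position at which entries are stored in the core entry storage area 220, and the index information stored in the cache code index area 310 is the starting position at which entries are stored in the cache entry storage area 320.
The core entry storage area 220 and the cache entry storage area 320 hold the actual entry information, including each entry's Wubi code, Unicode text, word frequency, and some other additional information. The entry's Wubi code is compared with the user's input to determine whether it matches; the Unicode text is used to display the entry; the word frequency, which indicates how often the entry is used, can be predefined from statistics or updated in real time as the user types, so an entry with a higher word frequency is very likely to be what the user expects. (Unicode is a text-encoding standard; as described here, each character is represented by two bytes, making it a fixed-length 2-byte multilingual character-set encoding. It is likewise prior art.)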
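As a rough illustration of the entry record described above, the following Python sketch models one stored entry. The class name `Entry`, its field names, the prefix-matching rule in `matches`, and the `bump` helper are assumptions made for this example; the source lists only the kinds of information stored and says that the code is compared with the user's input.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One lexicon entry (hypothetical layout; field names are assumptions)."""
    code: str        # full Wubi code of the entry, e.g. "aawt"
    text: str        # Unicode text shown to the user
    frequency: int   # usage frequency; may be preset or updated while typing

    def matches(self, typed: str) -> bool:
        # Assumed rule: an entry is a candidate when its code starts with the input.
        return self.code.startswith(typed)


def bump(entry: Entry) -> None:
    """Raise the word frequency after the user picks the entry (real-time update)."""
    entry.frequency += 1
```

For example, `Entry("aawt", "工作", 10).matches("aa")` is true, while the same entry does not match the input "ab".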
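The corresponding Wubi input method, shown in FIG. 2, comprises the following steps:
S10: Receive Wubi code input. Roots are distributed over the 25 keys a to y on the keyboard according to the established rules of the Wubi input method, and entering keyboard letters yields the entries formed by combining those roots. The processing method of this embodiment accepts any combination of one to four letters from a to y input by the user.
S20: Determine how many codes the Wubi input contains. If it is a one-digit or two-digit code, go to step S30; if it is a three-digit or four-digit code, go to step S50.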
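Steps S10 and S20 amount to validating the input and routing it by length. A minimal Python sketch follows; the function name `choose_lexicon` and the string labels it returns are assumptions for illustration.

```python
def choose_lexicon(code: str) -> str:
    """Return which lexicon step S20 routes the input to."""
    # S10: accept only one to four letters from a to y (z carries no roots).
    if not 1 <= len(code) <= 4 or any(c < "a" or c > "y" for c in code):
        raise ValueError("expected one to four letters from a to y")
    # S20: one or two codes go to step S30 (cache); three or four to step S50 (core).
    return "cache" if len(code) <= 2 else "core"
```

With this routing, `choose_lexicon("aa")` selects the cache lexicon and `choose_lexicon("fnt")` selects the core lexicon.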
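S30: Retrieve words from the cache lexicon 300 and display them. This step handles one-digit and two-digit code input. Because the core lexicon 200 contains a large number of entries, the duplicate-code rate is high for one-digit or two-digit input. The cache lexicon 300 is therefore built to hold the more commonly used entries, and these entries are indexed by inputs of no more than two codes.
In the cache lexicon 300, all entries are indexed by their first two codes, so the index range of the cache code index area 310 runs from "a" to "yy", and the array contains 25 + 25² = 650 elements.
A correspondence can therefore be established between a one-digit or two-digit Wubi code and an array subscript of the cache code index area 310. Let strCode denote the code input by the user, with length 1 to 4, and let Index denote the resulting array subscript; then:
Index = (strCode[0] - 'a') * (25 + 1) + 1;
if (code length >= 2) Index += (strCode[1] - 'a') + 1.
The results calculated from the above formula are as follows:
Code: a    Subscript: 1
Code: aa   Subscript: 2
Code: ab   Subscript: 3
……
Code: y    Subscript: 625
Code: ya   Subscript: 626
……
Code: yy   Subscript: 650
With the above formula, the array subscript in the cache code index area 310 can be obtained from the Wubi code, and from it the starting position in the cache entry storage area 320 of the entries corresponding to that code. The entries in the cache entry storage area 320 are indexed by two codes and sorted by word frequency.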
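The subscript formula for the cache code index area can be written as a short Python function. The function name `cache_index` and the input validation are assumptions added for the example; the arithmetic follows the formula given above.

```python
def cache_index(code: str) -> int:
    """1-based subscript in the cache code index area for a 1- or 2-letter code."""
    if len(code) not in (1, 2) or any(c < "a" or c > "y" for c in code):
        raise ValueError("expected one or two letters from a to y")
    # Each first letter owns a block of 26 subscripts: itself plus 25 second letters.
    index = (ord(code[0]) - ord("a")) * (25 + 1) + 1
    if len(code) >= 2:
        index += (ord(code[1]) - ord("a")) + 1
    return index
```

The block size of 26 is what makes the mapping one-to-one: every one-letter code is immediately followed by its 25 two-letter extensions, so the 650 codes fill subscripts 1 through 650 without gaps.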
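Accordingly, the word-retrieval module 100 retrieves words from the cache lexicon 300 as follows:
When the user inputs a one-digit or two-digit code, the starting position of the entries is obtained from the array subscript corresponding to that code, and the words are retrieved and displayed in the order in which the entries are stored.
For example, suppose the entries stored under "aa" in the cache lexicon 300 are, in descending order of word frequency, only the following ten: "式" (aa), "工作" (aawt), "工具" (aahw), "工程" (aatk), "工业" (aaog), "工艺" (aaan), "工资" (aauq), "工厂" (aadg), "工人" (aaww), and "工" (aaa). Retrieval then starts at the position where "式" is stored and reads the words from the cache lexicon 300 in sequence.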
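The "aa" example can be replayed with a small Python sketch. Two simplifications are assumptions of this example rather than part of the source: a dictionary stands in for the subscript array, and each index record also carries an entry count (the source stores only a starting position). The names `storage`, `index_area`, and `fetch_from_cache` are hypothetical.

```python
from typing import List, Tuple

# Entry storage area: (Wubi code, text) pairs, already ordered by word frequency.
storage: List[Tuple[str, str]] = [
    ("aa", "式"), ("aawt", "工作"), ("aahw", "工具"), ("aatk", "工程"),
    ("aaog", "工业"), ("aaan", "工艺"), ("aauq", "工资"), ("aadg", "工厂"),
    ("aaww", "工人"), ("aaa", "工"),
]

# Index area: typed code -> (start position, number of entries).
# A dict stands in for the subscript array to keep the sketch short.
index_area = {"aa": (0, 10)}


def fetch_from_cache(code: str) -> List[str]:
    """Read the entries for a typed code in stored (frequency) order."""
    start, count = index_area.get(code, (0, 0))
    return [text for _, text in storage[start:start + count]]
```

No sorting happens at lookup time: the order in which candidates are shown is simply the order in which they were stored.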
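If three or more codes are input, the word-retrieval module 100 does not retrieve words from the cache lexicon 300.
Wubi users rarely page beyond the second page when looking for a candidate. In this embodiment, therefore, each Wubi code index in the cache lexicon 300 preferably stores at most 10 entries, so the cache lexicon 300 stores at most 650 * 10 = 6500 entries.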
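One way to picture the cap of 10 entries per index is the following Python sketch, which groups entries by their first two codes and keeps the most frequent ones. The source does not say how common words are selected, so choosing purely by word frequency, the function name `build_cache`, and the tuple layout are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_PER_CODE = 10          # at most 10 entries per index, as stated in the text
INDEX_SLOTS = 25 + 25**2   # 650 one- and two-letter codes


def build_cache(entries: Iterable[Tuple[str, str, int]]) -> Dict[str, List[str]]:
    """Group (code, text, frequency) triples by two-letter prefix, keep the top 10."""
    buckets: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for code, text, freq in entries:
        buckets[code[:2]].append((freq, text))
    cache = {}
    for key, items in buckets.items():
        items.sort(key=lambda item: item[0], reverse=True)  # high frequency first
        cache[key] = [text for _, text in items[:MAX_PER_CODE]]
    return cache
```

The upper bound on cache size then follows directly as `INDEX_SLOTS * MAX_PER_CODE`, which is 6500.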
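S50: Retrieve words from the core lexicon 200 and display them. This step handles three-digit and four-digit code input. When the user inputs a three-digit or four-digit code, the duplicate-code rate of the entries is already very low, so the core lexicon 200 can be indexed directly.
In the core lexicon 200, all entries are indexed by their first three codes, so the index range of the core code index area 210 runs from "a" to "yyy", and the array contains 25 + 25² + 25³ = 16275 elements. The subscript of each array element corresponds one-to-one with a Wubi code.
For example, the correspondence between a Wubi code and an array subscript of the core code index area 210 can be established as follows:
Let strCode denote the code input by the user, with length 1 to 4, and let Index denote the resulting array subscript; then:
Index = (strCode[0] - 'a') * (25² + 25 + 1) + 1;
if (code length >= 2) Index += (strCode[1] - 'a') * (25 + 1) + 1;
if (code length >= 3) Index += (strCode[2] - 'a') + 1.
The results calculated from the above formula are as follows:
Code: a    Subscript: 1
Code: aa   Subscript: 2
Code: aaa  Subscript: 3
Code: aab  Subscript: 4
Code: aac  Subscript: 5
Code: aad  Subscript: 6
……
Code: y    Subscript: 15625
Code: ya   Subscript: 15626
……
Code: yad  Subscript: 15630
……
Code: yyy  Subscript: 16275
The above ordering is a typical lexicographic order. With this correspondence, the array subscript in the core code index area 210 can be obtained from the Wubi code, and from it the starting position in the core entry storage area 220 of the entries corresponding to that code. (This is prior art.)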
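The three-level subscript formula translates directly into Python. The function name `core_index` and the validation step are assumptions for this sketch; only the first three letters contribute, so a four-letter input maps to the same subscript as its three-letter prefix.

```python
def core_index(code: str) -> int:
    """1-based subscript in the core code index area (uses the first three letters)."""
    if not 1 <= len(code) <= 4 or any(c < "a" or c > "y" for c in code):
        raise ValueError("expected one to four letters from a to y")
    # A first letter owns 25*25 + 25 + 1 = 651 subscripts, a second letter owns 26.
    index = (ord(code[0]) - ord("a")) * (25**2 + 25 + 1) + 1
    if len(code) >= 2:
        index += (ord(code[1]) - ord("a")) * (25 + 1) + 1
    if len(code) >= 3:
        index += (ord(code[2]) - ord("a")) + 1
    return index
```

The numbering is a pre-order walk of a three-level, 25-way tree, which is why sorting the codes lexicographically yields the subscripts 1 through 16275 in order.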
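Accordingly, the word-retrieval module 100 retrieves words from the core lexicon 200 as follows:
When the user inputs three codes, the entries sharing those first three codes are sorted in descending order of word frequency, then taken out in sequence and displayed. For example, for the input "fnt", if "专利" (fntj) has a word frequency of 1000, "专长" (fnta) a word frequency of 500, and "专书" (fnnn) a word frequency of 200, then "专利", "专长", and "专书" are stored in that order in the core lexicon 200 and are simply taken out and displayed in that order.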
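The three-code lookup can be sketched in Python as follows. Several things here are assumptions of the sketch: a linear scan replaces the index-array lookup, sorting happens at query time instead of relying on pre-sorted storage, and the names `core_entries` and `fetch_three` are hypothetical. The sample data keeps the two entries from the "fnt" example whose printed codes begin with "fnt"; the third entry is left out because its printed code, "fnnn", does not.

```python
from typing import List, Tuple

# (Wubi code, text, word frequency); frequencies follow the "fnt" example.
core_entries: List[Tuple[str, str, int]] = [
    ("fnta", "专长", 500),
    ("fntj", "专利", 1000),
]


def fetch_three(code3: str, entries: List[Tuple[str, str, int]]) -> List[str]:
    """Entries sharing the first three letters, highest word frequency first."""
    hits = [e for e in entries if e[0][:3] == code3]
    hits.sort(key=lambda e: e[2], reverse=True)
    return [text for _, text, _ in hits]
```

In the stored layout described by the source, the sort is unnecessary at lookup time because entries with the same first three codes are already kept in frequency order.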
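When the user inputs a four-digit code, every entry among those obtained for the first three codes whose fourth code does not match the fourth code input by the user is filtered out; the remaining entries are all the entries corresponding to the four-code input.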
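The four-code filter is a single pass over the three-code candidates. In this Python sketch the function name `fetch_four` and the tuple layout are assumptions, as is the choice to drop candidates whose own code is shorter than four letters.

```python
from typing import List, Tuple


def fetch_four(code4: str, three_code_hits: List[Tuple[str, str, int]]) -> List[str]:
    """Drop candidates whose fourth letter differs from the one typed."""
    fourth = code4[3]
    kept = [e for e in three_code_hits if len(e[0]) >= 4 and e[0][3] == fourth]
    kept.sort(key=lambda e: e[2], reverse=True)  # highest word frequency first
    return [text for _, text, _ in kept]
```

For the candidates ("fntj", "专利", 1000) and ("fnta", "专长", 500), the input "fntj" keeps only "专利".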
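Embodiment 2
Wubi itself has a low duplicate-code rate, and adding the cache lexicon 300 also brings the duplicate-code rate of one-digit and two-digit input down to a certain level, raising the entry hit rate. In general, the probability of obtaining the expected entry from a two-digit code is high, or in other words the probability of having to retrieve from the core lexicon is low, so fast retrieval is achieved in most cases. Users, however, cannot memorize which words are in the cache lexicon 300 and which are not, so it can still happen that after entering a two-digit code the user pages to the last page without finding the expected entry. Under the processing method of the embodiment above, if the expected entry is not found in the cache lexicon 300, the user must either keep typing to form a three-digit or four-digit code and retrieve from the core lexicon 200, or abandon the retrieval. This embodiment therefore adds a judging module 400 to the embodiment above, as shown in FIG. 3. After the user inputs a one-digit or two-digit code, the module determines whether the cache lexicon 300 contains the entry the user expects: if the user keeps paging after reaching the last page of the cache lexicon 300, the cache lexicon 300 does not contain the expected entry.
Correspondingly, as shown in FIG. 4, step S40 is inserted between steps S30 and S50 of the embodiment above: determine whether the cache lexicon 300 contains the entry the user expects. If not, go to step S50; if so, output the entry according to the user's command, and retrieval ends.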
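The paging test in step S40 can be sketched in Python: candidate pages come from the cache until the user pages past its last page, at which point candidates come from the core lexicon. The function name `candidates_for_page`, the `page_size` parameter, and the callback `fetch_core` are assumptions for illustration; the source does not specify a page size or how core candidates are paged.

```python
from typing import Callable, List


def candidates_for_page(page: int, page_size: int, cache_hits: List[str],
                        fetch_core: Callable[[], List[str]]) -> List[str]:
    """Serve a 0-based candidate page; fall back to the core lexicon past the cache."""
    cache_pages = -(-len(cache_hits) // page_size)  # ceiling division
    if page < cache_pages:
        start = page * page_size
        return cache_hits[start:start + page_size]   # still inside the cache (S30)
    # Paging past the last cache page: the cache lacks the entry (S40 -> S50).
    core_hits = fetch_core()
    start = (page - cache_pages) * page_size
    return core_hits[start:start + page_size]
```

Passing the core lookup as a callback means the larger lexicon is only consulted when the cache has been exhausted.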
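Thus, when the user inputs a one-digit or two-digit code and the cache lexicon 300 does not contain the expected entry, the entry is most likely a rare one, and the user can choose either to keep paging or to extend the input to a three-digit or four-digit code.
If the user chooses to keep paging, retrieval must move to the core lexicon 200 because the cache lexicon 300 stores only a limited number of entries; that is, step S50 also handles one-digit and two-digit code input. When the user inputs a one-digit or two-digit code, because the entries are sorted first by their first-three-code index, the starting position of the entries is obtained from the array subscript corresponding to that code, and the words are then retrieved and displayed in storage order. For example, for the input "aa", words are retrieved and displayed in the order "aaa", "aab", through "aay".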
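Because the core subscripts follow lexicographic order, a one-letter or two-letter prefix covers one contiguous run of subscripts. The Python sketch below computes that run; the names `core_index` and `subscript_range` are assumptions, and the range includes the subscript of the prefix itself ahead of its three-letter extensions, which the source does not spell out.

```python
def core_index(code: str) -> int:
    """1-based subscript in the core code index area (first three letters only)."""
    index = (ord(code[0]) - ord("a")) * (25**2 + 25 + 1) + 1
    if len(code) >= 2:
        index += (ord(code[1]) - ord("a")) * (25 + 1) + 1
    if len(code) >= 3:
        index += (ord(code[2]) - ord("a")) + 1
    return index


def subscript_range(code: str) -> range:
    """Subscripts covered by a one- or two-letter prefix, in storage order."""
    # A one-letter prefix spans 651 subscripts, a two-letter prefix spans 26.
    subtree = {1: 25**2 + 25 + 1, 2: 25 + 1}[len(code)]
    first = core_index(code)
    return range(first, first + subtree)
```

For "aa" the run is subscripts 2 through 27, that is "aa" followed by "aaa" through "aay", so the entries can be read sequentially from the first starting position.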
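Whichever the user chooses, the cache lexicon 300 does not contain the expected entry, so the core lexicon 200 must be indexed. If the entry is found, it is output according to the user's command, and retrieval ends.
The embodiments described above represent only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make a number of variations and improvements without departing from the concept of the invention, all of which fall within its scope of protection. The scope of protection of this patent is therefore defined by the appended claims.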

Claims (10)

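  1. A Wubi input system, characterized by comprising:
    a cache lexicon, storing entry information and index information for common words with one-digit and two-digit codes;
    a core lexicon, storing entry information and index information for all Wubi codes;
    a word-retrieval module, configured to retrieve words from the cache lexicon according to the index information in the cache lexicon when a one-digit or two-digit code is input, and to retrieve words from the core lexicon according to the index information in the core lexicon when a three-digit or four-digit code is input.
  2. The Wubi input system according to claim 1, characterized in that the cache lexicon comprises:
    a cache code index area, storing index information for common words;
    a cache entry storage area, storing entry information for common words, the common words being indexed by the first two codes of their Wubi code and stored in descending order of word frequency.
  3. The Wubi input system according to claim 1 or 2, characterized in that the core lexicon comprises:
    a core code index area, storing index information for the entry information of all Wubi codes;
    a core entry storage area, storing the entry information of all Wubi codes, all entries being stored in order with the first three codes of their Wubi code as the index, and entries with the same first three codes being stored in descending order of word frequency.
  4. The Wubi input system according to claim 1 or 2, characterized in that the word-retrieval module comprises:
    an index calculation module, which obtains index information from the input Wubi code;
    a candidate-word output module, which obtains and displays entries according to the index information.
  5. The Wubi input system according to claim 1, characterized by further comprising a judging module, the judging module being configured to determine, based on the input one-digit or two-digit code, whether the cache lexicon contains the entry the user expects.
  6. A Wubi input method, comprising the following steps:
    receiving Wubi code input;
    when the input Wubi code is a one-digit or two-digit code, retrieving words from a cache lexicon that stores entry information and index information for common words with one-digit and two-digit codes;
    when the input Wubi code is a three-digit or four-digit code, retrieving words from a core lexicon that stores entry information and index information for all Wubi codes.
  7. The Wubi input method according to claim 6, characterized in that, after the step of retrieving words from the cache lexicon, the method further comprises: determining whether the cache lexicon contains the entry the user expects, and if the cache lexicon does not contain the entry the user expects, retrieving words from the core lexicon.
  8. The Wubi input method according to claim 6 or 7, characterized in that the step of retrieving words from the cache lexicon is specifically: indexing the words in the cache lexicon by the first two codes of their Wubi code and storing them in descending order of word frequency, converting the input Wubi code into index information, and then reading and displaying the entries in order according to the index information.
  9. The Wubi input method according to claim 6 or 7, characterized in that the step of retrieving words from the core lexicon is specifically: storing the words in the core lexicon in order with the first three codes of their Wubi code as the index, and storing entries with the same first three codes in descending order of word frequency,
    if the input Wubi code is a three-digit code, converting the three-digit code into index information, and then displaying the entries obtained from the index information sorted in descending order of word frequency;
    if the input Wubi code is a four-digit code, filtering out, from the entries obtained for the three-code input, all entries whose fourth code does not match the fourth code input by the user, obtaining all entries corresponding to the four-code input, and displaying the obtained entries sorted in descending order of word frequency.
  10. The Wubi input method according to claim 9, characterized in that the step of retrieving words from the core lexicon further comprises: if the input Wubi code is a one-digit or two-digit code, converting the one-digit or two-digit code into index information, and then reading and displaying the entries obtained from the index information in the order in which the entries are stored in the core lexicon.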
PCT/CN2010/076479 2009-12-02 2010-08-31 Wubi input system and method WO2011066757A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
SG2012039806A SG181142A1 (en) 2009-12-02 2010-08-31 Five strokes input system and method
RU2012126667/08A RU2510524C2 (ru) 2009-12-02 2010-08-31 СИСТЕМА И СПОСОБ ВВОДА WuBi
BR112012013166A BR112012013166A2 (pt) 2009-12-02 2010-08-31 método e sistema de entrada de wubi
US13/480,323 US20120242516A1 (en) 2009-12-02 2012-05-24 Wubi input system and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200910194363.2A CN101739142B (zh) 2009-12-02 2009-12-02 Wubi input system and method
CN200910194363.2 2009-12-02

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/480,323 Continuation US20120242516A1 (en) 2009-12-02 2012-05-24 Wubi input system and method

Publications (1)

Publication Number Publication Date
WO2011066757A1 true WO2011066757A1 (zh) 2011-06-09

Family

ID=42462695

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/076479 WO2011066757A1 (zh) 2009-12-02 2010-08-31 Wubi input system and method

Country Status (6)

Country Link
US (1) US20120242516A1 (zh)
CN (1) CN101739142B (zh)
BR (1) BR112012013166A2 (zh)
RU (1) RU2510524C2 (zh)
SG (1) SG181142A1 (zh)
WO (1) WO2011066757A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739142B (zh) * 2009-12-02 2015-01-14 深圳市世纪光速信息技术有限公司 五笔输入系统及方法
CN102314334A (zh) * 2010-06-30 2012-01-11 百度在线网络技术(北京)有限公司 一种用于缓存用户对应用程序输入的内容的方法和设备
CN102467248B (zh) * 2010-11-10 2016-06-08 深圳市世纪光速信息技术有限公司 减少五笔输入法中无意义词自动上屏显示的方法
CN105549758A (zh) * 2015-12-23 2016-05-04 天津天地伟业数码科技有限公司 一种嵌入式录像设备的汉字五笔输入方法
US10217030B2 (en) * 2017-06-14 2019-02-26 International Business Machines Corporation Hieroglyphic feature-based data processing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1217500A (zh) * 1998-11-03 1999-05-26 杨建伟 形音码输入法
CN1218218A (zh) * 1998-02-13 1999-06-02 邱国权 汉字部首笔顺输入码
CN1236914A (zh) * 1999-01-01 1999-12-01 钟明华 中文词组输入法
CN1447209A (zh) * 2002-03-25 2003-10-08 朱庆光 手机双笔数码汉字输入法
CN101739142A (zh) * 2009-12-02 2010-06-16 腾讯科技(深圳)有限公司 五笔输入系统及方法

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1039666C (zh) * 1993-11-06 1998-09-02 黄飞梦 基于两笔形与两笔符的汉字输入方法及键盘
US6970599B2 (en) * 2002-07-25 2005-11-29 America Online, Inc. Chinese character handwriting recognition system
US7165021B2 (en) * 2001-06-13 2007-01-16 Fujitsu Limited Chinese language input system
US6847311B2 (en) * 2002-03-28 2005-01-25 Motorola Inc. Method and apparatus for character entry in a wireless communication device
JP4558482B2 (ja) * 2002-06-05 2010-10-06 ス、ロンビン 各国語文字情報の最適化デジタル操作的コード化及び入力の方法、そして、その情報処理システム
WO2004066600A1 (en) * 2003-01-22 2004-08-05 Min-Kyum Kim Apparatus and method for inputting alphabet characters
US7088861B2 (en) * 2003-09-16 2006-08-08 America Online, Inc. System and method for chinese input using a joystick
US7756337B2 (en) * 2004-01-14 2010-07-13 International Business Machines Corporation Method and apparatus for reducing reference character dictionary comparisons during handwriting recognition
US20060018545A1 (en) * 2004-07-23 2006-01-26 Lu Zhang User interface and database structure for Chinese phrasal stroke and phonetic text input
TWI273450B (en) * 2005-07-12 2007-02-11 Asustek Comp Inc Method and apparatus for searching data
US9104244B2 (en) * 2009-06-05 2015-08-11 Yahoo! Inc. All-in-one Chinese character input method
US8896470B2 (en) * 2009-07-10 2014-11-25 Blackberry Limited System and method for disambiguation of stroke input

Also Published As

Publication number Publication date
CN101739142B (zh) 2015-01-14
BR112012013166A2 (pt) 2016-03-01
SG181142A1 (en) 2012-07-30
RU2012126667A (ru) 2014-01-10
CN101739142A (zh) 2010-06-16
RU2510524C2 (ru) 2014-03-27
US20120242516A1 (en) 2012-09-27

Similar Documents

Publication Publication Date Title
WO2011066757A1 (zh) 五笔输入系统及方法
WO2004109492A1 (fr) Procede et appareil de traitement et de representation d'objets
CN1282901A (zh) 利用数字键输入中文地址的方法
JPH0122660B2 (zh)
CN101539433A (zh) 导航系统中拼音首字母加声调检索的方法及装置
WO2006074586A1 (fr) Technologie d'extraction de chaines de caracteres marques de bits
TW200947241A (en) Database indexing algorithm and method and system for database searching using the same
JPH056398A (ja) 文書登録装置及び文書検索装置
JPH0991303A (ja) データ管理装置
JPS6217794B2 (zh)
EP1522027A2 (en) Method and system of creating and using chinese language data and user-corrected data
JPH08339376A (ja) 外国語検索装置及び情報検索システム
TW200846946A (en) A method for performing full text searching in files containing 4-byte characters
KR860000681B1 (ko) 한글/한자 워드프로 세서
Xiang A Brief History of the Chinese Language III: From Middle Chinese to Modern Chinese Phonetic System
TWI230341B (en) Kanji searching method using codes
CN1048346C (zh) 词典检索装置
WO2018228101A1 (zh) 基于汉语含义的汉语编码方法及系统和介质设备
JPS60168233A (ja) 単語辞書装置
JPS63276630A (ja) 称呼類似検索用デ−タベ−スシステム
JPS61267828A (ja) 情報登録検索装置
CN1165996A (zh) 中华随意汉字输入法及其键盘
JPH03232063A (ja) 電子辞書の検索方法
JPH06187371A (ja) 圧縮地名データの格納方法及び読み出し方法
JP2001034606A (ja) 中国語入力装置及び中国語入力方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10834190

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 1201002548

Country of ref document: TH

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2012126667

Country of ref document: RU

REG Reference to national code

Ref country code: BR

Ref legal event code: B01A

Ref document number: 112012013166

Country of ref document: BR

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/11/2012)

122 Ep: pct application non-entry in european phase

Ref document number: 10834190

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 112012013166

Country of ref document: BR

Kind code of ref document: A2

Effective date: 20120531