WO2008028421A1 - Method for obtaining new encode character string, inputting method system and word base generation device - Google Patents

Publication number
WO2008028421A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
user
input
encoded
word frequency
Prior art date
Application number
PCT/CN2007/070518
Other languages
French (fr)
Chinese (zh)
Inventor
Qi Guo
Zijian Tong
Lei Yang
Original Assignee
Beijing Sogou Technology Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co., Ltd.
Publication of WO2008028421A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding

Definitions

  • The present invention relates to the field of input methods, and in particular to a method for obtaining a new encoded character string for an input method word, an input method system, and a thesaurus generating device.
  • Existing input method systems, for example Chinese, Japanese, and Korean input method systems, match the words the user wants based on the encoded string the user enters.
  • A corresponding encoded string is set for each word, and the user can obtain the desired word only by entering the correct encoded string.
  • Users have to learn the correct encoded strings, and it is difficult to ensure that every correspondence between encoded strings and words that a user has in mind is correct. Existing input method systems therefore add some fault tolerance, which accommodates some users.
  • The fuzzy-sound solution solves some of the problems caused by differences between northern and southern speech. However, each region has its own dialect (Chinese in particular has a large number of dialects), so users who enter words by phonetic code will, to a greater or lesser degree, sometimes enter inaccurate codes, and the fuzzy-sound solution does not solve all of these problems.
  • The technical problem to be solved by the present invention is to provide a method and device for obtaining new encoded strings for input method words.
  • The method and device can obtain the new encoded strings used by each user and aggregate them to generate a thesaurus, thereby accommodating users' habitual new encoded strings and improving the hit rate of the preferred word.
  • Another object of the present invention is to provide an input method system that can automatically acquire, in a simple, convenient, timely, and effective manner, the encoded strings a user uses for certain words, and obtain by comparison the new encoded strings used by each user.
  • Another object of the present invention is to provide a thesaurus generating apparatus capable of efficiently providing a relatively complete full thesaurus, or a new thesaurus, that includes new encoded strings suited to users' input habits.
  • The present invention provides a method for obtaining a new encoded character string of an input method word, comprising: extracting, during input, the word selected by the user and the encoded string entered by the user; comparing the selected word and the entered encoded string with an existing thesaurus, wherein the existing thesaurus stores existing words and their corresponding encoded strings; and determining, according to a preset rule, the new encoded string corresponding to the word.
  • The method further includes: recording the word selected by the user and the encoded string entered by the user in a user vocabulary; and, during user input, recording the user word frequency in the user vocabulary, wherein the user word frequency is frequency information on the user entering the word with its corresponding encoded string.
  • The method further includes: correcting the counted word frequency information with a weight corresponding to the application in which the user is currently typing, to obtain the user word frequency.
  • The method further includes: collecting the word records of each user that have a new encoded string, each record including the word, the corresponding new encoded string, and the corresponding word frequency information; and removing duplicate word records.
  • The method further includes: calculating the user cumulative word frequency; and removing encoded strings whose user cumulative word frequency is less than or equal to a preset threshold.
  • The method further includes: counting the number of times the words in the filtered word records appear in a preset Internet page database, to obtain the Internet word frequency.
  • The method further includes: comparing the user cumulative word frequency of the word's new encoded string with the user cumulative word frequency of its original encoded string and, according to the comparison result, allocating the Internet word frequency among the two or more encoded strings corresponding to the word.
  • The method further includes: generating a new thesaurus from the filtered word records, or adding the filtered word records to the original thesaurus, to obtain a new thesaurus or a new version of the full thesaurus.
  • The collected information further includes the user's region information, and users are divided into several regions; the filtering steps are performed for each region; and a new regional vocabulary or a new version of the regional full vocabulary is generated for each region.
  • The preset Internet page database is obtained by the following steps: assigning weights to Internet pages; and storing the Internet pages whose weight value is greater than or equal to a preset threshold in the Internet page database.
  • The collecting is: the input method computing device sends the user's word records that have new encoded strings to the collection computing device in real time or at scheduled times.
  • The present invention also provides a method for obtaining a new encoded character string of an input method word, comprising: extracting, during input, the word selected by the user and the encoded string entered by the user, and storing them in a user vocabulary; collecting the user vocabulary of each user; comparing the collected user vocabularies with the system vocabulary of the input method, the system vocabulary storing words and their corresponding encoded strings; and determining, according to a preset rule, the new encoded string corresponding to the word.
  • The method further includes: the user vocabulary further includes a user word frequency, wherein the user word frequency is frequency information on the user entering the word with its corresponding encoded string; calculating the user cumulative word frequency; and removing encoded strings whose user cumulative word frequency is less than or equal to a preset threshold.
  • The preset rule is:
  • the encoded string entered by the user is taken as a new encoded string corresponding to the word; or, further, the user cumulative word frequency is compared with the system word frequency of the word's corresponding encoded string, the system word frequency being the word frequency information preset for the existing word in the existing thesaurus, and if the ratio of the user cumulative word frequency to the system word frequency is greater than or equal to a predetermined threshold, the encoded string entered by the user is determined to be a new encoded string corresponding to the word.
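The ratio form of the preset rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function name `is_new_encoded_string`, the default threshold of 0.2, and the handling of a zero system word frequency are all hypothetical; the text specifies only that the ratio is compared against a predetermined threshold.

```python
def is_new_encoded_string(user_cumulative_freq, system_freq, ratio_threshold=0.2):
    """Return True when the user-entered code qualifies as a new encoded string.

    user_cumulative_freq: cumulative frequency with which users typed this code
                          for the word.
    system_freq:          word frequency preset for the word in the existing
                          thesaurus.
    ratio_threshold:      hypothetical default; the source leaves the value open.
    """
    if system_freq <= 0:
        # Assumption: with no usable system frequency, any observed use qualifies.
        return user_cumulative_freq > 0
    return user_cumulative_freq / system_freq >= ratio_threshold


accepted = is_new_encoded_string(30, 100)   # ratio 0.30 meets the threshold
rejected = is_new_encoded_string(10, 100)   # ratio 0.10 falls below it
```

A ratio test of this kind scales the acceptance bar with how common the word already is, so a rarely used word does not need as many observations as a very common one.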
  • The method further includes: counting the number of times the words that have new encoded strings appear in a preset Internet page database, to obtain the Internet word frequency.
  • The method further includes: comparing the user cumulative word frequency of the word's new encoded string with the user cumulative word frequency of its original encoded string and, according to the comparison result, allocating the Internet word frequency among the two or more encoded strings corresponding to the word.
  • The invention also provides an input method system comprising an input interface unit, a display unit, and a system vocabulary, and further comprising: a word extraction unit, connected to the input method system, for extracting the word selected by the user and the encoded string entered by the user during input; and a word matching unit, connected to the word extraction unit, for comparing the word selected by the user and the encoded string entered by the user with the system vocabulary, the system vocabulary storing words and their corresponding encoded strings, and for determining, according to a preset rule, the new encoded string corresponding to the word.
  • The input interface unit, the display unit, and the system vocabulary of the input method system are located in the same computing device; or the input interface unit and the display unit are located in a first computing device and the system vocabulary is located in a second computing device, in which case the input method system obtains the corresponding information from the second computing device according to what the user enters and displays the corresponding characters on the first computing device.
  • The input method system further includes: a communication unit for sending word records that have a new encoded string, in real time or at scheduled times, each word record including the word and its corresponding new encoded string.
  • The input method system further includes: a word frequency recording unit, connected to the input method system, for recording the user word frequency during user input, wherein the user word frequency is frequency information on the user entering the word with its corresponding encoded string; and a user vocabulary for storing the words selected by the user, the encoded strings entered by the user, and the corresponding user word frequencies.
  • The input method system further includes: an application determining unit for determining the application in which the user is currently typing and sending the result to the word frequency recording unit; the word frequency recording unit, connected to the input method system, corrects the counted word frequency information during user input with a weight corresponding to that application, to obtain the user word frequency.
  • The present invention also provides a thesaurus generating apparatus, comprising: a word collecting unit for collecting the word records of each user that have a new encoded string, each word record including the word and its corresponding new encoded string; a first filtering unit for removing duplicate word records; and a thesaurus generating unit for generating a new thesaurus from the filtered word records, or adding the filtered word records to the original thesaurus, to obtain a new thesaurus or a new version of the full thesaurus.
  • The device further includes: a word frequency collecting unit for collecting the user word frequency from users' input behavior, the user word frequency being frequency information on the user entering the word with its corresponding encoded string;
  • a cumulative word frequency calculation unit for calculating the user cumulative word frequency; and
  • a second filtering unit for removing encoded strings whose user cumulative word frequency is less than or equal to a preset threshold.
  • The device further includes: a statistical unit for counting the number of times the words in the filtered word records appear in a preset Internet page database, to obtain the Internet word frequency.
  • The device further includes: a word frequency allocation unit for comparing the user cumulative word frequency of the word's new encoded string with the user cumulative word frequency of its original encoded string and, according to the comparison result, allocating the Internet word frequency among the two or more encoded strings corresponding to the word.
  • The invention also provides a thesaurus generating device, comprising:
  • a collecting unit for collecting the input behavior information of each user, the input behavior information including the word selected by the user, the encoded string entered by the user, and the corresponding user word frequency, where the user word frequency is frequency information on the user entering the word with its corresponding encoded string;
  • a cumulative word frequency calculation unit for applying weight correction to each user's word frequency for a given word and encoded string, and calculating the overall user cumulative word frequency for that word and encoded string;
  • the thesaurus includes words, encoded strings, and the corresponding word frequency information.
  • The device further includes: a comparison unit for comparing the generated thesaurus with the existing thesaurus, wherein the existing thesaurus stores words, encoded strings, and their corresponding system word frequencies; and a determining unit for determining, according to a preset rule, the new encoded string corresponding to the word.
  • The device further includes: a filtering unit for removing encoded strings whose user cumulative word frequency is less than or equal to a preset threshold.
  • a statistical unit for counting the number of times a word that has a new encoded string appears in a preset Internet page database, to obtain the Internet word frequency;
  • a word frequency allocation unit for comparing the user cumulative word frequency of the word's new encoded string with the user cumulative word frequency of its original encoded string and, according to the comparison result, allocating the Internet word frequency among the two or more encoded strings corresponding to the word.
  • The device further includes: a comparison unit for comparing the generated thesaurus with the existing thesaurus, wherein the existing thesaurus stores words, encoded strings, and their corresponding system word frequencies; and a determining unit for determining expired words, an expired word being either a word that exists in the existing thesaurus but not in the generated thesaurus, or a word in the generated thesaurus whose user cumulative word frequency meets a preset condition.
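The two definitions of an expired word can be sketched as a set comparison. This is only an illustration: the function name `find_expired_words` is hypothetical, and the "preset condition" on the user cumulative word frequency is not specified in the text, so it is assumed here to mean "below a minimum value".

```python
def find_expired_words(generated, existing_words, min_cumulative_freq=1):
    """Return the set of expired words.

    generated:           dict mapping word -> user cumulative word frequency
                         (the thesaurus generated from collected input behavior).
    existing_words:      iterable of words in the existing thesaurus.
    min_cumulative_freq: assumed form of the "preset condition".
    """
    # Case 1: in the existing thesaurus but absent from the generated one.
    missing = {word for word in existing_words if word not in generated}
    # Case 2: in the generated thesaurus, but its cumulative frequency
    # meets the (assumed) condition of falling below the minimum.
    rarely_used = {word for word, freq in generated.items()
                   if freq < min_cumulative_freq}
    return missing | rarely_used


generated = {"alpha": 12, "beta": 0}
existing_words = ["alpha", "beta", "gamma"]
expired = find_expired_words(generated, existing_words)
```

Here `gamma` is expired because no user typed it, and `beta` because its cumulative frequency is below the assumed minimum.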
  • the present invention has the following advantages:
  • First, the present invention proposes a distributed architecture comprising multiple clients and a collector. On the user side, the words and encoded strings entered by the user are extracted and compared with the existing thesaurus to learn new encoded strings suited to the user's habits; the new encoded strings of each user and their corresponding words are then collected and aggregated.
  • Analysis and filtering then yield new encoded strings of general significance. The invention approaches the problem from the perspective of user input and can learn, promptly and comprehensively, the new encoded strings that users employ during input, including new encoded strings that reflect users' dialect habits as well as unanticipated but frequently used new encoded strings, which improves the accuracy of the preferred word.
  • Second, the present invention looks up the words that have new encoded strings in a selected Internet page database and counts their occurrences to obtain the Internet word frequency. The Internet word frequency is then corrected according to the distribution of user word frequency between the word's new and old encoded strings and assigned to those strings. This gives well-grounded word frequency results and avoids degrading the input efficiency and input experience of other users because of the habits of some users.
  • Third, the present invention can also be used to collect and count only the new encoded strings of users in a given region, to obtain the language habits or coding habits of users in that region, and thereby either provide input method systems with region-specific pronunciation or coding versions, or let the user set the region in which they are located in the input method system.
  • FIG. 1 is a flow chart showing the steps of a preferred embodiment of the method for obtaining a new encoded character string of an input method word;
  • FIG. 2 is a flow chart showing the steps of another method for obtaining a new encoded character string of an input method word;
  • FIG. 3 is a structural block diagram of the input method system;
  • FIG. 4 is a structural block diagram of the thesaurus generating apparatus;
  • FIG. 5 is a structural block diagram of a thesaurus generating apparatus for determining a new encoded character string;
  • FIG. 6 is a structural block diagram of a thesaurus generating apparatus for determining expired words.
  • Referring to FIG. 1, a flow chart of the steps of a preferred embodiment of the method for obtaining a new encoded character string of an input method word according to the present invention is shown, including the following steps:
  • Step 101: During user input, extract the word selected by the user and the encoded string entered by the user.
  • Step 101 can be completed by the input method system, which can extract the word selected by the user and the encoded string entered by the user in any feasible manner during user input.
  • The extracted information can be passed directly to step 102, or stored in the user vocabulary, with the user vocabulary compared against the system vocabulary after a certain time interval.
  • In other words, step 101 records the word selected by the user and the encoded string the user entered for it.
  • The encoded string may be a pinyin code or a shape-based code; that is, the present invention can be applied to various input methods, although it is preferably applied to Chinese input methods that take such codes as input.
  • The words selected by the user will include some words for which the user habitually types a dialect-influenced phonetic code. For example, for the word "Fold", the user enters the encoded string "zhele", taking it to be correct, but in the input method's original thesaurus the encoded string corresponding to the word is "shele". The word therefore cannot be shown to the user directly among the candidate words, and the user has to select each character in turn to obtain the desired word.
  • There are many such words, for example "Turning heads" ("diaotou" and "tiaotou") and "urinary" ("niaoniao" and "niaosui"), and many more cases than can be enumerated. With this invention, such new encoded strings can be found as early as possible, improving the accuracy of the preferred word during user input.
  • The user can also use the manual word-creation function provided by an input method (for example, the Microsoft Pinyin input method or a double-pinyin input method) to create new encoded strings for words that the original thesaurus does not associate with them, so that the user can select the desired word during input.
  • For example, the word "Traditional" is a place name in Shanxi, and its corresponding encoded string is "fanshi".
  • In the input method, however, one of the characters in this word generally corresponds to the encoded strings "shi" and "zhi".
  • Step 102: Compare the word selected by the user and the encoded string entered by the user with the existing thesaurus.
  • This specification uses the term "system vocabulary" for the existing thesaurus; the system vocabulary stores the existing words and their corresponding encoded strings.
  • Step 103: If the word selected by the user exists in the system vocabulary, but the encoded string entered by the user differs from the encoded string stored for that word in the system vocabulary, determine that the encoded string entered by the user is a new encoded string corresponding to the word.
  • In this way, the user's coding habits can be obtained automatically and simply. Then, after the new encoded strings of multiple users and their corresponding words have been collected in any of various ways and filtering steps such as removing duplicate word records have been applied, new encoded strings of general significance can be obtained.
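The comparison in steps 102 and 103 can be sketched as a dictionary lookup. This is a minimal sketch under assumptions: the function name `detect_new_encoded_string` and the representation of the system vocabulary as a mapping from word to a set of encoded strings are hypothetical, and which of the two codes is the stored one in the sample data is chosen only for illustration.

```python
def detect_new_encoded_string(word, typed_code, system_vocabulary):
    """Return typed_code if it is a new encoded string for word, else None.

    system_vocabulary: dict mapping each existing word to the set of encoded
                       strings stored for it (assumed representation).
    """
    known_codes = system_vocabulary.get(word)
    if known_codes is None:
        # The word itself is not in the system vocabulary; step 103 only
        # covers existing words, so nothing is reported here.
        return None
    if typed_code in known_codes:
        return None          # the user typed an already-known code
    return typed_code        # existing word, previously unseen code


# Illustrative data only: one stored code for one existing word.
system_vocabulary = {"turning heads": {"diaotou"}}
found = detect_new_encoded_string("turning heads", "tiaotou", system_vocabulary)
```

The result for a single user only reflects that user's habit; the later collection and filtering steps decide whether the code is kept.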
  • The collecting may be: the input method user's computing device sends the user's new encoded strings and their corresponding words to the word collection computing device in real time or at scheduled times; that is, the input method computing device has an automatic sending module.
  • The collection computing device exists in the form of a server.
  • The collecting may also be the input method user sending the new encoded strings and their corresponding words to the collecting end periodically or irregularly; that is, the sending is initiated manually by the user. For example, each user sends his or her own new encoded strings and their corresponding words to a unified email address or unified server for collection.
  • The vocabulary storing the user's personal words can also be sent to the collection computing device in real time or periodically. For example, each user's vocabulary can be collected when the user backs it up to the server periodically or irregularly.
  • Collecting users' new encoded strings and their corresponding words is simpler when the input method system used by the user is itself a server: it can be used by multiple users, and the input behavior information of each user can be collected during use.
  • Any approach that accomplishes the information collection is feasible for the present invention, and they are not enumerated one by one here.
  • Step 101 further includes: recording the user word frequency in the user vocabulary during user input. The user vocabulary includes a plurality of word records, each including the word, the corresponding new encoded string, and the corresponding user word frequency.
  • The user word frequency in step 101 may be collected as follows: the counted word frequency information is corrected by a weight corresponding to the application in which the user is currently typing, to obtain the user word frequency.
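Application-dependent weighting of the user word frequency can be sketched as follows. The class name `UserVocabulary`, the application labels, and the weight values are hypothetical; the text states only that the counted frequency is corrected by a weight corresponding to the current application.

```python
from collections import defaultdict

# Hypothetical weights per application; the source does not give values.
APP_WEIGHTS = {"word_processor": 1.0, "chat": 0.5}
DEFAULT_WEIGHT = 1.0


class UserVocabulary:
    """Stores (word, encoded string) pairs with a weighted user word frequency."""

    def __init__(self):
        self.frequency = defaultdict(float)

    def record(self, word, code, application):
        # Each selection adds the weight of the application it was typed in,
        # instead of a flat count of 1.
        weight = APP_WEIGHTS.get(application, DEFAULT_WEIGHT)
        self.frequency[(word, code)] += weight


vocabulary = UserVocabulary()
vocabulary.record("turning heads", "tiaotou", "word_processor")
vocabulary.record("turning heads", "tiaotou", "chat")
weighted_total = vocabulary.frequency[("turning heads", "tiaotou")]
```

With these assumed weights, a selection made in a chat window counts for half as much as one made in a word processor.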
  • the method further includes:
  • Step 104: Collect the word records of each user that have a new encoded string, each record including the word, the corresponding new encoded string, and the corresponding user word frequency.
  • Step 105: Remove duplicate word records.
  • Step 106: Calculate the user cumulative word frequency corresponding to the encoded string.
  • The user cumulative word frequency can be calculated by simply summing each user's user word frequency for the collected and aggregated words.
  • Alternatively, the calculation may apply weight correction to each user's word frequency for a word before calculating the word's user cumulative word frequency. The weight correction can be derived by analyzing the word frequencies of the individual users for a given word: for example, the per-user word frequencies for the word are first analyzed to find their distribution trend, and each word frequency value is then corrected according to its probability of occurrence or its deviation from the average range.
  • A cumulative word frequency calculated with this correction can eliminate the effect of some users' accidental or malicious behavior and give an objective and accurate user cumulative word frequency, thereby ensuring the accuracy of the thesaurus.
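The two ways of computing the user cumulative word frequency can be sketched as below. The simple sum follows the text directly. The corrected version is only one possible reading of "weight correction": capping each user's contribution at a multiple of the median is an assumption made for this sketch, and the function names and the `cap_multiple` parameter are hypothetical.

```python
from statistics import median


def cumulative_simple(per_user_freqs):
    """Plain superposition of every user's word frequency."""
    return sum(per_user_freqs)


def cumulative_corrected(per_user_freqs, cap_multiple=3.0):
    """Sum after limiting values that deviate far from the typical range.

    Assumed correction: any single user's frequency is capped at
    cap_multiple times the median, so one accidental or malicious user
    cannot dominate the total.
    """
    if not per_user_freqs:
        return 0.0
    cap = cap_multiple * median(per_user_freqs)
    return sum(min(freq, cap) for freq in per_user_freqs)


# Three ordinary users and one outlier for the same (word, encoded string).
per_user = [2, 3, 2, 1000]
simple_total = cumulative_simple(per_user)
corrected_total = cumulative_corrected(per_user)
```

For this sample the median is 2.5, so the outlier contributes at most 7.5 to the corrected total instead of 1000.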
  • Step 107: Remove encoded strings whose user cumulative word frequency is less than or equal to a preset threshold. This step removes the idiosyncratic input habits of individual users that have no general significance, and ensures the objectivity and accuracy of the new encoded strings obtained.
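Steps 104 to 107 taken together can be sketched as one aggregation pass. The function name `aggregate_and_filter` and the record layout `(user_id, word, code, freq)` are hypothetical, and treating a "duplicate" as the same record submitted more than once by the same user is an assumption; records from different users for the same word and code are merged into one cumulative entry.

```python
from collections import defaultdict


def aggregate_and_filter(records, threshold):
    """Collect records, drop duplicates, sum frequencies, apply the threshold.

    records:   iterable of (user_id, word, code, freq) tuples (assumed layout).
    threshold: entries whose cumulative frequency is <= threshold are removed.
    """
    seen = set()
    cumulative = defaultdict(float)
    for user_id, word, code, freq in records:
        key = (user_id, word, code)
        if key in seen:
            continue                      # step 105: skip duplicate record
        seen.add(key)
        cumulative[(word, code)] += freq  # step 106: simple superposition
    # Step 107: keep only entries strictly above the threshold.
    return {pair: total for pair, total in cumulative.items()
            if total > threshold}


records = [
    ("u1", "turning heads", "tiaotou", 3),
    ("u1", "turning heads", "tiaotou", 3),   # duplicate submission
    ("u2", "turning heads", "tiaotou", 4),
    ("u3", "turning heads", "tiaotuo", 1),   # likely a one-off typing slip
]
kept = aggregate_and_filter(records, threshold=2)
```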
  • Step 108: Count the number of times the words in the filtered word records appear in the preset Internet page database, to obtain the Internet word frequency.
  • Step 108 may further include a weight assignment step: assigning a weight to each Internet page, and storing the Internet pages whose weight value is greater than or equal to a preset threshold in the Internet page database.
  • This process is optional; its purpose is to obtain a database of selected Internet pages so as to ensure the accuracy of the word screening. Of course, other methods can also be used to form the preset Internet page database.
  • In the weighting step, an important case is assigning the weight value according to when the web page was created and the type of the web page. Web page time matters greatly for Internet word frequency statistics, so it has the larger effect on the weight value: the further the page's time is from the time of the word frequency statistics, the lower the weight value. If the time difference exceeds a certain value, the page can be given a lower weight value or even be excluded from the word frequency statistics. Second, the type of web page also has a large influence on the word frequency statistics.
  • Web page type generally refers to portal websites, forums, or certain other designated web pages. These pages receive a higher weight value because they have more participants and more information.
  • A rule base can be set up that stores the URLs of certain web pages, marking the pages at those URLs as more important for word frequency statistics; the words appearing on these pages are preferred, and these pages are given a greater weight value.
  • The present invention can further exclude duplicate web pages, pornographic web pages, and spam web pages by giving them lower weight values, thereby further ensuring the accuracy of new word verification.
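The page weighting described above can be sketched as follows. All numbers, the type labels, the linear decay with age, and the function names `page_weight` and `select_pages` are hypothetical; the text fixes only the direction of each effect (older pages and low-quality pages weigh less, portals, forums, and rule-base URLs weigh more). Giving low-quality pages a weight of zero is a simplification of "lower weight values".

```python
# Hypothetical type weights; the source only says portals and forums rank higher.
TYPE_WEIGHTS = {"portal": 1.0, "forum": 0.8}
DEFAULT_TYPE_WEIGHT = 0.4


def page_weight(age_days, page_type, url, preferred_urls=frozenset(),
                low_quality=False, max_age_days=365):
    """Weight a page by age and type (assumed linear decay with age)."""
    if low_quality or age_days > max_age_days:
        return 0.0                      # duplicate/spam pages or pages too old
    recency = 1.0 - age_days / max_age_days
    if url in preferred_urls:
        type_weight = 1.0               # URL listed in the rule base
    else:
        type_weight = TYPE_WEIGHTS.get(page_type, DEFAULT_TYPE_WEIGHT)
    return recency * type_weight


def select_pages(pages, threshold):
    """Keep the pages whose weight is greater than or equal to the threshold."""
    return [page for page in pages if page_weight(**page) >= threshold]


pages = [
    {"age_days": 0, "page_type": "portal", "url": "a"},
    {"age_days": 400, "page_type": "portal", "url": "b"},
    {"age_days": 10, "page_type": "blog", "url": "c", "low_quality": True},
]
selected_urls = [page["url"] for page in select_pages(pages, threshold=0.5)]
```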
  • The thesaurus can be set up to include the word, the word's Internet word frequency, and the word's user cumulative word frequencies.
  • For example, the word "heavy" has one Internet word frequency and two user cumulative word frequencies in the thesaurus, corresponding to "chongchong" and "zhongzhong".
  • The Internet word frequency can improve the accuracy of word frequency, but a word found on the Internet does not reflect its encoded string; the user cumulative word frequency reflects users' input habits and improves the hit rate of the preferred word.
  • Of course, step 108 may be omitted, and the thesaurus may instead be set up to include the word, the word's original word frequency, and the word's corresponding user cumulative word frequency.
  • In the scheme above, one word corresponds to two kinds of word frequency data, which complicates use, since both kinds of data are needed to achieve the best effect. The preferred embodiment shown in FIG. 1 may therefore further include step 109, which merges the two kinds of word frequency data into a single word frequency.
  • Step 109 Allocate two or more corresponding encoded character strings of the Internet word frequency to the word according to the ratio of the user cumulative word frequency of the new encoded character string of the word to the cumulative word frequency of the original encoded character string. That is to say, the words appearing in the Internet correspond to two or more corresponding encoded character strings, and according to the difference of the cumulative word frequency of the input character string input by the user, the Internet word frequency reflecting the total word frequency of the word is allocated to the two words of the word.
  • One or more corresponding encoding strings thereby objectively and accurately reflecting the user's input habits, and improving the accuracy of the preferred words in the user input process.
  • step 109 only gives an example in the allocation of Internet word frequency.
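The proportional allocation described in step 109 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the dictionary layout, the sample numbers, and the even-split fallback for a word with no usage data are all assumptions.

```python
def allocate_internet_frequency(internet_freq, user_cumulative_freqs):
    """Split one Internet word frequency across a word's encoded strings
    in proportion to each string's user cumulative word frequency.

    user_cumulative_freqs maps encoded string -> user cumulative word frequency.
    """
    if not user_cumulative_freqs:
        return {}
    total = sum(user_cumulative_freqs.values())
    if total == 0:
        # Assumption: with no usage data at all, split evenly.
        share = internet_freq / len(user_cumulative_freqs)
        return {code: share for code in user_cumulative_freqs}
    return {code: internet_freq * freq / total
            for code, freq in user_cumulative_freqs.items()}


# Hypothetical numbers for the two encodings of the word "heavy".
shares = allocate_internet_frequency(
    1000, {"chongchong": 300, "zhongzhong": 100})
print(shares)
```

With these made-up inputs, the encoding that users type three times as often receives three quarters of the Internet word frequency.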
  • Step 110 Generate a new thesaurus from the filtered word records, or add the filtered word records to the original thesaurus, to obtain a new thesaurus or a new version of the full thesaurus.
  • the word record includes the word, the corresponding new encoded string, and corresponding word frequency information.
  • the embodiment shown in FIG. 1 can be used to collect new encoded character strings from users nationwide and then derive a new thesaurus or a new version of the full thesaurus suitable for most people, thereby improving the input experience of users in various regions.
  • the embodiment shown in FIG. 1 can also be used as follows: new encoded character strings are still collected from users, but the collected information further includes the area in which each user is located, and users are divided into several areas; the filtering step is performed for each area; and a new regional vocabulary or a new version of the regional full vocabulary is generated for each area. That is, the different pronunciations of people in each area can be counted separately, and either an input method system with a different encoding version can be provided for each area, or the user can set his or her area in the input method system, thereby meeting the encoding input habits of users in each area in a more personalized way.
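A rough sketch of the per-area variant follows, assuming each collected record carries a region label. The record layout, the function name, and the threshold value are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def regional_lexicons(records, threshold):
    """Build one filtered lexicon per region.

    records: iterable of (region, word, code, user_word_frequency).
    Entries whose cumulative frequency within a region is less than or
    equal to the threshold are dropped from that region's lexicon.
    """
    per_region = defaultdict(lambda: defaultdict(int))
    for region, word, code, freq in records:
        per_region[region][(word, code)] += freq
    return {
        region: {key: f for key, f in entries.items() if f > threshold}
        for region, entries in per_region.items()
    }


records = [
    ("north", "折了", "zhele", 5),
    ("south", "折了", "shele", 4),
    ("south", "折了", "shele", 3),
    ("south", "和牌", "hupai", 1),
]
lexicons = regional_lexicons(records, threshold=2)
print(lexicons)
```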
  • the new thesaurus obtained in the above steps or the new version of the full thesaurus can be used to update the input method.
  • the second computing device that stores the new thesaurus or the new version of the full thesaurus may exist in the network in the form of a server and provide a vocabulary update service to any client program that needs the new word information. Of course, it need not take the form of a fixed server. It can also reside in a local computing device and provide the vocabulary update service, through P2P (peer-to-peer) technology, to any client program on another terminal that needs the new word information.
  • the updating may be performed by: updating the system vocabulary at the same time as the input method system is updated; or updating the system vocabulary online by having the server actively push the data; or having the user initiate a request, with the server returning data according to the request to update the system vocabulary.
  • various data update methods may be used, and the present invention is not limited thereto, and those skilled in the art may select them according to needs.
  • the unit of the input method system that receives user input information and displays the corresponding characters may be located in a first computing device; the obtained new thesaurus or new version of the full thesaurus serves as the system vocabulary of the input method system and is located in a second computing device; the input method system obtains corresponding information from the system vocabulary located in the second computing device according to the information input by the user, and displays the corresponding characters in the first computing device, completing the text input.
  • the new thesaurus or the new version of the full thesaurus obtained according to the present invention can be used directly as the system vocabulary of the input method system, and as an online thesaurus it can be used without update operations.
  • the input method system is divided into two parts: the receiving and displaying unit is located in the first computing device, and the thesaurus information is located in the second computing device, which allows the input method to be used online; of course, the encoding matching process required by the input method system can be placed in either computing device as needed.
  • Referring to FIG. 2, a flow chart of another method for obtaining a new encoded character string of an input method word according to the present invention is shown, which includes the following steps:
  • Step 201 Extract a word selected by the user in the input process, and a coded character string input by the user, and store the coded character string in the user vocabulary;
  • Step 202 Collect a user vocabulary of each user
  • Step 203 Compare the collected user vocabulary and the input method system vocabulary, where the system vocabulary stores words and corresponding code strings;
  • Step 204 Determine, according to the preset rule, a new coded string corresponding to the word.
  • the preset rule may be: if the word selected by the user exists in the system vocabulary, but the encoded string input by the user is different from the encoded string corresponding to the word stored in the system vocabulary, the encoded string input by the user is determined to be the new encoded string corresponding to the word.
  • the preset rule may also be: if both the word selected by the user and the encoded string input by the user exist in the existing thesaurus, the user cumulative word frequency and the system word frequency of the corresponding encoded string of the word are further compared, the system word frequency being the word frequency information preset in the existing thesaurus for the existing word. If the ratio of the user cumulative word frequency to the system word frequency is greater than or equal to a predetermined threshold, the encoded string input by the user is determined to be the new encoded string corresponding to the word.
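The two preset rules above can be sketched as a single check. This is only an illustration under stated assumptions: the lexicon layout (word mapped to its encoded strings and system word frequencies), the function name, and the ratio threshold of 0.5 are hypothetical, and the source does not say how words missing from the system vocabulary are handled, so the sketch simply ignores them.

```python
def find_new_code(word, code, user_cum_freq, system_lexicon,
                  ratio_threshold=0.5):
    """Return `code` if it qualifies as a new encoded string, else None.

    system_lexicon maps word -> {encoded string: system word frequency}.
    """
    known_codes = system_lexicon.get(word)
    if known_codes is None:
        # Word not in the system vocabulary: outside the two rules.
        return None
    if code not in known_codes:
        # Rule 1: the word exists, but with a different encoded string.
        return code
    system_freq = known_codes[code]
    # Rule 2: word and code both exist; compare the frequency ratio.
    if system_freq > 0 and user_cum_freq / system_freq >= ratio_threshold:
        return code
    return None


lexicon = {"折了": {"zhele": 100}}
print(find_new_code("折了", "shele", 5, lexicon))   # rule 1 applies
print(find_new_code("折了", "zhele", 10, lexicon))  # ratio 0.1, below threshold
```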
  • the embodiment shown in FIG. 2 is basically similar to the concept of the embodiment shown in FIG. 1.
  • the main difference is that the user vocabularies of multiple users are collected first, the comparison is then performed centrally, and the users' new encoded character strings are obtained from the comparison result.
  • This method reduces the number of comparison calculations and the burden on the local input method system, and it can be used directly with an existing input method system; however, the comparison is performed only after a large number of user-selected words have been collected, which increases the burden on the server system. Those skilled in the art can choose between the two approaches as needed.
  • the embodiment shown in FIG. 2 may further include a filtering step: calculating, from the collected user vocabularies, the user cumulative word frequency corresponding to each encoded character string; and removing the encoded character strings whose user cumulative word frequency is less than or equal to the preset threshold.
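As a minimal sketch of this filtering step, assume each user vocabulary is a mapping from a (word, encoded string) pair to that user's word frequency. The layout, the function name, and the threshold value are illustrative assumptions.

```python
from collections import Counter


def accumulate_and_filter(user_vocabularies, threshold):
    """Sum word frequencies over all users, then drop rare encodings.

    user_vocabularies: iterable of dicts mapping (word, code) -> frequency.
    Returns only the entries whose user cumulative word frequency is
    strictly greater than the threshold.
    """
    cumulative = Counter()
    for vocabulary in user_vocabularies:
        for key, freq in vocabulary.items():
            cumulative[key] += freq
    return {key: f for key, f in cumulative.items() if f > threshold}


users = [
    {("落下", "laxia"): 3, ("和牌", "hupai"): 1},
    {("落下", "laxia"): 4},
]
kept = accumulate_and_filter(users, threshold=2)
print(kept)
```

Filtering on the cumulative value rather than per user keeps an encoding that many users each type occasionally, while discarding one-off typing errors.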
  • the embodiment shown in FIG. 2 may further include a word frequency assignment step: counting the number of occurrences of the word having the new encoded character string in the preset Internet page database to obtain the Internet word frequency; comparing the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string; and, according to the comparison result, allocating the Internet word frequency to the two or more corresponding encoded character strings of the word.
  • Referring to FIG. 3, a structural block diagram of an input method system according to the present invention is shown, which includes an input interface unit 301, a display unit 302, and a system vocabulary 303.
  • the input method system further includes:
  • a word extracting unit 304 connected to the input method system, for extracting a word selected by the user and a coded string input by the user during the user input process;
  • the word matching unit 305 is connected to the word extracting unit 304 and is configured to compare the word selected by the user and the encoded string input by the user with the system vocabulary, wherein the system vocabulary stores words and their corresponding encoded strings, and to determine, according to the preset rule, the new encoded string corresponding to the word.
  • the preset rule may be: if the word selected by the user exists in the system vocabulary, but the encoded string input by the user is different from the encoded string corresponding to the word stored in the system vocabulary, the encoded string input by the user is determined to be a new encoded string corresponding to the word.
  • the input method system can be used to extract a new code string of the user in addition to the ordinary word input.
  • the input method system may be a common input method system.
  • the input interface unit 301, the display unit 302, and the system vocabulary 303 of the input method system are located in the same computing device, and the input method system displays the corresponding characters locally through local query matching, based on the encoded information input by the user.
  • the input method system may also be a network input method system.
  • the input interface unit 301 and the display unit 302 of the input method system are located in a first computing device, and the system vocabulary 303 is located in a second computing device. The system obtains corresponding information from the second computing device according to the information input by the user, and displays the corresponding character in the first computing device.
  • the input method system preferably further comprises: a communication unit
  • the input method system preferably further includes: a word frequency recording unit 307, connected to the input method system and configured to record the user word frequency during user input, the user word frequency being frequency information corresponding to the code string input by the user; and a user vocabulary 308, configured to store the word selected by the user, the code string input by the user, and the corresponding user word frequency.
  • the input interface unit 301 in the above input method system is mainly used to provide the user with information input and word selection; it can also be used for switching various modes, for example: input language switching (such as switching between Simplified and Traditional Chinese, or between Chinese and English), input mode switching (such as switching between single-character input, word input, and sentence input), input state switching (such as switching between text, punctuation, and special symbols), and so on.
  • the display unit 302 and the system vocabulary 303 are all well known to those skilled in the art and will not be described in detail herein.
  • the input method system shown in FIG. 3 may further include: an application determining unit 309, configured to determine the application in which the user is currently inputting and send the determination result to the word frequency recording unit 307; the word frequency recording unit 307 is configured to, during user input, count word frequency information separately for each application, apply the corresponding weight correction to each count, and form the user word frequency. That is, the input method system can assign a weight according to the application in which the user is currently inputting and count word frequency accordingly. For example, since the preferred method of the present invention can obtain the Internet word frequency statistically, words that the user inputs in an online community forum can already be counted from the Internet and can therefore be given a relatively low weight value.
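The application-dependent weighting can be sketched as a weighted count. The application labels and weight values below are hypothetical; the source only states that input in online forums may receive a relatively low weight.

```python
from collections import defaultdict

# Hypothetical weights: forum text is already reflected in the Internet
# word frequency, so it counts for less here.
APP_WEIGHTS = {"forum": 0.5, "chat": 1.0}
DEFAULT_WEIGHT = 1.0  # assumption for applications without a listed weight


def user_word_frequency(events):
    """Accumulate a weighted user word frequency.

    events: iterable of (word, code, application) tuples, one per time the
    user selected `word` after typing `code` in `application`.
    """
    freq = defaultdict(float)
    for word, code, app in events:
        freq[(word, code)] += APP_WEIGHTS.get(app, DEFAULT_WEIGHT)
    return dict(freq)


events = [
    ("折了", "shele", "chat"),
    ("折了", "shele", "forum"),
    ("折了", "shele", "editor"),
]
weighted = user_word_frequency(events)
print(weighted)
```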
  • FIG. 4 is a structural block diagram of a thesaurus generating apparatus of the present invention, which includes the following components: a collecting unit 401, configured to collect the word records of each user that have a new encoded character string, each word record including the word and its corresponding new encoded string.
  • the vocabulary generating means can be implemented by a server, and the collecting can be implemented by various methods as described above.
  • the user's word records having new encoded character strings can be obtained by the input method and sent automatically to the collecting unit 401; or they can be set or organized by the user and sent to the collecting unit 401; or the word records of each user having new encoded character strings can be gathered in a fixed network space, from which the collecting unit 401 acquires them. That is, the word records having new encoded character strings in this embodiment are not necessarily obtained from user input behavior, and may instead be set or organized by the user.
  • a first filtering unit 402 configured to remove duplicate word records
  • the thesaurus generating unit 403 is configured to generate a new thesaurus according to the filtered word records or add the filtered word records to the original thesaurus to obtain a new thesaurus or a new version of the full thesaurus.
  • the vocabulary generating device is further configured to: collect the user word frequency in the user input behavior, the user word frequency being the frequency information corresponding to the encoded character string input by the user; the cumulative word frequency calculation unit 404 is configured to calculate the user cumulative word frequency corresponding to the encoded character string; and the second filtering unit 405 is configured to remove encoded character strings whose user cumulative word frequency is less than or equal to the preset threshold. Preferably, the user word frequency statistics may also be weighted according to the application in which the user is currently inputting.
  • the vocabulary generating device preferably further includes:
  • the Internet page database generating unit 406 is configured to perform weight assignment on the Internet page; and store the Internet page whose weight value is greater than or equal to the preset threshold to the Internet page database.
  • the statistic unit 407 is configured to count the number of occurrences of the words in the filtered word record in the preset Internet page database, and obtain the Internet word frequency.
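Units 406 and 407 can be sketched together as follows. The sketch assumes pages are plain text strings and that some scoring function assigns each page a weight; `weight_of`, the threshold, and the sample pages are hypothetical, and counting by substring match is a simplification.

```python
def build_page_database(pages, weight_of, threshold):
    """Keep only pages whose weight is at least the threshold (unit 406)."""
    return [text for text in pages if weight_of(text) >= threshold]


def internet_word_frequency(word, page_database):
    """Count occurrences of `word` across the kept pages (unit 407)."""
    return sum(text.count(word) for text in page_database)


pages = ["落下 落下 和牌", "spam 落下"]


def weight_of(text):
    # Hypothetical scorer: pages flagged as spam get a low weight.
    return 0.1 if "spam" in text else 1.0


database = build_page_database(pages, weight_of, threshold=0.5)
print(internet_word_frequency("落下", database))
```

The occurrence in the low-weight page is excluded because that page never enters the database.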
  • a word frequency allocation unit 408, configured to compare the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string and, according to the comparison result, allocate the Internet word frequency to the two or more corresponding encoded character strings of the word.
  • the user cumulative word frequency corresponding to the original encoded character string may be obtained by other means; alternatively, the collecting unit 401 may also collect the original encoded character string of the word and the corresponding user word frequency information, and the user cumulative word frequency is obtained by calculation over the user word frequencies of the individual users.
  • the collecting unit 501 is configured to collect input behavior information of each user, where the input behavior information includes a word selected by the user, an encoded character string input by the user, and a corresponding user word frequency, the user word frequency being frequency information of the user inputting the word and the corresponding encoded character string;
  • the cumulative word frequency calculation unit 502 performs weight correction on each user's word frequency for the word and encoded character string, and calculates the user cumulative word frequency of the word and encoded character string across all users;
  • the thesaurus generating unit 503 is configured to generate a thesaurus, the thesaurus includes a word, an encoded string, and corresponding word frequency information.
  • the thesaurus generating apparatus shown in FIG. 5 may further include: a matching unit 504, configured to compare the generated thesaurus with the existing thesaurus, wherein the existing thesaurus stores words, encoded strings, and the corresponding system word frequencies; and a new code string determining unit 505, configured to determine a new code string corresponding to the word according to the preset rule. The thesaurus generating device can thereby acquire new encoded strings.
  • the vocabulary generating device further includes: a filtering unit 506, configured to remove an encoded character string whose user accumulated word frequency is less than or equal to a preset threshold.
  • the vocabulary generating device preferably further includes:
  • the statistical unit 507 counts the number of occurrences of the word with the new encoded string in the preset Internet page database, and obtains the Internet word frequency.
  • the Internet page database is formed by assigning weights to Internet pages and storing the Internet pages whose weight value is greater than or equal to the preset threshold.
  • the word frequency allocation unit 508 is configured to compare the user cumulative word frequency of the new encoded string of the word with the user cumulative word frequency of the original encoded string and, according to the comparison result, allocate the Internet word frequency to the two or more corresponding encoded strings of the word.
  • Referring to FIG. 6, another thesaurus generating apparatus is shown, which includes the following components:
  • the collecting unit 601 is configured to collect input behavior information of each user, where the input behavior information includes a word selected by the user, an encoded string input by the user, and a corresponding user word frequency, the user word frequency being frequency information of the user inputting the word and the corresponding encoded string;
  • the cumulative word frequency calculation unit 602 performs weight correction on each user's word frequency for the word and encoded character string, and calculates the user cumulative word frequency of the word and encoded character string across all users;
  • the thesaurus generating unit 603 is configured to generate a thesaurus, the thesaurus includes words, encoded strings, and corresponding word frequency information.
  • the comparison unit 604 is configured to compare the generated thesaurus and the existing thesaurus, wherein the existing thesaurus stores the words, the encoded strings, and the corresponding system word frequencies;
  • an expired word determining unit 605, configured to determine expired words, an expired word being a word that does not exist in the generated thesaurus but exists in the existing thesaurus, or a word in the generated thesaurus whose user cumulative word frequency meets a preset condition, for example, a user cumulative word frequency less than or equal to a predetermined threshold.
  • the existing thesaurus can be streamlined according to these expired words to prevent it from growing ever larger, for example by filtering the expired words out of the existing thesaurus and deleting them, thereby reducing the size of the thesaurus, improving its utilization, and improving input efficiency.
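The expired-word check and the resulting pruning can be sketched as set operations. The data layout (the generated thesaurus as a mapping from word to user cumulative word frequency, the existing thesaurus as a set of words), the function name, and the threshold are assumptions made for illustration.

```python
def find_expired_words(generated, existing, threshold):
    """Return words considered expired.

    generated: dict mapping word -> user cumulative word frequency.
    existing:  set of words in the existing thesaurus.
    A word is expired if it is in `existing` but absent from `generated`,
    or if its cumulative frequency in `generated` is <= threshold.
    """
    missing = {word for word in existing if word not in generated}
    rare = {word for word, freq in generated.items() if freq <= threshold}
    return missing | rare


generated = {"a": 10, "b": 1}
existing = {"a", "b", "c"}
expired = find_expired_words(generated, existing, threshold=1)
pruned = existing - expired  # streamlined existing thesaurus
print(sorted(expired), sorted(pruned))
```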

Abstract

A method for obtaining a new encoded character string for an input method word comprises: extracting the word selected by the user during input and the encoded character string inputted by the user; comparing the word selected by the user and the encoded character string inputted by the user with the existing word base, wherein the existing word base stores existing words and their corresponding encoded character strings; and determining the new encoded character string corresponding to the word according to a preset rule.

Description

Method for obtaining a new encoded character string, input method system, and thesaurus generating device
This application claims priority to Chinese Patent Application No. 200610111562.9, filed with the Chinese Patent Office on August 23, 2006 and entitled "Method for obtaining a new encoded character string, input method system, and thesaurus generating device", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the field of input methods, and in particular to a method for obtaining a new encoded character string of an input method word, an input method system, and a thesaurus generating device.
Background art
Existing input method systems match the words a user needs based on the encoded character string the user inputs; examples include Chinese, Japanese, and Korean input method systems. In the system vocabulary of an existing input method, one corresponding encoded character string is set for each word, and the user can obtain the desired word only by inputting the correct encoded character string.
However, users go through a learning process for the correct encoded character strings, and it is difficult to guarantee that every correspondence between encoded character strings and words that a user has in mind is correct. To improve fault tolerance and accommodate some users' encoding habits, existing input method systems therefore offer a fuzzy sound solution, for example z = zh, s = sh, in = ing, and so on. The fuzzy sound solution can certainly resolve some of the problems arising from north-south language differences. However, because every region has its own dialect (especially for a language such as Chinese, which has many dialects), users who input words by pinyin code will to some degree input inaccurate codes, and the fuzzy sound solution cannot solve all of these problems. For example, for the word "折了", some users habitually input "shele" and others "zhele"; for "落下", some cases call for "laxia" and others for "luoxia"; and "和牌" corresponds to both "hupai" and "hepai". None of these can be handled by fuzzy sounds. The system vocabulary of an input method cannot know all dialect habits, so users must repeatedly pick the desired word from a low-ranked position among the candidates, which seriously slows their input.
Therefore, how to learn users' dialect usage habits as quickly and as fully as possible, and thereby improve the hit rate of the input method system's preferred word, has become one of the technical problems that those skilled in the art urgently need to solve.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and apparatus for obtaining a new encoded character string of an input method word, which can obtain the new encoded character strings used by individual users and aggregate them to generate a thesaurus, thereby accommodating users' habitual new encoded character strings and improving the hit rate of the user's preferred word.
Another object of the present invention is to provide an input method system that can automatically obtain, simply, conveniently, promptly and effectively, the encoded character strings a user habitually uses for certain words, so that the new encoded character strings used by each user can be obtained by comparison.
A further object of the present invention is to provide a thesaurus generating device that can efficiently provide a relatively accurate full thesaurus or new thesaurus containing new encoded character strings suited to users' input habits.
To solve the above technical problem, the present invention provides a method for obtaining a new encoded character string of an input method word, comprising: extracting the word selected by the user during input and the encoded character string input by the user; comparing the word selected by the user and the encoded character string input by the user with the existing thesaurus, the existing thesaurus storing existing words and their corresponding encoded character strings; and determining, according to a preset rule, the new encoded character string corresponding to the word.
Preferably, the method further comprises: recording the word selected by the user and the encoded character string input by the user in a user vocabulary; and, during user input, recording the user word frequency in the user vocabulary, the user word frequency being frequency information of the user inputting the word and its corresponding encoded character string.
Preferably, the method further comprises: counting word frequency information with a corresponding weight correction for each application in which the user is inputting, to obtain the user word frequency.
Preferably, the method further comprises: collecting the word records of each user that have a new encoded character string, each record including the word, the corresponding new encoded character string, and the corresponding word frequency information; and removing duplicate word records.
Preferably, the method further comprises: calculating the user cumulative word frequency; and removing encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold.
Preferably, the method further comprises: counting the number of occurrences of the words in the filtered word records in a preset Internet page database to obtain the Internet word frequency.
Preferably, the method further comprises: comparing the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string and, according to the comparison result, allocating the Internet word frequency to the two or more corresponding encoded character strings of the word.
Preferably, the method further comprises: generating a new thesaurus from the filtered word records, or adding the filtered word records to the original thesaurus, to obtain a new thesaurus or a new version of the full thesaurus.
The collected information may further include the area in which the user is located, and users are divided into several areas; the filtering step is performed for each area; and a new regional thesaurus or a new version of the regional full thesaurus is generated for each area.
Preferably, the preset Internet page database is obtained by the following steps: assigning weights to Internet pages; and storing the Internet pages whose weight value is greater than or equal to a preset threshold in the Internet page database.
The collecting is performed as follows: the input method computing device sends the user's word records having new encoded character strings to the collecting computing device, either in real time or at scheduled times.
本发明还提供了一种获取输入法字词的新编码字符串的方法, 包括: 提取用户在输入过程中所选择的字词, 以及用户输入的编码字符串, 并存 储至用户词库; 收集各个用户的用户词库; 对比所述收集的用户词库和输入法 现有词库,所述系统词库中存储有字词及其相应的编码字符串;根据预置规则, 确定字词相应的新编码字符串。  The present invention also provides a method for obtaining a new encoded character string of an input method word, comprising: extracting a word selected by a user in an input process, and a coded string input by a user, and storing the same in a user vocabulary; User vocabulary of each user; comparing the collected user vocabulary and the input vocabulary of the input method, the system vocabulary stores the words and their corresponding encoded strings; according to the preset rules, determining the corresponding words The new encoded string.
优选的, 所述的方法, 还包括: 所述用户词库中还包括用户词频, 所述用 户词频为用户输入该字词及其相应编码字符串的频率信息; 计算用户累积词 频; 去除用户累积词频小于或者等于预置阔值的编码字符串。  Preferably, the method further includes: the user vocabulary further includes a user word frequency, wherein the user word frequency is frequency information of the user inputting the word and the corresponding encoded character string; calculating a cumulative word frequency of the user; An encoded string whose word frequency is less than or equal to the preset threshold.
其中, 所述预置的规则为:  The preset rule is:
如果用户所选字词在现有词库中存在,但是用户输入的编码字符串与现有 词库中存储的该字词相应的编码字符串不同,则确定用户输入的编码字符串为 该字词相应的新编码字符串; 则进一步比较该字词相应的编码字符串的用户累积词频和系统词频,所述系统 词频为在现有词库中预置的现有字词相应的词频信息,如果用户累积词频与系 统词频的比值大于或者等于预定阔值,则确定用户输入的编码字符串为该字词 相应的新编码字符串。  If the word selected by the user exists in the existing thesaurus, but the encoded string input by the user is different from the encoded string corresponding to the word stored in the existing thesaurus, it is determined that the encoded string input by the user is the word a new coded string corresponding to the word; further comparing the user cumulative word frequency and the system word frequency of the corresponding encoded string of the word, the system word frequency being the word frequency information corresponding to the existing word preset in the existing thesaurus, If the ratio of the cumulative word frequency of the user to the system word frequency is greater than or equal to the predetermined threshold, it is determined that the encoded character string input by the user is a new encoded character string corresponding to the word.
Preferably, the method further comprises: counting the number of times a word having a new encoded character string appears in a preset database of Internet pages, to obtain an Internet word frequency.
Preferably, the method further comprises: comparing the user cumulative word frequency of the word's new encoded character string with the user cumulative word frequency of its original encoded character string, and, according to the result of the comparison, allocating the word's Internet word frequency among its two or more corresponding encoded character strings.
The present invention further provides an input method system comprising an input interface unit, a display unit, and a system thesaurus, and further comprising: a word extraction unit, connected to the input method system, for extracting the words the user selects during input and the encoded character strings the user enters; and a word comparison unit, connected to the word extraction unit, for comparing the word selected by the user and the encoded character string entered by the user against the system thesaurus, in which words and their corresponding encoded character strings are stored, and for determining, according to a preset rule, the new encoded character string corresponding to the word.
Preferably, the input interface unit, the display unit, and the system thesaurus of the input method system are located in the same computing device; or the input interface unit and the display unit are located in a first computing device and the system thesaurus is located in a second computing device, in which case the input method system obtains the corresponding information from the second computing device according to the information entered by the user and displays the corresponding characters on the first computing device.
Preferably, the input method system further comprises: a communication unit for sending, in real time or at scheduled times, word records having new encoded character strings, each word record comprising the word and its corresponding new encoded character string.
Preferably, the input method system further comprises: a word frequency recording unit, connected to the input method system, for recording the user word frequency during user input, the user word frequency being frequency information on how often the user enters the word with its corresponding encoded character string; and a user thesaurus for storing the words selected by the user, the encoded character strings entered by the user, and the corresponding user word frequencies.
Preferably, the input method system further comprises: an application determination unit for identifying the application into which the user is currently typing and sending the result to the word frequency recording unit; and the word frequency recording unit, connected to the input method system, for counting word frequency information during user input after applying a weight correction that depends on the current application, to obtain the user word frequency.
The present invention further provides a thesaurus generating device, comprising: a word collection unit for collecting each user's word records having new encoded character strings, each word record comprising the word and its corresponding new encoded character string; a first filtering unit for removing duplicate word records; and a thesaurus generating unit for generating a new thesaurus from the filtered word records, or for adding the filtered word records to the original thesaurus, to obtain a new-word thesaurus or a new version of the full thesaurus.
Preferably, the device further comprises: a word frequency collection unit for collecting the user word frequency from user input behavior, the user word frequency being frequency information on how often the user enters the word with its corresponding encoded character string; a cumulative word frequency calculation unit for calculating the user cumulative word frequency; and a second filtering unit for removing encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold.
Preferably, the device further comprises: a statistics unit for counting the number of times the words in the filtered word records appear in a preset database of Internet pages, to obtain the Internet word frequency.
Preferably, the device further comprises: a word frequency allocation unit for comparing the user cumulative word frequency of the word's new encoded character string with the user cumulative word frequency of its original encoded character string, and, according to the result of the comparison, allocating the word's Internet word frequency among its two or more corresponding encoded character strings.
The present invention further provides a thesaurus generating device, comprising:
a collection unit for collecting the input behavior information of each user, the input behavior information comprising the words selected by the user, the encoded character strings entered by the user, and the corresponding user word frequencies, the user word frequency being frequency information on how often the user enters the word with its corresponding encoded character string;
a cumulative word frequency calculation unit for applying a weight correction to the individual user word frequencies of each word-and-encoded-string pair, and calculating the user cumulative word frequency of that pair as a whole; and
a thesaurus generating unit, the generated thesaurus comprising words, encoded character strings, and their corresponding word frequency information.
Preferably, the device further comprises: a comparison unit for comparing the generated thesaurus with the existing thesaurus, in which words, encoded character strings, and their corresponding system word frequencies are stored; and a determination unit for determining, according to a preset rule, the new encoded character string corresponding to a word.
Preferably, the device further comprises: a filtering unit for removing encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold; a statistics unit for counting the number of times a word having a new encoded character string appears in a preset database of Internet pages, to obtain the Internet word frequency; and a word frequency allocation unit for comparing the user cumulative word frequency of the word's new encoded character string with the user cumulative word frequency of its original encoded character string, and, according to the result of the comparison, allocating the word's Internet word frequency among its two or more corresponding encoded character strings.
Alternatively, and preferably, the device further comprises: a comparison unit for comparing the generated thesaurus with the existing thesaurus, in which words, encoded character strings, and their corresponding system word frequencies are stored; and a determination unit for determining expired words, an expired word being either a word that exists in the existing thesaurus but not in the generated thesaurus, or a word whose user cumulative word frequency in the generated thesaurus meets a preset condition.
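One way to picture the expired-word determination is the following Python sketch. The specification leaves the "preset condition" open; the sketch assumes it means a user cumulative word frequency at or below a minimum value. The function name, parameter names, and data layout are likewise hypothetical.

```python
def find_expired_words(generated, existing_words, min_freq=0):
    """Return the words of the existing thesaurus that look expired.

    generated: word -> user cumulative word frequency in the generated thesaurus.
    existing_words: iterable of words in the existing thesaurus.
    Assumption: the "preset condition" is a cumulative frequency <= min_freq.
    """
    expired = []
    for word in existing_words:
        if word not in generated:
            expired.append(word)      # no user entered the word at all
        elif generated[word] <= min_freq:
            expired.append(word)      # entered too rarely to keep
    return sorted(expired)


generated = {"a": 50, "b": 0}
expired = find_expired_words(generated, ["a", "b", "c"])
```

Here "b" is flagged because its cumulative frequency meets the assumed condition, and "c" because it is absent from the generated thesaurus.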
Compared with the prior art, the present invention has the following advantages:
First, the present invention proposes a distributed architecture comprising multiple user ends and one collection end. The words and encoded character strings entered by a user are extracted at the user end and compared with the existing thesaurus, which reveals the new encoded character strings that suit that user's habits. The new encoded character strings of all users and their corresponding words are then collected and aggregated, and after analysis and filtering, new encoded character strings of general significance are obtained. Because the invention approaches the problem from the perspective of user input, it can learn promptly and fairly comprehensively which new encoded character strings users actually employ while typing, including those that reflect users' dialect habits and those that are unknown and could not have been anticipated but are frequently used, thereby improving the accuracy of the first-choice candidate word.
Second, the present invention looks up the obtained new encoded character strings and their words in a selected database of Internet pages and counts their occurrences to obtain the Internet word frequency. The Internet word frequency is then corrected according to how the user word frequency is distributed between the word's new and old encoded character strings, and assigned to the new and old encoded character strings respectively. This yields a well-founded word frequency result and prevents the habits of some users from degrading the input efficiency and input experience of other, ordinary users.
Finally, the present invention can also be used to collect statistics on new encoded character strings only from users in a particular region, yielding the language habits or encoding habits of users in that region. This makes it possible to provide input method systems with different pronunciation or encoding versions for different regions, or to let users set their own region in the input method system so as to obtain the words they need.
Brief Description of the Drawings
FIG. 1 is a flow chart of the steps of a preferred embodiment of the method for obtaining a new encoded character string for an input method word;
FIG. 2 is a flow chart of the steps of another method for obtaining a new encoded character string for an input method word;
FIG. 3 is a structural block diagram of the input method system;
FIG. 4 is a structural block diagram of the thesaurus generating device;
FIG. 5 is a structural block diagram of a thesaurus generating device for determining new encoded character strings;
FIG. 6 is a structural block diagram of a thesaurus generating device for determining expired words.
Detailed Description
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1, a flow chart of the steps of a preferred embodiment of the method of the present invention for obtaining a new encoded character string for an input method word is shown. The method includes the following steps:
Step 101: during user input, extract the word selected by the user and the encoded character string entered by the user.
Step 101 can be performed by the input method system, which may use any feasible means to extract the word selected by the user and the encoded character string entered by the user during input. The extracted information may be passed directly to step 102, or it may first be stored in the user thesaurus, with the user thesaurus compared against the system thesaurus after a certain interval.
For languages whose text must be entered by means of codes, the user has to type an encoded character string and select the desired word from the candidates in order to complete the input. Step 101 records one kind of user input behavior information: the word selected by the user and the encoded character string the user entered for it. The encoded character string may be a phonetic (pinyin) code or a shape-based code; that is, the present invention is applicable to a variety of input methods, although it is preferably applied to Chinese input methods that use phonetic codes.
The words selected by a user will include some for which the user types a pinyin code that follows dialect habits. For example, for "折了" the user enters the encoded character string the user believes to be correct, "zhele", whereas the string stored for this word in the input method's original thesaurus is "shele". The word therefore cannot be shown to the user directly among the candidates, and the user has to select each character individually to obtain the desired word. There are many such words, for example "调头" ("diaotou", "tiaotou") and "尿尿" ("niaoniao", "niaosui"), as well as many cases that cannot be enumerated in advance. The present invention makes it possible to discover such new encoded character strings as quickly and as completely as possible, thereby improving the accuracy of the first-choice candidate word during user input.
Furthermore, a user can use the manual word-creation function provided by an input method (for example, the Microsoft Pinyin input method or a double-pinyin input method) to create new encoded character strings, absent from the original thesaurus, for words the user needs, so that the desired word can be selected during input. Take the word "繁峙", a place name in Shanxi. Its usual encoded character string in input methods is "fanshi", and the character "峙" usually corresponds to the two encoded character strings "shi" and "zhi". Local people in that area, however, are generally accustomed to using "fansi" for "繁峙", while existing input methods usually have no "si" encoding for the character "峙". The user can therefore use the manual word-creation function to associate "峙" with "si", or "繁峙" with "fansi". The present invention can likewise pick out, from the words the user selects and the encoded character strings the user enters, the encoded character strings the user has created for a word.
Step 102: compare the word selected by the user and the encoded character string entered by the user against the existing thesaurus. Throughout this description the system thesaurus is used to stand for the existing thesaurus, because a typical system thesaurus stores the existing words and their corresponding encoded character strings.
Step 103: if the word selected by the user exists in the system thesaurus, but the encoded character string entered by the user differs from the encoded character string stored for that word in the system thesaurus, determine that the encoded character string entered by the user is a new encoded character string for the word.
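Steps 101 to 103 can be illustrated with a minimal Python sketch. It assumes that the extracted input behavior is available as (word, encoded string) pairs and that the system thesaurus maps each word to the set of its known encoded strings; both the representation and the function name are assumptions made for this example.

```python
def find_new_encodings(selections, system_thesaurus):
    """Apply the rule of step 103 to a sequence of user selections.

    selections: iterable of (word, typed_encoded_string) pairs (step 101).
    system_thesaurus: word -> set of known encoded strings (hypothetical layout).
    Returns each (word, new_encoded_string) pair once, in order of first use.
    """
    found = []
    for word, code in selections:
        known_codes = system_thesaurus.get(word)      # step 102: comparison
        if known_codes is None:
            continue                                  # word not in the system thesaurus
        if code not in known_codes and (word, code) not in found:
            found.append((word, code))                # step 103: new encoding
    return found


system_thesaurus = {"折了": {"shele"}}
selections = [("折了", "zhele"), ("折了", "zhele"), ("折了", "shele"), ("新词", "xinci")]
new_encodings = find_new_encodings(selections, system_thesaurus)
```

In this sketch a word that is missing from the system thesaurus altogether is skipped, since step 103 only covers words that already exist there.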
Steps 101 to 103 above make it possible to learn a user's encoding habits automatically, simply, and conveniently. The new encoded character strings of multiple users and their corresponding words can then be collected by various means, and after filtering steps such as removing duplicate word records, new encoded character strings of general significance are obtained.
The collection may proceed as follows: the input method user's computing device sends the user's new encoded character strings and their corresponding words to a word collection computing device in real time or at scheduled times; that is, the input method computing device preferably has an automatic sending module. Preferably, the collection computing device takes the form of a server.
Alternatively, input method users may send their own new encoded character strings and corresponding words to the collection end, at regular or irregular intervals; that is, the sending is initiated manually by the user. For example, each user may send their new encoded character strings and corresponding words to a common e-mail address or a common server for collection. Where a thesaurus storing the user's personalized words is available, that thesaurus can be sent to the collection computing device in real time or at scheduled times; for example, collection can be achieved by each user backing up the thesaurus to a server at regular or irregular intervals.
Furthermore, for a network input method (one that provides the user only with an input interface and a display interface and completes the entire input process by connecting to a server), collecting users' new encoded character strings and corresponding words is simpler still. The input method system used by the user is itself a server that can serve multiple users, so the input behavior information of each user can be collected during use.
In practice, any means capable of collecting the information can be used with the present invention; they are not enumerated one by one here.
To achieve the best effect, FIG. 1 shows an embodiment that refines the above steps. In the preferred embodiment shown in FIG. 1, step 101 further includes: during user input, recording the user word frequency in the user thesaurus. The user thesaurus includes multiple word records, each comprising the word, the corresponding new encoded character string, and the corresponding user word frequency. Preferably, the user word frequency may be collected in step 101 as follows: the word frequency information is counted after applying a weight correction that depends on the application into which the user is currently typing, yielding the user word frequency.
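The application-dependent weighting of the user word frequency could, for instance, be sketched in Python as follows. The application names, the weight values, and the class and method names are hypothetical; the specification only states that a weight depending on the current application is applied before counting.

```python
from collections import defaultdict

# Hypothetical weights per application; the specification gives no concrete values.
APP_WEIGHTS = {"editor": 1.0, "chat": 0.5}
DEFAULT_WEIGHT = 0.75


class UserThesaurus:
    """Stores (word, encoded string) pairs with their weighted user word frequency."""

    def __init__(self):
        self.freq = defaultdict(float)

    def record(self, word, code, app):
        # Each input event counts as the weight of the application it occurred in.
        self.freq[(word, code)] += APP_WEIGHTS.get(app, DEFAULT_WEIGHT)


user_thesaurus = UserThesaurus()
user_thesaurus.record("折了", "zhele", "editor")
user_thesaurus.record("折了", "zhele", "chat")
```

After these two events the pair ("折了", "zhele") carries a user word frequency of 1.5 rather than 2, because the input made in the chat application is given less weight under the assumed table.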
The preferred embodiment shown in FIG. 1 further includes:
Step 104: collect each user's word records having new encoded character strings, each record comprising the word, the corresponding new encoded character string, and the corresponding user word frequency.
Step 105: remove duplicate word records.
Step 106: calculate the user cumulative word frequency corresponding to each encoded character string.
The user cumulative word frequency can be calculated by simply summing the user word frequencies of the individual users, which gives the user cumulative word frequency of the word after collection and aggregation.
Alternatively, the user cumulative word frequency can be calculated by applying a weight correction to each user word frequency for a word before computing the word's user cumulative word frequency. The weight correction can be carried out after analyzing the individual user word frequencies for a word: for example, the user word frequencies for the word are first analyzed to find their distribution trend, and each value is then corrected according to the probability with which that frequency value occurs or according to how far it lies from the average range. A user cumulative word frequency calculated with this correction removes the accidental or malicious behavior of some users and is comparatively objective and accurate, which in turn safeguards the accuracy of the thesaurus.
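The two ways of computing the user cumulative word frequency can be contrasted in a short Python sketch. Plain summation follows the text directly. For the weight correction, the sketch substitutes one simple, concrete technique: capping each user's frequency at a multiple of the median. This is an assumption standing in for the distribution-based correction, which the specification does not spell out; `clip_factor` and the function name are hypothetical.

```python
from statistics import median


def cumulative_word_frequency(user_freqs, clip_factor=None):
    """Combine per-user word frequencies for one (word, encoded string) pair.

    With clip_factor=None the frequencies are simply summed.
    Otherwise each value is capped at clip_factor * median before summing,
    which limits the influence of accidental or malicious outliers.
    """
    if not user_freqs:
        return 0
    if clip_factor is None:
        return sum(user_freqs)
    cap = median(user_freqs) * clip_factor
    return sum(min(freq, cap) for freq in user_freqs)


per_user = [3, 4, 5, 1000]          # one user reports an implausibly high count
plain_total = cumulative_word_frequency(per_user)
corrected_total = cumulative_word_frequency(per_user, clip_factor=3)
```

For the sample data the plain sum is 1012, whereas the capped sum is 25.5 (median 4.5, cap 13.5), so a single outlier no longer dominates the result.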
Step 107: remove encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold. This step removes the special input habits of individual users that have no general significance, and so safeguards the objectivity and accuracy of the new encoded character strings obtained.
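Taken together, steps 104 to 107 amount to merging the records of all users and discarding rarely used encodings. A minimal Python sketch follows, assuming each user's records are held as a dictionary from (word, encoded string) to user word frequency and using simple summation for step 106; the names and the layout are illustrative only.

```python
from collections import Counter


def aggregate_and_filter(user_thesauri, threshold):
    """Merge per-user records and keep only sufficiently common encodings.

    user_thesauri: iterable of dicts mapping (word, encoded_string) -> user word frequency.
    Records with the same key are merged (steps 104-105), their frequencies are
    summed (step 106), and keys whose total is <= threshold are removed (step 107).
    """
    totals = Counter()
    for thesaurus in user_thesauri:
        for key, freq in thesaurus.items():
            totals[key] += freq
    return {key: total for key, total in totals.items() if total > threshold}


user_a = {("繁峙", "fansi"): 12, ("折了", "zhele"): 1}
user_b = {("繁峙", "fansi"): 9}
kept = aggregate_and_filter([user_a, user_b], threshold=5)
```

Only ("繁峙", "fansi") survives with a cumulative frequency of 21; the single use of "zhele" falls at or below the assumed threshold and is dropped.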
Step 108: count the number of times the words in the filtered word records appear in a preset database of Internet pages, to obtain the Internet word frequency.
The order of steps 105 to 108 above is not limiting, and there is no strict sequence among them. The order given is merely illustrative; those skilled in the art can adjust it as needed without affecting the core concept of the present invention.
Step 108 may be preceded by a weight assignment step: assigning weights to Internet pages, and storing the Internet pages whose weight value is greater than or equal to a preset threshold in the Internet page database. This step is optional; its purpose is to obtain a selected database of Internet pages and thereby safeguard the accuracy of new-word screening. Other methods may of course be used to build the preset Internet page database.
In the weight assignment step, an important case is assigning the weight value according to the time at which a web page was created and the type of the page. For Internet word frequency statistics, the age of a page matters a great deal and therefore has a large influence on the weight value: the further the page's time is from the point at which the word frequency is counted, the lower the weight value, and if the time difference exceeds a certain value, the page may be given a low weight value or even excluded from the word frequency statistics altogether. The type of page also has a large influence. Page type here generally refers to portals, forums, and certain other pages identified in advance. These pages receive higher weight values because they have more participants, are updated more quickly, and better reflect the latest trends in word frequency. Page type can be determined by means of a rule base that stores the URLs of certain pages, identifying the pages at those URLs as important for word frequency statistics; words appearing on those pages are counted preferentially, and the pages are given larger weight values.
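A possible shape for such a page weight is sketched below in Python. The linear decay with age, the cut-off of 365 days, the doubling for listed hosts, and the host names themselves are all assumptions chosen for illustration; the specification states only that older pages receive lower weights and that pages matching a rule base of URLs receive higher ones.

```python
from datetime import date, timedelta

# Hypothetical rule base of hosts considered important for word frequency statistics.
TRUSTED_HOSTS = {"forum.example.com", "portal.example.com"}


def page_weight(host, page_date, today, max_age_days=365):
    """Weight a page by its age and by whether its host is in the rule base."""
    age_days = (today - page_date).days
    if age_days > max_age_days:
        return 0.0                              # too old: excluded from the statistics
    weight = 1.0 - age_days / max_age_days      # assumed linear decay with age
    if host in TRUSTED_HOSTS:
        weight *= 2.0                           # assumed boost for listed page types
    return weight


today = date(2024, 6, 1)
recent_forum = page_weight("forum.example.com", today - timedelta(days=73), today)
stale_page = page_weight("forum.example.com", today - timedelta(days=400), today)
```

Pages whose weight meets the preset threshold would then be stored in the Internet page database used in step 108.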
In addition, the present invention can remove duplicate pages, pornographic pages, and spam pages by assigning them low weight values, which further safeguards the accuracy of new-word verification.
Once the Internet word frequency of a word has been obtained in step 108, the thesaurus can be set up to include the word, the word's Internet word frequency, and the word's user cumulative word frequencies. For example, the word "重重" has one Internet word frequency and two user cumulative word frequencies in the thesaurus, corresponding to "chongchong" and "zhongzhong" respectively. Using the Internet word frequency improves the accuracy of the word frequency, but because a word on the Internet does not reveal its encoded character string, the user cumulative word frequency is used to reflect users' input habits and raise the hit rate of the first-choice candidate word.
Step 108 may of course be omitted, in which case the thesaurus is set up to include the word, the word's original word frequency, and the word's user cumulative word frequencies.
Having one word correspond to two kinds of word frequency makes use more complicated, since the two kinds of word frequency data must be used together to achieve the best effect. For further simplification, the preferred embodiment shown in FIG. 1 may also include step 109, which merges the two kinds of word frequency data into a single kind.
Step 109: allocate the word's Internet word frequency among its two or more corresponding encoded character strings in proportion to the user cumulative word frequency of the new encoded character string and that of the original encoded character string. That is, each occurrence of the word on the Internet is regarded as corresponding to one of its two or more encoded character strings, and the Internet word frequency, which reflects the total word frequency of the word, is divided among those encoded character strings according to the cumulative word frequency with which users enter each one. This reflects users' input habits objectively and accurately and improves the accuracy of the first-choice candidate word during input.
Step 109 gives only one example of how the Internet word frequency may be allocated. In practice, when allocating word frequency after Internet verification, the word frequencies of the original and new encodings can be compared in many ways, for example linearly, non-linearly, or with smoothing adjustments, after which a ratio is calculated and the Internet word frequency is allocated among the word's two or more corresponding encoded character strings. These variants are not described in detail here.
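The linear, proportional allocation of step 109 can be written down in a few lines of Python. The function name and the even split used when no user frequency is available are assumptions made for this sketch; the frequency values are invented for illustration.

```python
def allocate_internet_frequency(internet_freq, user_cum_freqs):
    """Split a word's Internet word frequency across its encoded strings.

    user_cum_freqs: encoded_string -> user cumulative word frequency.
    Each encoded string receives a share proportional to its cumulative frequency.
    """
    if not user_cum_freqs:
        return {}
    total = sum(user_cum_freqs.values())
    if total == 0:
        # Assumption: with no user data, split the frequency evenly.
        share = internet_freq / len(user_cum_freqs)
        return {code: share for code in user_cum_freqs}
    return {code: internet_freq * freq / total for code, freq in user_cum_freqs.items()}


allocation = allocate_internet_frequency(8000, {"chongchong": 300, "zhongzhong": 100})
```

With these illustrative numbers "chongchong" receives 6000 and "zhongzhong" receives 2000, so the thesaurus needs to store only one word frequency per encoded character string.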
Step 1010: generate a new thesaurus from the filtered word records, or add the filtered word records to the original thesaurus, to obtain a new-word thesaurus or a new version of the full thesaurus. Each word record comprises the word, the corresponding new encoded character string, and the corresponding word frequency information.
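Adding the filtered word records to the original thesaurus might look like the following Python sketch. The nested-dictionary layout (word to encoded string to word frequency), the function name, and the frequency values are assumptions for illustration.

```python
def merge_into_thesaurus(original, word_records):
    """Return a new full thesaurus with the filtered word records added.

    original: word -> {encoded_string: word_frequency} (hypothetical layout).
    word_records: iterable of (word, new_encoded_string, word_frequency).
    The original thesaurus is left unmodified.
    """
    merged = {word: dict(codes) for word, codes in original.items()}
    for word, code, freq in word_records:
        merged.setdefault(word, {})[code] = freq
    return merged


original = {"繁峙": {"fanshi": 900}}
new_version = merge_into_thesaurus(original, [("繁峙", "fansi", 120)])
```

Copying the inner dictionaries keeps the original thesaurus intact, so both the previous version and the new full version remain available.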
The embodiment shown in FIG. 1 can be used to collect users' new encoded character strings nationwide and derive a new-word thesaurus or a new version of the full thesaurus suitable for most people, thereby improving the input experience of users in every region.
The embodiment shown in FIG. 1 can also be used as follows: users' new encoded character strings are still collected nationwide, but the collected information also includes each user's region, and users are divided into several regions; the filtering steps are carried out for each region; and a regional new-word thesaurus or a new version of the regional full thesaurus is generated for each region. In other words, the different pronunciations of people in each region can be counted separately, so that input method systems with different encoding versions can be provided for different regions, or users can set their own region in the input method system, which satisfies the encoding habits of users in each region in a more individualized way.
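The regional variant can be sketched in Python by grouping the collected records by region before filtering. The record layout, the region labels, and the function name are hypothetical, and simple summation is assumed for the cumulative frequency.

```python
from collections import defaultdict


def build_regional_thesauri(records, threshold):
    """Build one filtered new-word thesaurus per region.

    records: iterable of (region, word, encoded_string, user_word_frequency).
    Frequencies are summed per region, and entries whose regional cumulative
    frequency is <= threshold are removed.
    """
    per_region = defaultdict(lambda: defaultdict(int))
    for region, word, code, freq in records:
        per_region[region][(word, code)] += freq
    return {
        region: {key: total for key, total in entries.items() if total > threshold}
        for region, entries in per_region.items()
    }


records = [
    ("Shanxi", "繁峙", "fansi", 40),
    ("Shanxi", "繁峙", "fansi", 30),
    ("Beijing", "繁峙", "fansi", 1),
]
regional = build_regional_thesauri(records, threshold=5)
```

In this example "fansi" is kept for the Shanxi thesaurus only, which matches the idea that an encoding common in one region need not be offered to users elsewhere.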
The new-word thesaurus or new version of the full thesaurus obtained by the above steps can be used to update an input method.
For example, to update an ordinary input method: the input method system containing the system thesaurus is located in a first computing device, and the obtained new-word thesaurus or new version of the full thesaurus is located in a second computing device; an input method system that needs a thesaurus update connects to the second computing device through the first computing device to update its system thesaurus.
The second computing device that stores the new lexicon or new version of the full lexicon may exist on the network as a server and provide a lexicon update service to any client program that needs new-word information for an input method. It need not be a fixed server, however; it may also be a local computing device that provides the lexicon update service, through P2P (peer-to-peer) technology, to any client program on another terminal that needs new-word information for an input method.
In the update embodiments above, the update may be performed in any of the following ways: the system lexicon is updated at the same time as the input method system itself; or the system lexicon is updated online by server push; or the user initiates a request and the server returns data in response, with which the system lexicon is updated. Updating by removable storage or by version upgrade is also possible. In short, any data update method may be used; the present invention places no limitation on this, and those skilled in the art may choose according to need.
As another example, to update a network input method: the units of the input method system that receive user input and display the corresponding characters are located on a first computing device; the resulting new lexicon or new version of the full lexicon serves as the system lexicon of the input method system and is located on a second computing device; according to the information entered by the user, the input method system obtains the corresponding information from the system lexicon on the second computing device and displays the corresponding characters on the first computing device, completing the text input.
In this example, the new lexicon or new version of the full lexicon obtained according to the present invention is used directly as the system lexicon of the input method system, so the lexicon is used online and no update operation is needed. The input method system is divided into two parts: the receiving and display units are located on the first computing device and the lexicon information on the second computing device, which allows the input method to be used online. The code matching process required by the input method system may be placed on either computing device as needed.
Referring to FIG. 2, which is a flowchart of the steps of another method of the present invention for obtaining a new encoded character string for an input method word, the method includes the following steps:
Step 201: Extract the word selected by the user during input and the encoded character string entered by the user, and store them in a user lexicon;
Step 202: Collect the user lexicons of the individual users;
Step 203: Compare the collected user lexicons with the system lexicon of the input method, the system lexicon storing words and their corresponding encoded character strings;
Step 204: Determine the new encoded character string corresponding to a word according to a preset rule.
The preset rule may be: if the word selected by the user exists in the system lexicon, but the encoded character string entered by the user differs from the encoded character string stored for that word in the system lexicon, the encoded character string entered by the user is determined to be a new encoded character string for that word.
Alternatively, the preset rule may be: if the word selected by the user and the encoded character string entered by the user both exist in the existing lexicon, the cumulative user word frequency of that encoded character string is further compared with the system word frequency, the system word frequency being the word frequency information preset in the existing lexicon for the existing word; if the ratio of the cumulative user word frequency to the system word frequency is greater than or equal to a predetermined threshold, the encoded character string entered by the user is determined to be a new encoded character string for that word.
Those skilled in the art may also use the above preset rules in combination, or define their own rules as needed; the present invention places no limitation on this.
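One possible combination of the two preset rules is sketched below in Python. The lexicon layout (a dictionary mapping each word to its encoded character strings and system word frequencies), the function name and the default threshold are assumptions made for illustration only; the pinyin examples in the comments are likewise hypothetical.

```python
def find_new_code(word, user_code, user_cum_freq, system_lexicon,
                  ratio_threshold=0.5):
    """Return user_code if it qualifies as a new code for word, else None.

    system_lexicon: {word: {code: system_word_frequency}} (assumed layout).
    ratio_threshold: assumed value of the predetermined threshold.
    """
    codes = system_lexicon.get(word)
    if codes is None:
        # The word itself is unknown; neither rule applies.
        return None
    if user_code not in codes:
        # Rule 1: the word exists, but the user typed a different code,
        # e.g. "si" for a word stored under "shi".
        return user_code
    system_freq = codes[user_code]
    # Rule 2: word and code both exist; compare the cumulative user
    # word frequency with the preset system word frequency.
    if system_freq > 0 and user_cum_freq / system_freq >= ratio_threshold:
        return user_code
    return None
```

In this sketch rule 1 is tested first and rule 2 only when the code is already stored, which is one way of combining them; other orderings or additional rules are equally possible.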
The embodiment shown in FIG. 2 is conceptually similar to that shown in FIG. 1. The main difference is that the user lexicons of multiple users are collected first and then compared in a single pass, and the users' new encoded character strings are obtained from the comparison results. This approach reduces the number of comparison operations, lightens the load on the local input method system, and can be applied directly to existing input method systems; however, because the comparison is performed only after the words selected by a large number of users have been gathered, it increases the load on the server. Those skilled in the art may choose either approach as needed.
Preferably, the embodiment shown in FIG. 2 may further include a filtering step: the cumulative user word frequency of each encoded character string is calculated from the user lexicons, and encoded character strings whose cumulative user word frequency is less than or equal to a preset threshold are removed.
Preferably, the embodiment shown in FIG. 2 may further include a word frequency assignment step: count the number of times a word having a new encoded character string appears in a preset Internet page database to obtain its Internet word frequency; compare the cumulative user word frequency of the word's new encoded character string with that of its original encoded character string; and, according to the comparison result, distribute the Internet word frequency among the word's two or more corresponding encoded character strings.
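The word frequency assignment step can be sketched as follows. The description only states that the Internet word frequency is distributed "according to the comparison result"; the proportional split used here is an assumption, as are the function name and the data layout.

```python
def allocate_internet_frequency(internet_freq, cumulative_by_code):
    """Split a word's Internet word frequency among its codes.

    cumulative_by_code: {code: cumulative user word frequency}.
    Assumption: the split is proportional to the cumulative user
    word frequencies; the source does not fix the exact formula.
    """
    if not cumulative_by_code:
        return {}
    total = sum(cumulative_by_code.values())
    if total == 0:
        # No usage data at all: fall back to an even split.
        share = internet_freq / len(cumulative_by_code)
        return {code: share for code in cumulative_by_code}
    return {
        code: internet_freq * freq / total
        for code, freq in cumulative_by_code.items()
    }
```

For instance, if the original code of a word was used 30 times and its new code 10 times, an Internet word frequency of 1000 would be split 750 to 250 under this assumption.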
Referring to FIG. 3, which is a structural block diagram of an input method system of the present invention, the system includes an input interface unit 301, a display unit 302, and a system lexicon 303, and further includes:
a word extraction unit 304, connected to the input method system and configured to extract, during user input, the word selected by the user and the encoded character string entered by the user;
a word comparison unit 305, connected to the word extraction unit 304 and configured to compare the word selected by the user and the encoded character string entered by the user against the system lexicon, which stores words and their corresponding encoded character strings, and to determine the new encoded character string corresponding to the word according to a preset rule.
The preset rule may be: if the word selected by the user exists in the system lexicon, but the encoded character string entered by the user differs from the encoded character string stored for that word in the system lexicon, the encoded character string entered by the user is determined to be a new encoded character string for that word.
In other words, besides ordinary word input, the above input method system can also be used to extract a user's new encoded character strings. The input method system may be an ordinary input method system: for example, its input interface unit 301, display unit 302, and system lexicon 303 are located on the same computing device, and the system displays the corresponding characters locally by matching the code information entered by the user against a local query. The input method system may also be a network input method system: for example, its input interface unit 301 and display unit 302 are located on a first computing device and its system lexicon 303 on a second computing device, and, according to the information entered by the user, the system obtains the corresponding information from the second computing device and displays the corresponding characters on the first computing device.
So that the user's extracted new encoded character strings can be sent to a collection device, from which new encoded character strings of general validity are then derived, the input method system preferably further includes: a communication unit 306, configured to send word records having new encoded character strings in real time or at scheduled intervals, each word record including the word and its corresponding new encoded character string.
So that each user's new encoded character strings can be filtered by user word frequency to obtain objectively correct results, the input method system preferably further includes: a word frequency recording unit 307, connected to the input method system and configured to record the user word frequency during user input, the user word frequency being the frequency information of the encoded character strings entered by that user; and a user lexicon 308, configured to store the words selected by the user, the encoded character strings entered by the user, and the corresponding user word frequencies.
The main purpose of the input interface unit 301 in the above input method system is to let the user enter information and select words. It can also be used to switch between modes, for example: switching the input language (such as between Simplified and Traditional Chinese, or between Chinese and English), switching the input mode (such as between single-character, word, and sentence input), and switching the input state (such as between text, punctuation marks, and special symbols). The display unit 302 and the system lexicon 303 are well known to those skilled in the art and are not described in detail here.
The input method system shown in FIG. 3 may further include: an application determination unit 309, configured to determine the application in which the user is currently typing and to send the result to the word frequency recording unit 307; the word frequency recording unit 307 is then configured to count word frequency information separately during user input, according to the application in which the user is currently typing, and to apply the corresponding weight correction to form the user word frequency.
That is, the input method system can assign a weight according to the application in which the user is currently typing before counting the word frequency. For example, because the present invention can preferably obtain Internet word frequencies by statistics, [...] weight value; whereas words that the user enters in online community forums can be counted from the Internet and may therefore be given a relatively low weight value.
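The application-dependent weighting can be sketched as follows. The application labels and all weight values are hypothetical; the description states only that forum input may receive a relatively low weight.

```python
from collections import defaultdict

# Hypothetical weights: forum text can be counted from the Internet,
# so it is given a lower weight; every other application uses the default.
APP_WEIGHTS = {"web_forum": 0.5}
DEFAULT_WEIGHT = 1.0


def weighted_user_frequency(events):
    """Accumulate a weighted user word frequency.

    events: iterable of (application, word, code) tuples, one per
            word committed by the user (assumed layout).
    Returns {(word, code): weighted frequency}.
    """
    freq = defaultdict(float)
    for app, word, code in events:
        freq[(word, code)] += APP_WEIGHTS.get(app, DEFAULT_WEIGHT)
    return dict(freq)
```

Here two selections made in a forum and one made in another application add up to a weighted user word frequency of 2.0 rather than 3.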
Referring to FIG. 4, which is a structural block diagram of a lexicon generating device of the present invention, the device includes the following components:
a collection unit 401, configured to collect each user's word records having new encoded character strings, each word record including the word and its corresponding new encoded character string.
The lexicon generating device may be implemented as a server, and the collection may be carried out in any of the ways described above. A user's word records having new encoded character strings may be obtained by the input method and sent automatically to the collection unit 401; they may be set or compiled by the user and sent to the collection unit 401; or the users may gather their word records having new encoded character strings in a fixed network space, from which the collection unit 401 retrieves each user's records. That is, in this embodiment a user's word records having new encoded character strings are not necessarily obtained from the user's input behavior; they may also be set or compiled by the user.
a first filtering unit 402, configured to remove duplicate word records;
a lexicon generating unit 403, configured to generate a new lexicon from the filtered word records, or to add the filtered word records to the original lexicon, to obtain a new lexicon or a new version of the full lexicon.
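The cooperation of the first filtering unit 402 and the lexicon generating unit 403 can be sketched as follows, assuming that a word record is a (word, code) tuple and that a lexicon maps each word to a set of encoded character strings; both layouts and the function name are illustrative assumptions.

```python
def build_lexicon(records, original=None):
    """Remove duplicate records, then build a new or a full lexicon.

    records:  iterable of (word, code) tuples (assumed record layout).
    original: optional existing lexicon {word: set of codes}.
    Returns the new lexicon, or the merged full lexicon if an
    original lexicon is supplied.
    """
    unique = set(records)  # first filtering unit: drop duplicates
    new_lexicon = {}
    for word, code in unique:
        new_lexicon.setdefault(word, set()).add(code)
    if original is None:
        return new_lexicon
    # Merge into a copy so that the original lexicon is left untouched.
    full = {word: set(codes) for word, codes in original.items()}
    for word, codes in new_lexicon.items():
        full.setdefault(word, set()).update(codes)
    return full
```

Calling the function without `original` yields the stand-alone new lexicon; passing the original lexicon yields the new version of the full lexicon.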
To remove rarely used encoded character strings and obtain objectively correct results, the lexicon generating device is preferably further configured as follows: the collection unit 401 also collects the user word frequencies from users' input behavior, the user word frequency being the frequency information of the encoded character strings entered by that user; a cumulative word frequency calculation unit 404 calculates the cumulative user word frequency of each encoded character string; and a second filtering unit 405 removes encoded character strings whose cumulative user word frequency is less than or equal to a preset threshold. When user word frequencies are counted, a weight may preferably also be assigned according to the application in which the user is currently typing before the word frequency is counted.
To assign reasonably accurate word frequency information to the above new encoded character strings, the lexicon generating device preferably further includes:
an Internet page database generating unit 406, configured to assign weights to Internet pages and to store Internet pages whose weight value is greater than or equal to a preset threshold in the Internet page database;
a statistics unit 407, configured to count the number of times the words in the filtered word records appear in the preset Internet page database, to obtain the Internet word frequency;
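The two units above can be sketched as follows. How page weights are computed is not stated in this passage, so the sketch takes them as given; the function names and the plain substring count are illustrative assumptions.

```python
def build_page_database(pages, weight_threshold):
    """Keep only pages whose weight reaches the threshold.

    pages: iterable of (text, weight) tuples; the weights are assumed
           to have been assigned beforehand by some page-ranking scheme.
    """
    return [text for text, weight in pages if weight >= weight_threshold]


def internet_word_frequency(words, page_database):
    """Count non-overlapping occurrences of each word across all pages."""
    return {
        word: sum(page.count(word) for page in page_database)
        for word in words
    }
```

A production system would more likely use word segmentation and an inverted index than substring counting; the sketch only shows the data flow from weighted pages to Internet word frequencies.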
a word frequency allocation unit 408, configured to compare the cumulative user word frequency of the word's new encoded character string with that of its original encoded character string and, according to the comparison result, to distribute the word's Internet word frequency among its two or more corresponding encoded character strings. The cumulative user word frequency of the original encoded character string may be obtained by other means; alternatively, the collection unit 401 may also collect the word's original encoded character string and its corresponding user word frequency information, and the cumulative user word frequency is calculated from the individual users' word frequencies.
Referring to FIG. 5, the present invention further provides a lexicon generating device including the following components:
a collection unit 501, configured to collect each user's input behavior information, the input behavior information including the words selected by the user, the encoded character strings entered by the user, and the corresponding user word frequencies, the user word frequency being the frequency with which the user enters the word together with its corresponding encoded character string;
a cumulative word frequency calculation unit 502, configured to apply weight corrections to the individual user word frequencies of each pair of word and encoded character string, and to calculate the cumulative user word frequency of that pair;
a lexicon generating unit 503, configured to generate a lexicon that includes the words, the encoded character strings, and the corresponding word frequency information.
The lexicon generating device shown in FIG. 5 may further include: a comparison unit 504, configured to compare the generated lexicon with an existing lexicon, the existing lexicon storing words, encoded character strings, and their corresponding system word frequencies; and a new encoded character string determination unit 505, configured to determine the new encoded character string corresponding to a word according to a preset rule. The lexicon generating device can thereby obtain new encoded character strings.
Preferably, the lexicon generating device further includes: a filtering unit 506, configured to remove encoded character strings whose cumulative user word frequency is less than or equal to a preset threshold.
To assign reasonably accurate word frequency information to the above new encoded character strings, the lexicon generating device preferably further includes:
a statistics unit 507, configured to count the number of times a word having a new encoded character string appears in a preset Internet page database, to obtain the Internet word frequency. The Internet page database is formed by assigning weights to Internet pages and storing those pages whose weight value is greater than or equal to a preset threshold;
a word frequency allocation unit 508, configured to compare the cumulative user word frequency of the word's new encoded character string with that of its original encoded character string and, according to the comparison result, to distribute the word's Internet word frequency among its two or more corresponding encoded character strings.
Referring to FIG. 6, another lexicon generating device is shown, including the following components:
a collection unit 601, configured to collect each user's input behavior information, the input behavior information including the words selected by the user, the encoded character strings entered by the user, and the corresponding user word frequencies, the user word frequency being the frequency with which the user enters the word together with its corresponding encoded character string;
a cumulative word frequency calculation unit 602, configured to apply weight corrections to the individual user word frequencies of each pair of word and encoded character string, and to calculate the cumulative user word frequency of that pair;
a lexicon generating unit 603, configured to generate a lexicon that includes the words, the encoded character strings, and the corresponding word frequency information;
a comparison unit 604, configured to compare the generated lexicon with an existing lexicon, the existing lexicon storing words, encoded character strings, and their corresponding system word frequencies;
an expired word determination unit 605, configured to determine expired words, an expired word being a word that exists in the existing lexicon but not in the generated lexicon, or a word in the generated lexicon whose cumulative user word frequency meets a preset condition, for example being less than or equal to a predetermined threshold.
Once the device shown in FIG. 6 has identified the expired words, the existing lexicon can be pruned accordingly to keep it from growing ever larger, for example by filtering out and deleting the expired words from the existing lexicon, which reduces the lexicon size, improves lexicon utilization, and improves input efficiency.
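The determination of expired words and the subsequent pruning can be sketched as follows. The sketch assumes that the generated lexicon is reduced to a mapping from word to cumulative user word frequency and that only words present in the existing lexicon are examined, since those are the ones to be pruned; the function names and the threshold are hypothetical.

```python
def find_expired_words(generated, existing, threshold=0):
    """Return the words of the existing lexicon that count as expired.

    generated: {word: cumulative user word frequency} (assumed layout).
    existing:  existing lexicon, any mapping or iterable of words.
    A word is expired if it is missing from the generated lexicon or
    if its cumulative frequency is less than or equal to the threshold.
    """
    return {
        word for word in existing
        if word not in generated or generated[word] <= threshold
    }


def prune(existing, expired):
    """Return a copy of the existing lexicon without the expired words."""
    return {word: entry for word, entry in existing.items()
            if word not in expired}
```

Pruning returns a new mapping rather than modifying the existing lexicon in place, so the unpruned version remains available if the result needs to be reviewed first.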
Because space is limited, the method is described in more detail than the system; for anything not covered in the system description, refer to the corresponding parts of the method description above.
The method for obtaining a new encoded character string for an input method word, the input method system, and the lexicon generating device provided by the present invention have been described in detail above. Specific examples have been used to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to aid understanding of the method and its core idea. Those of ordinary skill in the art may, following the idea of the invention, vary both the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for obtaining a new encoded character string for an input method word, characterized by comprising:
extracting the word selected by a user during input and the encoded character string entered by the user;
comparing the word selected by the user and the encoded character string entered by the user against an existing lexicon, the existing lexicon storing existing words and their corresponding encoded character strings;
determining the new encoded character string corresponding to the word according to a preset rule.
2. The method according to claim 1, characterized by further comprising:
recording the word selected by the user and the encoded character string entered by the user in a user lexicon;
and, during user input, recording a user word frequency in the user lexicon, the user word frequency being the frequency with which the user enters the word together with its corresponding encoded character string.
3. The method according to claim 2, characterized by further comprising:
counting the word frequency information after applying a weight correction according to the application in which the user is currently typing, to obtain the user word frequency.
4. The method according to claim 2, characterized by further comprising:
collecting each user's word records having new encoded character strings, each record including the word, the corresponding new encoded character string, and the corresponding word frequency information;
removing duplicate word records.
5. The method according to claim 4, characterized by further comprising:
calculating a cumulative user word frequency;
removing encoded character strings whose cumulative user word frequency is less than or equal to a preset threshold.
6. The method according to claim 4 or 5, characterized by further comprising:
counting the number of times the words in the filtered word records appear in a preset Internet page database, to obtain an Internet word frequency.
7. The method according to claim 6, characterized by further comprising:
comparing the cumulative user word frequency of the word's new encoded character string with that of its original encoded character string and, according to the comparison result, distributing the word's Internet word frequency among its two or more corresponding encoded character strings.
8. The method according to claim 7, characterized by further comprising:
generating a new lexicon from the filtered word records, or adding the filtered word records to the original lexicon, to obtain a new lexicon or a new version of the full lexicon.
9. The method according to claim 8, characterized in that:
the collected information further includes the region in which each user is located, and the users are divided into several regions;
the filtering step is performed for each region;
a regional new lexicon or a new version of the regional full lexicon is generated for each region.
10. The method according to claim 6, characterized in that the preset Internet page database is obtained by the following steps:
assigning weights to Internet pages;
storing Internet pages whose weight value is greater than or equal to a preset threshold in the Internet page database.
11. The method according to claim 4, characterized in that the collecting comprises: an input method computing device sending the user's word records having new encoded character strings to a collection computing device in real time or at scheduled intervals.
12. A method for obtaining a new encoded character string for an input method word, characterized by comprising:
extracting the word selected by a user during input and the encoded character string entered by the user, and storing them in a user lexicon;
collecting the user lexicons of the individual users;
comparing the collected user lexicons with the existing lexicon of the input method, the system lexicon storing words and their corresponding encoded character strings;
determining the new encoded character string corresponding to a word according to a preset rule.
13. The method according to claim 12, characterized by further comprising:
the user lexicon further including a user word frequency, the user word frequency being the frequency with which the user enters the word together with its corresponding encoded character string;
calculating a cumulative user word frequency;
removing encoded character strings whose cumulative user word frequency is less than or equal to a preset threshold.
14. The method according to claim 13, characterized in that the preset rule is:
if the word selected by the user exists in the existing lexicon, but the encoded character string entered by the user differs from the encoded character string stored for that word in the existing lexicon, the encoded character string entered by the user is determined to be a new encoded character string for that word;
or, if the word selected by the user and the encoded character string entered by the user both exist in the existing lexicon, the cumulative user word frequency of that encoded character string is further compared with the system word frequency, the system word frequency being the word frequency information preset in the existing lexicon for the existing word, and if the ratio of the cumulative user word frequency to the system word frequency is greater than or equal to a predetermined threshold, the encoded character string entered by the user is determined to be a new encoded character string for that word.
15. The method according to claim 12 or 14, characterized by further comprising:
counting the number of times a word having a new encoded character string appears in a preset Internet page database, to obtain an Internet word frequency.
16. The method according to claim 15, characterized by further comprising:
comparing the user cumulative word frequency of the word's new encoded character string with the user cumulative word frequency of its original encoded character string and, according to the comparison result, allocating the word's Internet word frequency among the two or more encoded character strings corresponding to the word.
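Claim 16 leaves open how the comparison result drives the allocation. The Python sketch below assumes a proportional split, in which each encoded string receives a share of the Internet word frequency equal to its share of the user cumulative frequency. The proportional rule, the function name, and the numbers are assumptions, not something the claim specifies.

```python
def allocate_internet_freq(internet_freq, user_cum_freqs):
    """Split a word's Internet frequency among its encoded strings.

    user_cum_freqs: dict mapping code -> user cumulative frequency
    (hypothetical layout). The split is proportional; if every code has
    zero user frequency, the Internet frequency is split evenly.
    """
    if not user_cum_freqs:
        return {}
    total = sum(user_cum_freqs.values())
    if total == 0:
        share = internet_freq / len(user_cum_freqs)
        return {code: share for code in user_cum_freqs}
    return {code: internet_freq * f / total
            for code, f in user_cum_freqs.items()}

# Invented sample: original code "shishi", new code "sishi".
shares = allocate_internet_freq(1000, {"shishi": 300, "sishi": 100})
```

Here the original code would receive three quarters of the Internet word frequency and the new code one quarter.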
17. An input method system, comprising an input interface unit, a display unit, and a system lexicon, characterized by further comprising:
a word extraction unit, connected to the input method system, for extracting the word selected by the user during input and the encoded character string input by the user; and
a word comparison unit, connected to the word extraction unit, for comparing the word selected by the user and the encoded character string input by the user against the system lexicon, in which words and their corresponding encoded character strings are stored, and for determining, according to a preset rule, the new encoded character string corresponding to the word.
18. The input method system according to claim 17, characterized in that:
the input interface unit, the display unit, and the system lexicon of the input method system are located in the same computing device;
or the input interface unit and the display unit of the input method system are located in a first computing device and the system lexicon is located in a second computing device, the input method system obtaining, according to the information input by the user, the corresponding information from the second computing device and displaying the corresponding characters on the first computing device.
19. The input method system according to claim 17, characterized by further comprising:
a communication unit for sending, in real time or at scheduled intervals, word records having new encoded character strings, each word record including the word and its corresponding new encoded character string.
20. The input method system according to claim 17, characterized by further comprising:
a word frequency recording unit, connected to the input method system, for recording the user word frequency during user input, the user word frequency being frequency information on the user's input of the word together with its corresponding encoded character string; and a user lexicon for storing the word selected by the user, the encoded character string input by the user, and the corresponding user word frequency.
21. The input method system according to claim 17, characterized by further comprising:
an application determination unit for determining the application into which the user is currently inputting, and sending the determination result to the word frequency recording unit; and
the word frequency recording unit, connected to the input method system, for counting word frequency information during user input after applying a weight correction that depends on the application into which the user is currently inputting, to obtain the user word frequency.
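The application-dependent weighting of claim 21 might look like the following Python sketch. The class name, the application labels, and the weight values are hypothetical; the claim only states that a weight correction depending on the current application is applied before counting.

```python
from collections import defaultdict

class FrequencyRecorder:
    """Counts (word, code) inputs, weighted by the application in use."""

    def __init__(self, app_weights, default_weight=1.0):
        # app_weights: dict mapping application name -> weight (assumed values)
        self.app_weights = app_weights
        self.default_weight = default_weight
        self.freq = defaultdict(float)

    def record(self, word, code, app):
        """Add one weighted occurrence of (word, code) typed in `app`."""
        weight = self.app_weights.get(app, self.default_weight)
        self.freq[(word, code)] += weight

# Assumed weights: input in a word processor counts fully, chat counts half.
recorder = FrequencyRecorder({"word_processor": 1.0, "chat": 0.5})
recorder.record("事实", "sishi", "word_processor")
recorder.record("事实", "sishi", "chat")
recorder.record("事实", "sishi", "chat")
```

Weighting by application lets casual contexts, where slang and typos are more common, contribute less to the recorded user word frequency than formal ones.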
22. A lexicon generating device, characterized by comprising:
a word collection unit for collecting, from each user, word records having new encoded character strings, each word record including the word and its corresponding new encoded character string;
a first filtering unit for removing duplicate word records; and
a lexicon generating unit for generating a new lexicon from the filtered word records, or for adding the filtered word records to the original lexicon, to obtain a new-word lexicon or a new version of the full lexicon.
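The collect, deduplicate, and generate pipeline of claim 22 can be illustrated with a short Python sketch. The record layout as (word, code) pairs and the dict-of-sets lexicon are assumptions chosen for brevity.

```python
def build_lexicon(collected_records, existing_lexicon=None):
    """Deduplicate word records and build (or extend) a lexicon.

    collected_records: iterable of (word, code) pairs gathered from users.
    existing_lexicon: optional dict mapping word -> set of codes; when
    given, the result is a new full lexicon, otherwise a new-word lexicon.
    Both layouts are hypothetical.
    """
    # First filtering unit: drop duplicate records, preserving order.
    seen = set()
    unique = []
    for record in collected_records:
        if record not in seen:
            seen.add(record)
            unique.append(record)

    # Lexicon generating unit: copy the original so it is not mutated.
    lexicon = {w: set(codes) for w, codes in (existing_lexicon or {}).items()}
    for word, code in unique:
        lexicon.setdefault(word, set()).add(code)
    return lexicon

# Invented sample records from several users.
records = [("事实", "sishi"), ("事实", "sishi"), ("老师", "laosi")]
new_words = build_lexicon(records)
full = build_lexicon(records, {"事实": {"shishi"}})
```

Calling the function without an existing lexicon yields only the newly observed entries; passing one yields the merged full version.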
23. The device according to claim 22, characterized by further comprising:
a word frequency collection unit for collecting the user word frequency from user input behavior, the user word frequency being frequency information on the user's input of the word together with its corresponding encoded character string;
a cumulative word frequency calculation unit for calculating the user cumulative word frequency; and
a second filtering unit for removing encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold.
24. The device according to claim 22 or 23, characterized by further comprising:
a statistics unit for counting the number of times the words in the filtered word records appear in a preset Internet page database, to obtain an Internet word frequency.
25. The device according to claim 22, characterized by further comprising:
a word frequency allocation unit for comparing the user cumulative word frequency of the word's new encoded character string with the user cumulative word frequency of its original encoded character string and, according to the comparison result, allocating the word's Internet word frequency among the two or more encoded character strings corresponding to the word.
26. A lexicon generating device, characterized by comprising:
a collection unit for collecting input behavior information from each user, the input behavior information including the word selected by the user, the encoded character string input by the user, and the corresponding user word frequency, the user word frequency being frequency information on the user's input of the word together with its corresponding encoded character string;
a cumulative word frequency calculation unit for applying a weight correction to each user's word frequency for a word and encoded character string taken as a whole, and calculating the user cumulative word frequency of that word and encoded character string as a whole; and
a lexicon generating unit for generating a lexicon, the lexicon including words, encoded character strings, and their corresponding word frequency information.
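The weighted accumulation in claim 26 is sketched below in Python. The per-user weight table, the nested dict layout, and the sample figures are assumptions; the claim does not state what the weight correction is based on.

```python
def weighted_cumulative(user_freqs, user_weights, default_weight=1.0):
    """Accumulate weighted user word frequencies per (word, code) pair.

    user_freqs: dict mapping user id -> {(word, code): frequency}.
    user_weights: dict mapping user id -> weight (hypothetical; it could,
    for example, damp users with unusually heavy input volume).
    """
    totals = {}
    for user, freqs in user_freqs.items():
        weight = user_weights.get(user, default_weight)
        for pair, freq in freqs.items():
            totals[pair] = totals.get(pair, 0.0) + weight * freq
    return totals

# Invented sample: user u2 is down-weighted to one half.
user_freqs = {
    "u1": {("事实", "sishi"): 4},
    "u2": {("事实", "sishi"): 2, ("事实", "shishi"): 10},
}
cumulative = weighted_cumulative(user_freqs, {"u1": 1.0, "u2": 0.5})
```

Treating the word and its encoded string as one key means each variant code of the same word accumulates its own frequency.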
27. The device according to claim 26, characterized by further comprising:
a comparison unit for comparing the generated lexicon with an existing lexicon, the existing lexicon storing words, encoded character strings, and their corresponding system word frequencies; and
a determination unit for determining, according to a preset rule, the new encoded character string corresponding to a word.
28. The device according to claim 27, characterized by further comprising:
a filtering unit for removing encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold.
29. The device according to claim 27 or 28, characterized by further comprising:
a statistics unit for counting the number of times a word having a new encoded character string appears in a preset Internet page database, to obtain an Internet word frequency; and
a word frequency allocation unit for comparing the user cumulative word frequency of the word's new encoded character string with the user cumulative word frequency of its original encoded character string and, according to the comparison result, allocating the word's Internet word frequency among the two or more encoded character strings corresponding to the word.
30. The device according to claim 26, characterized by further comprising:
a comparison unit for comparing the generated lexicon with an existing lexicon, the existing lexicon storing words, encoded character strings, and their corresponding system word frequencies; and
an expired word determination unit for determining expired words, an expired word being a word that does not exist in the generated lexicon but exists in the existing lexicon, or a word in the generated lexicon whose user cumulative word frequency meets a preset condition.
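The two alternatives for identifying expired words in claim 30 can be sketched as follows in Python. The "preset condition" is assumed here to be a cumulative frequency below a minimum value; that choice, the data layout, and the sample words are illustrative assumptions only.

```python
def expired_words(generated, existing_words, min_freq):
    """Return the set of expired words.

    generated: dict mapping word -> user cumulative frequency in the
    newly generated lexicon (hypothetical layout).
    existing_words: set of words in the existing lexicon.
    min_freq: assumed form of the preset condition; words whose
    cumulative frequency is below it count as expired.
    """
    # Alternative 1: in the existing lexicon but absent from the new one.
    missing = {w for w in existing_words if w not in generated}
    # Alternative 2: present in the new lexicon but rarely used.
    rare = {w for w, freq in generated.items() if freq < min_freq}
    return missing | rare

# Invented sample lexicons.
generated = {"博客": 120, "寻呼机": 2}
existing_words = {"寻呼机", "电报", "博客"}
expired = expired_words(generated, existing_words, min_freq=5)
```

In this sample, one word expires because no user typed it at all and another because its cumulative frequency falls below the assumed minimum.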
PCT/CN2007/070518 2006-08-23 2007-08-20 Method for obtaining new encode character string, inputting method system and word base generation device WO2008028421A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200610111562.9 2006-08-23
CNB2006101115629A CN100424703C (en) 2006-08-23 2006-08-23 Method for obtaining newly encoded character string, input method system and word stock generation device

Publications (1)

Publication Number Publication Date
WO2008028421A1 (en) 2008-03-13

Family

ID=37778551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/070518 WO2008028421A1 (en) 2006-08-23 2007-08-20 Method for obtaining new encode character string, inputting method system and word base generation device

Country Status (2)

Country Link
CN (1) CN100424703C (en)
WO (1) WO2008028421A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109426356A (en) * 2017-09-01 2019-03-05 百度在线网络技术(北京)有限公司 Data inputting method and device

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101286118B (en) * 2007-04-10 2012-04-18 北京搜狗科技发展有限公司 Method for quick calling program instruction, system and an input method system
CN100483416C (en) * 2007-05-22 2009-04-29 北京搜狗科技发展有限公司 Character input method, input method system and method for updating word stock
CN100483417C (en) * 2007-05-25 2009-04-29 北京搜狗科技发展有限公司 Method for catching limit word information, optimizing output and input method system
CN101727271B (en) * 2008-10-22 2012-11-14 北京搜狗科技发展有限公司 Method and device for providing error correcting prompt and input method system
CN103064825B (en) * 2011-10-18 2016-03-02 阿里巴巴集团控股有限公司 Fuzzy phoneme is to foundation, method to set up and input method and device thereof and system
CN103870001B (en) * 2012-12-11 2018-07-10 百度国际科技(深圳)有限公司 A kind of method and electronic device for generating candidates of input method
CN103869998B (en) * 2012-12-11 2018-05-01 百度国际科技(深圳)有限公司 A kind of method and device being ranked up to candidate item caused by input method
CN103870000B (en) * 2012-12-11 2018-12-14 百度国际科技(深圳)有限公司 The method and device that candidate item caused by a kind of pair of input method is ranked up
CN105892710B (en) * 2015-01-20 2021-10-15 张龙哺 Chinese character input method and device based on text box
CN106249914A (en) * 2016-08-03 2016-12-21 太仓美宅姬娱乐传媒有限公司 A kind of character input method and system thereof
CN107247519B (en) * 2017-08-16 2020-09-29 北京搜狗科技发展有限公司 Input method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1494025A (en) * 2002-10-31 2004-05-05 英业达股份有限公司 Input method of Chinese character having classification thesaurus and its system
US20050251384A1 (en) * 2004-05-05 2005-11-10 Microsoft Corporation Word extraction method and system for use in word-breaking

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060048055A1 (en) * 2004-08-25 2006-03-02 Jun Wu Fault-tolerant romanized input method for non-roman characters

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109426356A (en) * 2017-09-01 2019-03-05 百度在线网络技术(北京)有限公司 Data inputting method and device
CN109426356B (en) * 2017-09-01 2022-07-15 百度在线网络技术(北京)有限公司 Information input method and device

Also Published As

Publication number Publication date
CN100424703C (en) 2008-10-08
CN1920827A (en) 2007-02-28

Similar Documents

Publication Publication Date Title
WO2008028421A1 (en) Method for obtaining new encode character string, inputting method system and word base generation device
CN108304375B (en) Information identification method and equipment, storage medium and terminal thereof
US10176803B2 (en) Updating population language models based on changes made by user clusters
JP6675463B2 (en) Bidirectional stochastic rewriting and selection of natural language
US7983902B2 (en) Domain dictionary creation by detection of new topic words using divergence value comparison
US8463598B2 (en) Word detection
WO2020253350A1 (en) Network content publication auditing method and apparatus, computer device and storage medium
JP6813591B2 (en) Modeling device, text search device, model creation method, text search method, and program
KR101498001B1 (en) Selecting high quality reviews for display
WO2008014702A1 (en) Method and system of extracting new words
CN101208689B (en) Method and apparatus for creating a language model and kana-kanji conversion
US10162812B2 (en) Natural language processing system to analyze mobile application feedback
WO2007143914A1 (en) Method, device and inputting system for creating word frequency database based on web information
WO2007114563A1 (en) System and method for providing recommended word of adjustment each user and computer readable recording medium recording program for implementing the method
CN106528532A (en) Text error correction method and device and terminal
CN108124477A (en) Segmenter is improved based on pseudo- data to handle natural language
WO2009026850A1 (en) Domain dictionary creation
CN111767393A (en) Text core content extraction method and device
JP6830226B2 (en) Paraphrase identification method, paraphrase identification device and paraphrase identification program
CN111488453B (en) Resource grading method, device, equipment and storage medium
JP5302614B2 (en) Facility related information search database formation method and facility related information search system
CN103064967B (en) A kind of method and apparatus for establishing user's binary crelation library
WO2010142422A1 (en) A method for inter-lingual electronic communication
JP2009163358A (en) Information processor, information processing method, program, and voice chat system
KR20020084302A (en) Apparatus of extract and transmission of image using the character message, its method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07800994

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 07800994

Country of ref document: EP

Kind code of ref document: A1