WO2008028421A1

WO2008028421A1 - Method for obtaining new encode character string, inputting method system and word base generation device

Info

Publication number: WO2008028421A1
Application number: PCT/CN2007/070518
Authority: WO
Inventors: Qi Guo; Zijian Tong; Lei Yang
Original assignee: Beijing Sogou Technology Development Co., Ltd.
Priority date: 2006-08-23
Filing date: 2007-08-20
Publication date: 2008-03-13
Also published as: CN100424703C; CN1920827A

Abstract

A method for obtaining a new encoded character string of an input method word comprises: extracting the word selected by the user during input and the encoded character string input by the user; comparing the selected word and the input encoded character string with an existing word base, wherein the existing word base stores existing words and their corresponding encoded character strings; and determining the new encoded character string corresponding to the word according to a preset rule.

Description

Method for obtaining a new encoded character string, input method system, and thesaurus generating device

The present application claims priority to Chinese patent application No. 200610111562.9, filed with the Chinese Patent Office on August 23, 2006 and entitled "Method for obtaining a new encoded character string, input method system, and thesaurus generating device", the entire contents of which are hereby incorporated by reference.

Technical field

The present invention relates to the field of input methods, and in particular, to a method for acquiring a new encoded character string of an input method word, an input method system, and a thesaurus generating device.

Background technique

Existing input method systems, such as Chinese, Japanese, and Korean input method systems, match the words the user wants against the encoded string the user types. In the system vocabulary of an existing input method, a corresponding encoded string is set for each word, and the user can obtain the desired word only by inputting the correct encoded string.

However, users have to learn the correct encoded strings, and it is difficult to ensure that every correspondence between encoded strings and words that a user has in mind is correct. To improve fault tolerance and accommodate some users' encoding habits, existing input method systems therefore offer a fuzzy-sound solution, for example z = zh, s = sh, in = ing, and so on. The fuzzy-sound solution addresses some of the problems caused by differences between northern and southern speech, but every region has its own dialect (and Chinese has a particularly large number of dialects), so the encoded strings that users type when entering words phonetically are more or less inaccurate, and the fuzzy-sound solution cannot cover every case. For example, for the word "folded", some users habitually input "shele" and others "zhele"; for "fall", "laxia" must be input in some cases and "luoxia" in others; and both "hupai" and "hepai" are used for the word "和牌". None of these can be handled by fuzzy sounds. The input method system vocabulary cannot know every dialect habit, so the user has to page through the candidate list repeatedly to find the desired word at a late position, which seriously slows input.

Therefore, how to learn as many of users' dialect habits as possible, and thereby improve the hit rate of the preferred word in the input method system, has become one of the technical problems that those skilled in the art urgently need to solve.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method and a device for obtaining a new encoded character string of an input method word, which can obtain the new encoded strings used by each user and aggregate them to generate a thesaurus, thereby accommodating users' habitual new encoded strings and improving the hit rate of the user's preferred word.

Another object of the present invention is to provide an input method system that can automatically acquire, in a simple, convenient, timely, and effective manner, the encoded character strings that a user uses for certain words, and obtain by comparison the new encoded strings used by each user.

Another object of the present invention is to provide a thesaurus generating apparatus capable of efficiently providing a relatively complete whole dictionary or a new thesaurus including a new encoded character string suitable for user input habits.

In order to solve the above technical problem, the present invention provides a method for obtaining a new encoded character string of an input method word, which comprises: extracting a word selected by a user during input and the encoded character string input by the user; comparing the selected word and the input encoded string with an existing thesaurus, wherein the existing thesaurus stores existing words and their corresponding encoded strings; and determining, according to a preset rule, the new encoded string corresponding to the word.

Preferably, the method further includes: recording the word selected by the user and the encoded string input by the user to a user vocabulary; and, during user input, recording a user word frequency to the user vocabulary, wherein the user word frequency is frequency information on how often the user inputs the word with its corresponding encoded string.

Preferably, the method further includes: obtaining the user word frequency by correcting the counted word frequency information with a weight that corresponds to the application in which the user is currently typing.

Preferably, the method further includes: collecting each user's word records that have a new encoded character string, each record including the word, the corresponding new encoded string, and corresponding word frequency information; and removing duplicate word records.

Preferably, the method further includes: calculating a cumulative word frequency of the user; and removing the encoded character string whose user cumulative word frequency is less than or equal to the preset threshold.

Preferably, the method further includes: counting the number of occurrences of the words in the filtered word record in the preset Internet page database, and obtaining the Internet word frequency.

Preferably, the method further includes: comparing the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string, and, according to the comparison result, allocating the Internet word frequency among the two or more encoded strings corresponding to the word.

Preferably, the method further includes: generating a new thesaurus from the filtered word records, or adding the filtered word records to the original thesaurus to obtain a new version of the full thesaurus. The collected information may further include area information of the user, in which case the users are divided into several areas, the filtering steps are performed for each area, and a new regional thesaurus or a new version of the regional full thesaurus is generated for each area.

Preferably, the preset Internet page database is obtained by the following steps: weighting the Internet page; and storing the Internet page whose weight value is greater than or equal to the preset threshold to the Internet page database.

The collecting is performed as follows: the input method computing device sends the user's word records that have a new encoded string to the collection computing device in real time or at scheduled times.

The present invention also provides a method for obtaining a new encoded character string of an input method word, comprising: extracting a word selected by a user during input and the encoded string input by the user, and storing them in a user vocabulary; collecting the user vocabulary of each user; comparing the collected user vocabularies with the system vocabulary of the input method, wherein the system vocabulary stores words and their corresponding encoded strings; and determining, according to a preset rule, the new encoded string corresponding to the word.

Preferably, the user vocabulary further includes a user word frequency, wherein the user word frequency is frequency information on how often the user inputs the word with the corresponding encoded character string, and the method further includes: calculating a user cumulative word frequency; and removing encoded strings whose user cumulative word frequency is less than or equal to a preset threshold.

The preset rule is:

If the word selected by the user exists in the existing thesaurus, but the encoded string input by the user differs from the encoded string stored for that word in the existing thesaurus, the encoded string input by the user is determined to be a new encoded string corresponding to the word. The user cumulative word frequency of that encoded string may further be compared with the system word frequency of the word, the system word frequency being the word frequency information preset for the existing word in the existing thesaurus; if the ratio of the user cumulative word frequency to the system word frequency is greater than or equal to a predetermined threshold, the encoded string input by the user is determined to be a new encoded string corresponding to the word.
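As an illustration only, the preset rule above can be sketched in Python. The dictionary layout, the function name `is_new_encoded_string`, and the default `ratio_threshold` of 0.1 are assumptions made for this sketch; the specification fixes neither a threshold value nor a data structure.

```python
def is_new_encoded_string(word, typed_code, user_cumulative_freq,
                          existing_thesaurus, ratio_threshold=0.1):
    """Apply the preset rule to one (word, typed_code) pair.

    existing_thesaurus maps word -> {"codes": set of str, "system_freq": number}.
    This layout and the default threshold are assumptions of this sketch.
    """
    entry = existing_thesaurus.get(word)
    if entry is None:
        # The rule only covers words already present in the existing thesaurus.
        return False
    if typed_code in entry["codes"]:
        # The typed string is already stored for the word, so it is not new.
        return False
    # Ratio test, written as a product so that a system word frequency of 0
    # cannot cause a division by zero.
    return (user_cumulative_freq > 0
            and user_cumulative_freq >= ratio_threshold * entry["system_freq"])
```

Writing the ratio test as a product is a design choice of the sketch; it behaves the same as the ratio comparison whenever the system word frequency is positive.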

Preferably, the method further includes: counting the number of occurrences of the word with the newly encoded character string in the preset internet page database, and obtaining the internet word frequency.

Preferably, the method further includes: comparing the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string, and, according to the comparison result, allocating the Internet word frequency among the two or more encoded strings corresponding to the word.

The invention also provides an input method system comprising an input interface unit, a display unit, and a system vocabulary, and further comprising: a word extraction unit, connected to the input method system, for extracting the word selected by the user during input and the encoded string input by the user; and a word matching unit, connected to the word extraction unit, for comparing the word selected by the user and the encoded string input by the user with the system vocabulary, which stores words and their corresponding encoded strings, and for determining, according to a preset rule, the new encoded string corresponding to the word.

Preferably, the input interface unit, the display unit, and the system vocabulary of the input method system are located in the same computing device; or the input interface unit and the display unit are located in a first computing device and the system vocabulary is located in a second computing device, in which case the input method system obtains the corresponding information from the second computing device according to the information input by the user and displays the corresponding characters on the first computing device.

Preferably, the input method system further includes: a communication unit, configured to send word records that have a new encoded string in real time or at scheduled times, each word record including the word and its corresponding new encoded string.

Preferably, the input method system further includes: a word frequency recording unit, connected to the input method system and configured to record a user word frequency during user input, wherein the user word frequency is frequency information on how often the user inputs the word with its corresponding encoded string; and a user vocabulary, used to store the words selected by the user, the encoded strings input by the user, and the corresponding user word frequencies.

Preferably, the input method system further includes: an application determining unit, configured to determine the application in which the user is currently typing and to send the determination result to the word frequency recording unit; the word frequency recording unit, connected to the input method system, is configured to obtain the user word frequency during user input by correcting the counted word frequency information with the weight corresponding to that application.

The present invention also provides a thesaurus generating apparatus, comprising: a word collecting unit, configured to collect a word record of each user having a new encoded character string, the word record including the word and its corresponding new code a string; a first filtering unit, configured to remove duplicate word records; a thesaurus generating unit, configured to generate a new thesaurus according to the filtered word records or add the filtered word records to the original thesaurus, Get a new thesaurus or a new version of the thesaurus.

Preferably, the device further includes: a word frequency collecting unit, configured to collect the user word frequency from user input behavior, where the user word frequency is frequency information on how often the user inputs the word with the corresponding encoded character string; a cumulative word frequency calculation unit, configured to calculate the user cumulative word frequency; and a second filtering unit, configured to remove encoded character strings whose user cumulative word frequency is less than or equal to the preset threshold.

Preferably, the device further includes: a statistical unit, configured to count the number of occurrences of the words in the filtered word record in the preset Internet page database, and obtain the Internet word frequency.

Preferably, the device further includes: a word frequency allocation unit, configured to compare the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string and, according to the comparison result, to allocate the Internet word frequency among the two or more encoded strings corresponding to the word.

The invention also provides a thesaurus generating device, comprising:

a collecting unit, configured to collect input behavior information of each user, where the input behavior information includes a word selected by the user, a coded string input by the user, and a corresponding user word frequency, where the user word frequency is the user inputting the word and corresponding The frequency information of the encoded string;

a cumulative word frequency calculation unit, configured to apply weight correction to each user's word frequency for a given word and encoded string, and to calculate the overall user cumulative word frequency for that word and encoded string;

a thesaurus generating unit, configured to generate a thesaurus that includes words, encoded strings, and corresponding word frequency information.

Preferably, the device further includes: a comparison unit, configured to compare the generated thesaurus with the existing thesaurus, wherein the existing thesaurus stores words, encoded strings, and their corresponding system word frequencies; and a determining unit, configured to determine the new encoded string corresponding to a word according to a preset rule.

Preferably, the device further includes: a filtering unit, configured to remove encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold; a statistical unit, configured to count the number of occurrences of a word that has a new encoded string in a preset Internet page database and so obtain an Internet word frequency; and a word frequency allocation unit, configured to compare the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string and, according to the comparison result, to allocate the Internet word frequency among the two or more encoded strings corresponding to the word.

Or, preferably, the device further includes: a comparison unit, configured to compare the generated thesaurus with the existing thesaurus, wherein the existing thesaurus stores words, encoded strings, and their corresponding system word frequencies; and a determining unit, configured to determine expired words, an expired word being a word that does not exist in the generated thesaurus but exists in the existing thesaurus, or a word in the generated thesaurus whose user cumulative word frequency meets a preset condition.
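The two definitions of an expired word can be illustrated with a short Python sketch. The text does not say what the preset condition is; the sketch assumes, purely for illustration, that it means a user cumulative word frequency below a hypothetical `min_cumulative_freq`. The function name and data layout are likewise assumptions.

```python
def find_expired_words(generated, existing_words, min_cumulative_freq=1):
    """Return the set of expired words.

    generated: dict mapping word -> user cumulative word frequency
               (the generated thesaurus, reduced to what this sketch needs).
    existing_words: set of words stored in the existing thesaurus.
    min_cumulative_freq: hypothetical stand-in for the "preset condition".
    """
    # Case 1: present in the existing thesaurus but absent from the generated one.
    missing = {word for word in existing_words if word not in generated}
    # Case 2: present in the generated thesaurus but (by assumption) rarely used.
    rarely_used = {word for word, freq in generated.items()
                   if freq < min_cumulative_freq}
    return missing | rarely_used
```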

Compared with the prior art, the present invention has the following advantages:

First, the present invention proposes a distributed architecture that includes multiple clients and a collector. On the user side, the words and encoded strings input by the user are extracted and compared with the existing thesaurus to learn new encoded strings that fit the user's habits; the new encoded strings of each user and their corresponding words are then collected and summarized, and after analysis and filtering, new encoded strings of general significance are obtained. The invention thus approaches the problem from the perspective of user input, and can learn promptly and comprehensively the new encoded strings that users employ during input, including new encoded strings that reflect users' dialect habits as well as unanticipated but frequently used new encoded strings, which improves the accuracy of the preferred word.

Secondly, the present invention looks up the words that have new encoded strings in a selected Internet page database and counts their occurrences to obtain the Internet word frequency. The Internet word frequency is then corrected according to how the user word frequency is distributed between the new and old encoded strings of the word and assigned to those strings, which yields well-founded word frequency results and prevents the habits of some users from degrading the input efficiency and input experience of other users.

Finally, the present invention can also collect and count only the new encoded strings of users in a certain area to obtain the language habits or encoding habits of users in that area, so that input method systems with different pronunciation or encoding versions can be provided for each area, or the user can set the area in which he or she is located in the input method system.

Description of the drawings

FIG. 1 is a flow chart showing the steps of a preferred embodiment of the method for obtaining a new encoded character string of an input method word;

FIG. 2 is a flow chart showing the steps of another method for obtaining a new encoded character string of an input method word;

FIG. 3 is a structural block diagram of the input method system;

FIG. 4 is a structural block diagram of the thesaurus generating apparatus;

FIG. 5 is a structural block diagram of a thesaurus generating apparatus for determining a new encoded character string;

FIG. 6 is a structural block diagram of a thesaurus generating apparatus for determining expired words.

Detailed description

The present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.

Referring to FIG. 1, a flow chart of the steps of a preferred embodiment of the method for obtaining a new encoded character string of an input method word according to the present invention is shown. The method includes the following steps:

Step 101: In the user input process, extract the word selected by the user and the code string input by the user.

Step 101 can be completed by the input method system, which can extract the words selected by the user and the encoded strings input by the user in any feasible manner during user input. The extracted information can be passed directly to step 102, or stored in the user vocabulary so that the user vocabulary is compared with the system vocabulary after a certain time interval.

For a language whose text must be input by encoding, the user inputs an encoded string and selects the desired word among the candidate words to complete the input. Step 101 records the word that the user selects and the encoded string that the user inputs. The encoded string may be a pinyin code or a glyph-based code; that is, the present invention can be applied to various input methods, and is of course preferably applied to Chinese input methods that rely on encoded input.

The words selected by the user will include some words for which the user habitually types a phonetic code in his or her dialect. For example, for "folded" the user enters the encoded string he or she considers correct, "zhele", but the encoded string stored for that word in the input method's original lexicon is "shele". The word therefore cannot be shown directly among the first candidates, and the user has to page through the candidates to obtain the desired word. There are many such words, for example "diaotou" versus "tiaotou" for "turning around" and "niaoniao" versus "niaosui" for "urine", and the cases are too numerous to enumerate. With this invention, such new encoded strings can be found as early as possible, thereby improving the accuracy of the preferred word during user input.

Furthermore, using the manual word-creation function provided by an input method (for example, the Microsoft Pinyin input method or a double-pinyin input method), the user can create new encoded strings for words that do not have them in the original thesaurus, so that the desired word can be selected during input. For example, "繁峙" is a place name in Shanxi. In the input method its encoded string is "fanshi", and the character "峙" generally corresponds to two encoded strings, "shi" and "zhi". Local people, however, are generally accustomed to typing "fansi" for "繁峙", and existing input methods generally do not associate "峙" with the code "si". The user can therefore use the manual word-creation function to associate "峙" with "si", or "繁峙" with "fansi". The present invention can also pick out such user-created encoded strings from the words selected and the encoded strings input by the user.

Step 102: Compare the word selected by the user and the encoded string input by the user with the existing thesaurus. In this specification the system vocabulary stands for the existing thesaurus, since the existing system vocabulary stores the existing words and their corresponding encoded strings.

Step 103: If the word selected by the user exists in the system vocabulary, but the encoded string input by the user is different from the encoded string corresponding to the word stored in the system vocabulary, determining that the encoded string input by the user is The new encoded string corresponding to the word.
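Steps 101-103 can be sketched in Python as follows. This is a minimal illustration rather than the implementation described by the specification: the system vocabulary is modeled as a dictionary from a word to the set of encoded strings stored for it, and the function name `find_new_encoded_string` is hypothetical. The sample entry reuses the place-name example above, stored as "fanshi" and typed as "fansi".

```python
# Assumed layout of the system vocabulary: word -> set of stored encoded strings.
SYSTEM_VOCABULARY = {
    "繁峙": {"fanshi"},
}


def find_new_encoded_string(word, typed_code, system_vocabulary):
    """Return typed_code if step 103 classifies it as new for word, else None."""
    known_codes = system_vocabulary.get(word)
    if known_codes is None:
        # Step 103 only applies to words that exist in the system vocabulary.
        return None
    if typed_code in known_codes:
        # The user typed a string that is already stored for this word.
        return None
    return typed_code
```

For instance, `find_new_encoded_string("繁峙", "fansi", SYSTEM_VOCABULARY)` returns `"fansi"`, while passing `"fanshi"` returns `None`.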

Through steps 101-103, the user's coding habits can be obtained automatically and simply. Then, after the new encoded strings of multiple users and their corresponding words have been collected in some way and passed through filtering steps such as removing duplicate word records, new encoded strings of general significance can be obtained.

The collecting may be performed as follows: the input method user's computing device sends the user's new encoded strings and their corresponding words to the word collection computing device in real time or at scheduled times, that is, the input method computing device has an automatic sending module. Preferably, the collection computing device takes the form of a server.

Alternatively, the input method user may send the new encoded strings and their corresponding words to the collecting end periodically or irregularly, that is, the sending is initiated manually by the user; for example, each user sends his or her own new encoded strings and their corresponding words to a unified email address or a unified server for collection. Where a user vocabulary exists, the vocabulary storing the user's personal words can be sent to the collection computing device in real time or at scheduled times; for example, the vocabularies can be collected when each user backs up his or her vocabulary to the server periodically or irregularly.

Furthermore, for a network input method (where only the input interface and display interface are provided to the user, and the whole input process is completed through a connection to a server), collecting users' new encoded strings and their corresponding words is simpler: the input method system used by the user is itself a server that serves multiple users, so the input behavior information of each user can be collected during use.

In fact, any approach that enables the information to be collected is feasible for the present invention; the approaches are not enumerated one by one here.

To achieve the best results, FIG. 1 shows a preferred embodiment built on the above steps. In the preferred embodiment shown in FIG. 1, step 101 further includes: recording the user word frequency to the user vocabulary during user input. The user vocabulary includes a plurality of word records, each including the word, the corresponding new encoded string, and the corresponding user word frequency. Preferably, the user word frequency in step 101 is obtained by correcting the counted word frequency information with a weight that corresponds to the application in which the user is currently typing.
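One way to picture the application-dependent weighting is the Python sketch below. The application names and weight values are invented for illustration, since the specification gives none, and the class `UserVocabulary` with its `record` method is a hypothetical structure rather than the described user vocabulary format.

```python
from collections import defaultdict

# Hypothetical per-application weights; the specification gives no values.
APPLICATION_WEIGHTS = {"word_processor": 1.0, "chat": 0.5}
DEFAULT_WEIGHT = 1.0


class UserVocabulary:
    """Stores a weighted user word frequency per (word, encoded string) pair."""

    def __init__(self):
        self.freq = defaultdict(float)

    def record(self, word, code, application):
        # Each selection adds the weight of the application the user is typing
        # in, instead of a flat count of 1.
        weight = APPLICATION_WEIGHTS.get(application, DEFAULT_WEIGHT)
        self.freq[(word, code)] += weight
```

With these assumed weights, two selections made in a chat program and one made in a word processor give a user word frequency of 2.0 for that word and encoded string.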

In the preferred embodiment shown in FIG. 1, the method further includes:

Step 104: Collect a word record of each user with a new code string, where the record includes the word, a corresponding new code string, and a corresponding user word frequency;

Step 105: Remove duplicate word records.

Step 106: Calculate a cumulative word frequency of the user corresponding to the encoded string;

The user cumulative word frequency can be calculated by simply adding up each user's user word frequency, which gives the user cumulative word frequency of the collected and summarized words.
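Steps 104-106 with simple summation can be sketched in Python as follows. The record layout `(user_id, word, code, user_freq)` is an assumption, and so is the reading of "duplicate word records" as records that are identical in every field, for example because the same user submitted them twice.

```python
from collections import defaultdict


def cumulative_word_frequency(records):
    """Sum user word frequencies per (word, encoded string) pair.

    records: iterable of (user_id, word, code, user_freq) tuples (assumed layout).
    """
    # Step 105: drop verbatim duplicates (assumed meaning of "duplicate records").
    unique_records = set(records)
    # Step 106: add up every user's frequency for each (word, code) pair.
    totals = defaultdict(int)
    for _user_id, word, code, user_freq in unique_records:
        totals[(word, code)] += user_freq
    return dict(totals)
```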

Alternatively, the calculation may apply weight correction to each user's word frequency for a word before computing the user cumulative word frequency of that word. The weight correction can be derived by analyzing the word frequencies of the individual users for a given word: for example, the word frequencies of all users for the word are first analyzed to find the distribution trend, and individual values are then corrected according to the probability of a given word frequency value occurring or how far the value deviates from the average range. A cumulative word frequency calculated with such correction can discount the accidental or malicious behavior of some users and yield an objective and accurate user cumulative word frequency, thereby ensuring the accuracy of the thesaurus.
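The specification does not name a particular correction formula. As one possible stand-in, the Python sketch below caps each user's contribution at a multiple of the median user word frequency, so that a single extreme value cannot dominate the total. The median-based cap, the `cap_factor` parameter, and the function name are all assumptions of this sketch, not the method of the specification.

```python
from statistics import median


def corrected_cumulative_frequency(user_freqs, cap_factor=3.0):
    """Sum per-user frequencies for one (word, code) pair after capping outliers.

    user_freqs: list of user word frequencies, one per user.
    cap_factor: hypothetical parameter; values above cap_factor * median are
                reduced to that cap before summation.
    """
    if not user_freqs:
        return 0.0
    cap = cap_factor * median(user_freqs)
    return float(sum(min(freq, cap) for freq in user_freqs))
```

With the frequencies 2, 3, 4 and 1000, the median is 3.5 and the cap is 10.5, so the corrected total is 19.5 rather than 1009.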

Step 107: Remove the encoded character string whose user cumulative word frequency is less than or equal to the preset threshold. This step can remove the special input habits of some individual users who do not have universal significance, and can guarantee the objectivity and accuracy of the new encoded string obtained.
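Step 107 amounts to a threshold filter, sketched below in Python. The dictionary layout keyed by `(word, code)` and the function name are assumptions; the threshold value is left to the caller because the specification does not give one.

```python
def filter_by_cumulative_frequency(cumulative, threshold):
    """Keep only entries whose user cumulative word frequency exceeds threshold.

    cumulative: dict mapping (word, code) -> user cumulative word frequency.
    Entries less than or equal to the threshold are removed, as in step 107.
    """
    return {key: freq for key, freq in cumulative.items() if freq > threshold}
```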

Step 108: Count the number of times the words in the filtered word record appear in the preset Internet page database, and obtain the Internet word frequency.
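As a rough illustration of step 108, the Python sketch below treats the preset Internet page database as a plain list of page texts and counts non-overlapping occurrences of each word. A real page database would need indexing and text extraction that are outside the scope of this sketch; the function name is hypothetical.

```python
def internet_word_frequency(words, page_texts):
    """Count how often each word occurs across all page texts.

    words: iterable of words taken from the filtered word records.
    page_texts: list of str, standing in for the preset Internet page database.
    """
    return {
        word: sum(text.count(word) for text in page_texts)
        for word in words
    }
```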

The order of steps 105-108 is not limited, and there is no strict sequence among them. The order given above is merely illustrative; those skilled in the art can adjust it as needed without affecting the core concept of the invention.

Step 108 may further include a weight assignment step: assigning a weight to each Internet page, and storing the Internet pages whose weight value is greater than or equal to the preset threshold in the Internet page database. This step is optional; its purpose is to obtain a database of selected Internet pages and so ensure the accuracy of new-word screening. Of course, other methods can also be used to form the preset Internet page database.

In the weighting step, an important case is to assign the weight value according to the time at which the web page was created and the type of the web page. For Internet word frequency statistics, the age of a web page matters a great deal, so it has the larger influence on the weight value: the further the page's time is from the time of the word frequency statistics, the lower the weight value. If the time difference exceeds a certain value, the page can be given a lower weight value or even excluded from the word frequency statistics. The type of web page also has a great influence on the word frequency statistics. Web page type here generally refers to portal websites, forums, or certain other designated web pages. These pages receive higher weight values because they have more participants, their information is updated faster, and they better reflect the latest trends in word frequency. To determine the web page type, a rule base can be set up that stores the URLs of certain web pages; pages at those URLs are treated as more important for word frequency statistics, the words appearing on them are counted preferentially, and the pages are given greater weight values.

In addition, the present invention can further exclude duplicate web pages, pornographic web pages, and spam web pages by giving them lower weight values, thereby further ensuring the accuracy of new-word verification.
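The page weighting can be illustrated with the Python sketch below. Every number in it is an assumption: the linear decay over `max_age_days`, the factor of 2.0 for pages matched by the rule base, and the weight of 0.0 for flagged pages (duplicate, pornographic, or spam) are chosen only to make the idea concrete. The URL prefixes are placeholders for a rule base.

```python
from datetime import date

# Hypothetical rule base of URL prefixes for portals and forums.
PREFERRED_URL_PREFIXES = ("https://portal.example/", "https://forum.example/")


def page_weight(url, created, stats_date, max_age_days=365, flagged=False):
    """Weight a page by age and type; all constants are illustrative."""
    if flagged:
        # Duplicate, pornographic, or spam pages: lowest weight in this sketch.
        return 0.0
    age_days = max(0, (stats_date - created).days)
    if age_days > max_age_days:
        # Too old: excluded from the word frequency statistics.
        return 0.0
    recency = 1.0 - age_days / max_age_days  # newer pages score closer to 1.0
    type_factor = 2.0 if url.startswith(PREFERRED_URL_PREFIXES) else 1.0
    return recency * type_factor


def build_page_database(pages, stats_date, weight_threshold):
    """Keep pages whose weight is greater than or equal to the threshold.

    pages: iterable of dicts with keys "url", "created", and optional "flagged".
    """
    return [
        page for page in pages
        if page_weight(page["url"], page["created"], stats_date,
                       flagged=page.get("flagged", False)) >= weight_threshold
    ]
```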

After obtaining the Internet word frequency of the word through step 108, the vocabulary can be set to include the word, the corresponding Internet word frequency of the word, and the corresponding user cumulative word frequency of the word. For example, the word "heavy" has an Internet word frequency and two user cumulative word frequencies in the thesaurus, corresponding to "chongchong""zhongzhong" _0. Internet word frequency can improve the accuracy of word frequency, but because words cannot be found on the Internet. Reflecting the encoded string, the user can accumulate the word frequency to reflect the user's input habits and improve the hit rate of the preferred word.

Of course, step 108 may not be needed, and the vocabulary may be included to include the word, the original word frequency of the word, and the corresponding cumulative word frequency of the word.

The above-mentioned one word corresponds to two word frequency. The use process is complicated, and two types of word frequency data are needed to achieve the best effect. For further simplification, the preferred embodiment shown in FIG. 1 may further include step 109, the above two types. The word frequency data is adjusted to a word frequency data.

Step 109: Allocate two or more corresponding encoded character strings of the Internet word frequency to the word according to the ratio of the user cumulative word frequency of the new encoded character string of the word to the cumulative word frequency of the original encoded character string. That is to say, the words appearing in the Internet correspond to two or more corresponding encoded character strings, and according to the difference of the cumulative word frequency of the input character string input by the user, the Internet word frequency reflecting the total word frequency of the word is allocated to the two words of the word. One or more corresponding encoding strings, thereby objectively and accurately reflecting the user's input habits, and improving the accuracy of the preferred words in the user input process.

Of course, step 109 gives only one example of allocating the Internet word frequency. In practical applications, there are many ways to compare the word frequency of the original code with that of the new code when allocating word frequency after Internet verification, for example linear, non-linear, or smoothed adjustment: a ratio is calculated, and the Internet word frequency is then allocated to the two or more corresponding encoded strings of the word according to that ratio. These are not detailed here.
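The simplest (linear) case of step 109 can be sketched as follows. This is only an illustrative sketch: the function name allocate_internet_frequency, the sample frequencies, and the even split used when no user data is available are assumptions and are not part of the described method.

```python
def allocate_internet_frequency(internet_freq, user_cumulative_freqs):
    """Split one Internet word frequency across a word's encoded strings,
    in proportion to each string's user cumulative word frequency."""
    if not user_cumulative_freqs:
        return {}
    total = sum(user_cumulative_freqs.values())
    if total == 0:
        # Assumption: with no user evidence, split evenly.
        share = internet_freq / len(user_cumulative_freqs)
        return {code: share for code in user_cumulative_freqs}
    return {
        code: internet_freq * freq / total
        for code, freq in user_cumulative_freqs.items()
    }


# Hypothetical numbers: the word occurs 1000 times on the Internet, and
# users typed "chongchong" 300 times and "zhongzhong" 100 times.
allocated = allocate_internet_frequency(
    1000, {"chongchong": 300, "zhongzhong": 100}
)
```

A non-linear or smoothed variant would replace only the proportional expression in the final return statement.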

Step 110: Generate a new thesaurus according to the filtered word records or add the filtered word records to the original thesaurus to obtain a new thesaurus or a new version of the full thesaurus. The word record includes the word, the corresponding new encoded string, and corresponding word frequency information.

The embodiment shown in FIG. 1 can be used to collect new encoded strings from users nationwide, and then to derive a new thesaurus or a new version of the full thesaurus suitable for most people, thereby improving the input experience of users in various regions.

The embodiment shown in FIG. 1 can also be used in the following case: the new encoded character strings of users are collected as before, but the collected information further includes the area information of each user, and the users are divided into several areas; the filtering step is performed for each area; and a new regional thesaurus or a new version of the regional full thesaurus is generated for each area. That is, the different pronunciations of people in each area can be counted separately, and input method systems with different coding versions can be provided for each area, or the user can set the area where he or she is located in the input method system, thereby satisfying the coding input habits of users in each area in a more personalized way.

The new thesaurus obtained in the above steps or the new version of the full thesaurus can be used to update the input method.

For example, for updating an ordinary input method: an input method system including a system vocabulary is set in a first computing device, and the new thesaurus or the new version of the full thesaurus is obtained in a second computing device; the input method system connects the first computing device to the second computing device to complete the update of the system vocabulary.

The second computing device storing the new thesaurus or the new version of the full thesaurus may exist in the network in the form of a server and provide a thesaurus update service to any client program that needs the new word information. Of course, it need not take the form of a fixed server. It can also be a local computing device that provides the thesaurus update service, through P2P (peer-to-peer) technology, to the client program of any other terminal that needs the new word information.

In the above update embodiment, the update may be performed in any of the following ways: updating the system vocabulary at the same time as the input method system is updated; or updating the system vocabulary online by active push from the server; or having the user initiate a request, with the server returning data according to the request to update the system vocabulary. Of course, updating by removable storage or by version update can also be used. In short, various data update methods may be used; the present invention is not limited in this respect, and those skilled in the art may select among them as needed.

For another example, for a network input method: the unit of the input method system that receives user input information and displays the corresponding characters is located in a first computing device; the obtained new thesaurus or new version of the full thesaurus serves as the system vocabulary of the input method system, and the system vocabulary is located in a second computing device; the input method system obtains the corresponding information from the system vocabulary located in the second computing device according to the information input by the user, and displays the corresponding characters in the first computing device, completing the text input.

In the above example, the new thesaurus or the new version of the full thesaurus obtained according to the present invention is used directly as the system vocabulary of the input method system, so the online thesaurus can be used without any update operation. The input method system is divided into two parts: the receiving and displaying unit is located in the first computing device, and the thesaurus information is located in the second computing device, which implements online application of the input method. Of course, the code matching process required by the input method system can be placed in either computing device as needed.

FIG. 2 is a flow chart of another method for obtaining a new encoded character string of an input method word according to the present invention, which includes the following steps:

Step 201: Extract the word selected by the user during the input process and the encoded character string input by the user, and store them in the user vocabulary;

Step 202: Collect a user vocabulary of each user;

Step 203: Compare the collected user vocabulary and the input method system vocabulary, where the system vocabulary stores words and corresponding code strings;

Step 204: Determine, according to the preset rule, a new coded string corresponding to the word.

The preset rule may be: if the word selected by the user exists in the system vocabulary, but the encoded string input by the user is different from the encoded string corresponding to the word stored in the system vocabulary, then the encoded string input by the user is determined to be the new encoded string corresponding to the word. Alternatively, the preset rule may be: if the word selected by the user and the encoded string input by the user both exist in the existing thesaurus, the user cumulative word frequency of the corresponding encoded string of the word is further compared with the system word frequency, where the system word frequency is the word frequency information preset in the existing thesaurus for the existing word. If the ratio of the user cumulative word frequency to the system word frequency is greater than or equal to a predetermined threshold, the encoded string input by the user is determined to be the new encoded string corresponding to the word.

Those skilled in the art can also use the above preset rules in combination, or set other rules as needed; the present invention is not limited in this respect.
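The two preset rules above can be combined in one check, sketched below under stated assumptions. The layout of the thesaurus (a mapping from word to its encoded strings and system word frequencies), the function name is_new_encoded_string, the default threshold, and the handling of a word absent from the thesaurus are all hypothetical choices made for illustration.

```python
def is_new_encoded_string(word, code, user_cumulative_freq,
                          thesaurus, ratio_threshold=0.5):
    """Apply the two preset rules.

    thesaurus maps word -> {encoded string: system word frequency}.
    """
    codes = thesaurus.get(word)
    if codes is None:
        # Assumption: a word missing from the thesaurus is outside both rules.
        return False
    if code not in codes:
        # Rule 1: the word exists, but with a different encoded string.
        return True
    system_freq = codes[code]
    if system_freq <= 0:
        # Assumption: no usable system word frequency, so no ratio test.
        return False
    # Rule 2: ratio of user cumulative word frequency to system word frequency.
    return user_cumulative_freq / system_freq >= ratio_threshold


# Hypothetical thesaurus entry for the word "heavy".
system_thesaurus = {"heavy": {"zhongzhong": 100}}
rule1_hit = is_new_encoded_string("heavy", "chongchong", 5, system_thesaurus)
rule2_hit = is_new_encoded_string("heavy", "zhongzhong", 80, system_thesaurus)
rule2_miss = is_new_encoded_string("heavy", "zhongzhong", 10, system_thesaurus)
```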

The embodiment shown in FIG. 2 is basically similar in concept to the embodiment shown in FIG. 1. The main difference is that the user vocabularies of multiple users are collected first, the comparison is then performed centrally, and the new encoded character strings of users are obtained from the comparison result. This approach reduces the number of comparison calculations and the burden on the local input method system, and it can be used directly with existing input method systems; however, because the comparison is performed only after a large number of user-selected words have been collected, it increases the burden on the server system. Those skilled in the art can choose between the two as needed.

Preferably, the embodiment shown in FIG. 2 may further include a filtering step, in which the user vocabulary further includes the user word frequency: calculating the user cumulative word frequency corresponding to each encoded character string, and removing the encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold.
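The filtering step can be sketched as follows. The representation of a user vocabulary as a mapping from (word, encoded string) pairs to that user's word frequency, the function names, and the sample data are assumptions made for illustration only.

```python
from collections import Counter


def cumulative_frequencies(user_vocabularies):
    """Sum each (word, encoded string) frequency over all collected users."""
    total = Counter()
    for vocabulary in user_vocabularies:
        for key, freq in vocabulary.items():
            total[key] += freq
    return total


def filter_by_threshold(cumulative, threshold):
    """Drop entries whose user cumulative word frequency is <= threshold."""
    return {key: freq for key, freq in cumulative.items() if freq > threshold}


# Two hypothetical user vocabularies.
vocabularies = [
    {("heavy", "chongchong"): 3, ("heavy", "zhonzhong"): 1},
    {("heavy", "chongchong"): 4},
]
cumulative = cumulative_frequencies(vocabularies)
kept = filter_by_threshold(cumulative, threshold=1)
```

In this sample the mistyped string "zhonzhong" is entered only once across all users, so it is removed, while "chongchong" survives.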

Preferably, the embodiment shown in FIG. 2 may further include a word frequency assignment step: counting the number of occurrences of the word having the new encoded character string in a preset Internet page database to obtain the Internet word frequency; comparing the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string; and, according to the comparison result, allocating the Internet word frequency to the two or more corresponding encoded strings of the word.

FIG. 3 is a structural block diagram of an input method system according to the present invention, which includes an input interface unit 301, a display unit 302, and a system vocabulary 303. The input method system further includes:

a word extracting unit 304, connected to the input method system, for extracting a word selected by the user and a coded string input by the user during the user input process;

a word matching unit 305, connected to the word extracting unit 304, for comparing the word selected by the user and the encoded string input by the user with the system vocabulary, wherein the system vocabulary stores words and their corresponding encoded strings, and for determining the new encoded string corresponding to the word according to the preset rule.

The preset rule may be: if the word selected by the user exists in the system vocabulary, but the encoded string input by the user is different from the encoded string corresponding to the word stored in the system vocabulary, it is determined that the encoded string input by the user is a new encoded string corresponding to the word.

That is, in addition to ordinary word input, the above input method system can be used to extract new encoded strings of the user. The input method system may be an ordinary input method system: for example, the input interface unit 301, the display unit 302, and the system vocabulary 303 are located in the same computing device, and the input method system displays the corresponding characters locally through local query matching based on the encoded information input by the user. The input method system may also be a network input method system: for example, the input interface unit 301 and the display unit 302 are located in a first computing device, and the system vocabulary 303 is located in a second computing device; the system obtains the corresponding information from the second computing device according to the information input by the user, and displays the corresponding characters in the first computing device.

In order to send the extracted new encoded character strings of the user to a collecting device, and thereby obtain new encoded character strings in a general sense, the input method system preferably further comprises: a communication unit 306, configured to send word records having a new encoded string in real time or at scheduled times, each word record including the word and its corresponding new encoded string.

In order to filter the new encoded strings of each user by user word frequency and so obtain an objective and correct result, the input method system preferably further includes: a word frequency recording unit 307, connected to the input method system, for recording the user word frequency during the user input process, the user word frequency being the frequency information corresponding to the encoded string input by the user; and a user vocabulary 308, configured to store the word selected by the user, the encoded string input by the user, and the corresponding user word frequency.

The input interface unit 301 in the above input method system is mainly used to let the user input information and select words; it can also be used for switching among various modes, for example input language switching (such as switching between Simplified and Traditional Chinese, or between Chinese and English), input mode switching (such as switching among single-character, word, and sentence input), and input state switching (such as switching among text, punctuation, and special symbols). The display unit 302 and the system vocabulary 303 are well known to those skilled in the art and are not described in detail here.

The input method system shown in FIG. 3 may further include: an application determining unit 309, configured to determine the application in which the user is currently inputting, and to send the determination result to the word frequency recording unit 307. The word frequency recording unit 307 is then configured to count the word frequency information separately for each application during the user input process and to apply a corresponding weight correction to form the user word frequency. That is, the input method system can count word frequency separately and assign a corresponding weight according to the application in which the user is currently inputting. For example, since the preferred method of the present invention can obtain the Internet word frequency statistically, words that the user inputs in online community forums can already be counted from the Internet and can therefore be given a relatively low weight value.
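The per-application weight correction can be sketched as a weighted sum. The application names, the weight values, and the default weight below are purely hypothetical; the description does not specify concrete weights.

```python
# Hypothetical weights: forum input is already reflected in the Internet
# word frequency, so it is given a lower weight than local editing.
APP_WEIGHTS = {"text_editor": 1.0, "web_forum": 0.25}
DEFAULT_WEIGHT = 1.0  # assumption for applications without a configured weight


def weighted_user_frequency(counts_by_app, weights=APP_WEIGHTS):
    """Combine per-application input counts into one user word frequency."""
    return sum(
        count * weights.get(app, DEFAULT_WEIGHT)
        for app, count in counts_by_app.items()
    )


# A (word, encoded string) pair typed 4 times in an editor and 8 times
# in a forum.
user_freq = weighted_user_frequency({"text_editor": 4, "web_forum": 8})
```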

FIG. 4 is a structural block diagram of a thesaurus generating apparatus of the present invention, which includes the following components: a collecting unit 401, configured to collect the word records of each user having a new encoded character string, each word record including the word and its corresponding new encoded string.

The thesaurus generating apparatus can be implemented by a server, and the collecting can be implemented by the various methods described above. A user's word records having a new encoded character string can be obtained by the input method and automatically sent to the collecting unit 401; or they can be set or organized by the user and sent to the collecting unit 401; or the word records of each user can be gathered in a fixed network space, from which the collecting unit 401 acquires them. That is, in this embodiment the word records having a new encoded character string need not be obtained from user input behavior; they may also be set or organized by the user.

a first filtering unit 402, configured to remove duplicate word records;

The thesaurus generating unit 403 is configured to generate a new thesaurus according to the filtered word records or add the filtered word records to the original thesaurus to obtain a new thesaurus or a new version of the full thesaurus.
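The work of the first filtering unit 402 and the thesaurus generating unit 403 can be sketched as follows. The record format (word, encoded string), the thesaurus layout (word mapped to a set of encoded strings), and the function names are assumptions chosen for illustration.

```python
def remove_duplicates(records):
    """Keep the first occurrence of each (word, encoded string) record."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def merge_into_thesaurus(original, records):
    """Return a new full thesaurus: the original plus the filtered records."""
    merged = {word: set(codes) for word, codes in original.items()}
    for word, code in records:
        merged.setdefault(word, set()).add(code)
    return merged


# Hypothetical collected records and original thesaurus.
collected = [
    ("heavy", "chongchong"),
    ("heavy", "chongchong"),
    ("heavy", "zhongzhong"),
]
unique_records = remove_duplicates(collected)
new_full_thesaurus = merge_into_thesaurus(
    {"heavy": {"zhongzhong"}}, unique_records
)
```

Generating a standalone new thesaurus instead of a new full version corresponds to calling merge_into_thesaurus with an empty original.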

In order to obtain an objective and correct result, the thesaurus generating apparatus is preferably further configured to collect the user word frequency in the user input behavior, the user word frequency being the frequency information corresponding to the encoded character string input by the user, and further includes: a cumulative word frequency calculation unit 404, configured to calculate the user cumulative word frequency corresponding to the encoded character string; and a second filtering unit 405, configured to remove the encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold. Preferably, the user word frequency may also be counted separately according to the application in which the user is currently inputting, with a corresponding weight applied.

In order to provide corresponding and relatively accurate word frequency information of the above new code string, the vocabulary generating device preferably further includes:

The Internet page database generating unit 406 is configured to perform weight assignment on the Internet page; and store the Internet page whose weight value is greater than or equal to the preset threshold to the Internet page database. The statistic unit 407 is configured to count the number of occurrences of the words in the filtered word record in the preset Internet page database, and obtain the Internet word frequency.

a word frequency allocation unit 408, configured to compare the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string and, according to the comparison result, allocate the Internet word frequency to the two or more corresponding encoded strings of the word. The user cumulative word frequency corresponding to the original encoded character string may be obtained by other means; alternatively, the collecting unit 401 may also collect the original encoded character string of the word and the corresponding user word frequency information of each user, from which the user cumulative word frequency is calculated.
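The Internet page database generating unit 406 and the statistic unit 407 can be sketched as below. How a page weight is assigned is not specified in the description, so the sketch simply assumes that each page already carries a weight; the (text, weight) page format, the function names, and the sample pages are hypothetical.

```python
def build_page_database(pages, threshold):
    """Keep the text of pages whose weight is >= the preset threshold.

    pages is a list of (text, weight) pairs.
    """
    return [text for text, weight in pages if weight >= threshold]


def internet_word_frequency(word, page_database):
    """Count non-overlapping occurrences of the word across stored pages."""
    return sum(text.count(word) for text in page_database)


# Hypothetical pages: the second one has a low weight and is excluded.
pages = [
    ("heavy rain, heavy traffic", 0.9),
    ("heavy spam", 0.1),
]
page_database = build_page_database(pages, threshold=0.5)
freq = internet_word_frequency("heavy", page_database)
```

Filtering by weight before counting keeps low-quality pages from inflating the Internet word frequency.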

Referring to FIG. 5, the present invention further provides a thesaurus generating apparatus, including the following components: a collecting unit 501, configured to collect input behavior information of each user, where the input behavior information includes the word selected by the user, the encoded character string input by the user, and the corresponding user word frequency, the user word frequency being the frequency information of the user inputting the word with the corresponding encoded character string;

a cumulative word frequency calculation unit 502, configured to apply weight correction to each user's word frequency for the word and encoded character string, and to calculate the overall user cumulative word frequency of the word and encoded character string;

The thesaurus generating unit 503 is configured to generate a thesaurus, the thesaurus includes a word, an encoded string, and corresponding word frequency information.

The thesaurus generating apparatus shown in FIG. 5 may further include: a matching unit 504, configured to compare the generated thesaurus with the existing thesaurus, wherein the existing thesaurus stores words, encoded strings, and the corresponding system word frequencies; and a new encoded string determining unit 505, configured to determine the new encoded string corresponding to the word according to the preset rule. In this way, the thesaurus generating apparatus can obtain new encoded strings.

Preferably, the vocabulary generating device further includes: a filtering unit 506, configured to remove an encoded character string whose user accumulated word frequency is less than or equal to a preset threshold.

a statistical unit 507, configured to count the number of occurrences of the word having the new encoded string in the preset Internet page database to obtain the Internet word frequency. The Internet page database is formed by assigning weights to Internet pages and storing the Internet pages whose weight value is greater than or equal to a preset threshold.

a word frequency allocation unit 508, configured to compare the user cumulative word frequency of the new encoded string of the word with the user cumulative word frequency of the original encoded string and, according to the comparison result, allocate the Internet word frequency to the two or more corresponding encoded strings of the word.

Referring to Figure 6, another lexicon generating apparatus is shown, which includes the following components:

a collecting unit 601, configured to collect input behavior information of each user, where the input behavior information includes the word selected by the user, the encoded string input by the user, and the corresponding user word frequency, the user word frequency being the frequency information of the user inputting the word with the corresponding encoded string;

a cumulative word frequency calculation unit 602, configured to apply weight correction to each user's word frequency for the word and encoded character string, and to calculate the overall user cumulative word frequency of the word and encoded character string;

The thesaurus generating unit 603 is configured to generate a thesaurus, the thesaurus includes words, encoded strings, and corresponding word frequency information.

The comparison unit 604 is configured to compare the generated thesaurus and the existing thesaurus, wherein the existing thesaurus stores the words, the encoded strings, and the corresponding system word frequencies;

an expired word determining unit 605, configured to determine expired words, an expired word being a word that does not exist in the generated thesaurus but exists in the existing thesaurus, or a word in the generated thesaurus whose user cumulative word frequency meets a preset condition, for example a user cumulative word frequency less than or equal to a predetermined threshold.

After the apparatus shown in FIG. 6 obtains the expired words, the existing thesaurus can be streamlined according to them to prevent it from growing ever larger, for example by filtering the expired words out of the existing thesaurus and deleting them, thereby reducing the size of the thesaurus, improving its utilization, and improving input efficiency.
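The determination and removal of expired words can be sketched as follows. The data layout (the generated thesaurus as a mapping from word to user cumulative word frequency, the existing thesaurus as a set of words), the function names, and the sample threshold are assumptions made for illustration.

```python
def find_expired_words(generated, existing, threshold):
    """Return the expired words.

    A word is expired if it exists in the existing thesaurus but not in the
    generated one, or if its user cumulative word frequency in the generated
    thesaurus is <= threshold.
    """
    missing = {word for word in existing if word not in generated}
    rare = {word for word, freq in generated.items() if freq <= threshold}
    return missing | rare


def prune_thesaurus(existing, expired):
    """Delete the expired words from the existing thesaurus."""
    return {word for word in existing if word not in expired}


# Hypothetical data: "b" is rarely used, "c" is no longer used at all.
generated_thesaurus = {"a": 10, "b": 1}
existing_thesaurus = {"a", "b", "c"}
expired = find_expired_words(generated_thesaurus, existing_thesaurus, threshold=1)
streamlined = prune_thesaurus(existing_thesaurus, expired)
```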

For brevity, the method has been described in more detail than the system portions; for the latter, refer to the relevant sections of the method description above.

The method for obtaining a new encoded character string of an input method word, the input method system, and the thesaurus generating apparatus provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims

1. A method for obtaining a new encoded character string of an input method word, comprising: extracting a word selected by a user in an input process, and an encoded character string input by the user; comparing the word selected by the user and the encoded string input by the user with an existing thesaurus, wherein the existing thesaurus stores existing words and their corresponding encoded strings;

determining the new encoded string corresponding to the word according to a preset rule.

2. The method of claim 1, further comprising:

Record the word selected by the user and the encoded string input by the user to the user vocabulary;

And during the user input process, the user word frequency is recorded to the user vocabulary, the user word frequency being the frequency information of the user inputting the word and the corresponding encoded character string.

3. The method of claim 2, further comprising:

According to the application in which the user is currently inputting, the word frequency information is counted separately and corrected by a corresponding weight to obtain the user word frequency.

4. The method of claim 2, further comprising:

Collecting a word record of each user having a new encoded string, the record including the word, a corresponding new encoded string, and corresponding word frequency information;

Remove duplicate word records.

5. The method of claim 4, further comprising:

Calculate the cumulative word frequency of the user;

The encoded string whose user cumulative word frequency is less than or equal to the preset threshold is removed.

6. The method according to claim 4 or 5, further comprising:

Counting the number of times the words in the filtered word records appear in the preset Internet page database to obtain the Internet word frequency.

7. The method of claim 6, further comprising:

Comparing the user cumulative word frequency of the new encoded string of the word with the user cumulative word frequency of the original encoded string, and, according to the comparison result, allocating the Internet word frequency to two or more corresponding encoded strings of the word.

8. The method of claim 7, further comprising:

Generate a new thesaurus based on the filtered word records, or add the filtered word records to the original thesaurus to obtain a new thesaurus or a new version of the full thesaurus.

9. The method of claim 8 wherein:

The collected information further includes area information of the user, and divides the user into several areas; and performs a filtering step for each area;

Generate a regional new thesaurus or a new version of the regional full thesaurus for each region.

10. The method of claim 6, wherein the preset Internet page database is obtained by the following steps:

Weighting the Internet page;

The Internet page whose weight value is greater than or equal to the preset threshold is stored in the Internet page database.

11. The method of claim 4, wherein the collecting is: the input method computing device sends the user's word records having the new encoded character string to the collection computing device in real time or at scheduled times.

12. A method for obtaining a new encoded character string of an input method word, comprising: extracting a word selected by a user during an input process, and an encoded character string input by a user, and storing the encoded character string in the user vocabulary;

Collect user vocabularies for individual users;

Comparing the collected user vocabularies with the existing thesaurus of the input method, wherein the existing thesaurus stores words and their corresponding encoded strings;

13. The method of claim 12, further comprising:

The user vocabulary further includes a user word frequency, and the user word frequency is frequency information of the user inputting the word and the corresponding encoded character string;

Calculate the cumulative word frequency of the user;

14. The method according to claim 13, wherein the preset rule is: if the word selected by the user exists in the existing thesaurus, but the encoded string input by the user is different from the encoded string of the word stored in the existing thesaurus, it is determined that the encoded string input by the user is a new encoded string corresponding to the word; or, if the word selected by the user and the encoded string input by the user both exist in the existing thesaurus, further comparing the user cumulative word frequency of the corresponding encoded string of the word with the system word frequency, the system word frequency being the word frequency information preset in the existing thesaurus for the existing word, and, if the ratio of the user cumulative word frequency to the system word frequency is greater than or equal to a predetermined threshold, determining that the encoded string input by the user is a new encoded string corresponding to the word.

15. The method according to claim 12 or 14, further comprising: counting the number of occurrences of the word having the new encoded character string in the preset Internet page database to obtain the Internet word frequency.

16. The method of claim 15, further comprising:

17. An input method system comprising an input interface unit, a display unit, and a system vocabulary, characterized in that it further comprises:

a word extraction unit, connected to the input method system, for extracting words selected by the user during the input process, and an encoded string input by the user;

a word matching unit, connected to the word extracting unit, configured to compare the word selected by the user and the encoded string input by the user with the system vocabulary, wherein the system vocabulary stores words and their corresponding encoded strings, and to determine the new encoded string corresponding to the word according to the preset rule.

18. The input method system of claim 17 wherein:

The input interface unit, the display unit, and the system vocabulary of the input method system are located in the same computing device;

Or the input interface unit and the display unit of the input method system are located in the first computing device, and the system vocabulary is located in the second computing device, and the input method system obtains the corresponding information from the second computing device according to the information input by the user and displays the corresponding character on the first computing device.

19. The input method system of claim 17, further comprising:

a communication unit for transmitting a word record having a new encoded string in real time or at scheduled times, the word record including the word and its corresponding new encoded string.

20. The input method system of claim 17, further comprising:

a word frequency recording unit, connected to the input method system, for recording the user word frequency during user input, the user word frequency being the frequency information of the user inputting the word and its corresponding encoded character string; and a user vocabulary for storing the word selected by the user, the encoded character string input by the user, and the corresponding user word frequency.

21. The input method system of claim 17, further comprising:

An application determining unit, configured to determine the application in which the user is currently inputting, and send the determination result to the word frequency recording unit;

The word frequency recording unit is connected to the input method system and is configured to count the word frequency information separately according to the application in which the user is currently inputting and to apply a corresponding weight correction to obtain the user word frequency.

22. A lexicon generating apparatus, comprising:

a word collecting unit, configured to collect a word record of each user having a new encoded string, the word record including the word and its corresponding new encoded character string;

a first filtering unit, configured to remove duplicate word records;

The thesaurus generating unit is configured to generate a new thesaurus according to the filtered word records or add the filtered word records to the original thesaurus to obtain a new thesaurus or a new version of the full thesaurus.

23. The device of claim 22, further comprising:

a word frequency collecting unit, configured to collect a user word frequency in a user input behavior, where the user word frequency is a frequency information that the user inputs the word and the corresponding encoded character string;

a cumulative word frequency calculation unit, configured to calculate a cumulative word frequency of the user;

And a second filtering unit, configured to remove the encoded character string whose user cumulative word frequency is less than or equal to the preset threshold.

24. The device according to claim 22 or 23, further comprising:

A statistical unit is used to count the number of occurrences of words in the filtered word record in the preset Internet page database, and obtain the Internet word frequency.

25. The device of claim 22, further comprising:

a word frequency allocation unit, configured to compare the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string and, according to the comparison result, allocate the Internet word frequency to two or more corresponding encoded strings of the word.

26. A lexicon generating apparatus, comprising:

a collecting unit, configured to collect input behavior information of each user, where the input behavior information includes a word selected by the user, an encoded string input by the user, and a corresponding user word frequency, wherein the user word frequency is frequency information of the user inputting the word and the corresponding encoded character string;

The thesaurus generating unit is configured to generate a thesaurus, the thesaurus includes words, encoded strings and corresponding word frequency information.

27. The device of claim 26, further comprising:

a comparison unit, configured to compare the generated thesaurus and the existing thesaurus, wherein the existing thesaurus stores words, encoded strings, and corresponding system word frequencies;

a determining unit, configured to determine a new encoded string corresponding to the word according to the preset rule.

28. The device of claim 27, further comprising:

The filtering unit is configured to remove the encoded strings whose user cumulative word frequency is less than or equal to the preset threshold.

29. The device according to claim 27 or 28, further comprising:

The statistical unit counts the number of times the word with the new encoded string appears in the preset Internet page database, and obtains the Internet word frequency;

30. The device of claim 26, further comprising:

An expired word determining unit, configured to determine an expired word; the expired word is a word that does not exist in the generated thesaurus but exists in an existing thesaurus, or the expired word is a word in the generated thesaurus whose user cumulative word frequency meets a preset condition.