WO2008028421A1 - Methods for obtaining a new encoded character string, input method system and method, and thesaurus generating device

Info

Publication number
WO2008028421A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
user
input
encoded
word frequency
Prior art date
Application number
PCT/CN2007/070518
Other languages
English (en)
Chinese (zh)
Inventor
Qi Guo
Zijian Tong
Lei Yang
Original Assignee
Beijing Sogou Technology Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co., Ltd.
Publication of WO2008028421A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding

Definitions

  • The present invention relates to the field of input methods, and in particular to a method for obtaining a new encoded character string of an input method word, an input method system, and a thesaurus generating device.
  • An existing input method system matches the words the user wants based on the encoded character string the user enters; examples include Chinese, Japanese, and Korean input method systems.
  • A corresponding encoded string is set for each word, and the user can obtain the desired word only by entering the correct encoded string.
  • The user has to learn the correct encoded strings, and it is difficult to ensure that every correspondence between encoded strings and words that the user has in mind is correct. Existing input method systems therefore add some fault tolerance, which satisfies some users.
  • The fuzzy sound solution addresses some of the problems caused by differences between northern and southern speech. However, each region has its own dialect (Chinese in particular has a large number of dialects), so users who enter words by phonetic code will, to a greater or lesser degree, still type inaccurate codes, and the fuzzy sound solution does not solve all of these problems.
  • The technical problem to be solved by the present invention is to provide a method for obtaining a new encoded string of input method words.
  • The method and device can obtain the new encoded strings used by each user and aggregate them to generate a thesaurus, thereby accommodating users' habitual new encoded strings and improving the hit rate of the user's preferred word.
  • Another object of the present invention is to provide an input method system that can automatically acquire, in a simple, convenient, timely, and effective manner, the encoded character strings the user uses for certain words, and obtain by comparison the new encoded strings used by each user.
  • Another object of the present invention is to provide a thesaurus generating apparatus capable of efficiently providing a relatively complete full thesaurus, or a new thesaurus, that includes new encoded character strings suited to users' input habits.
  • The present invention provides a method for obtaining a new encoded character string of an input method word, comprising: extracting the word selected by the user during input and the encoded character string entered by the user; comparing the selected word and the entered encoded string with the existing thesaurus, wherein the existing thesaurus stores existing words and their corresponding encoded strings; and determining, according to preset rules, the new encoded string corresponding to the word.
  • The method further includes: recording the word selected by the user and the encoded string entered by the user in the user vocabulary; and, during user input, recording the user word frequency in the user vocabulary, wherein the user word frequency is frequency information on the user entering the word with its corresponding encoded string.
  • The method further includes: obtaining the user word frequency by correcting the counted word frequency information with a weight that corresponds to the application in which the user is currently typing.
  • The method further includes: collecting each user's word records that have a new encoded character string, each record including the word, the corresponding new encoded string, and corresponding word frequency information; and removing duplicate word records.
  • The method further includes: calculating the user cumulative word frequency; and removing encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold.
  • The method further includes: counting the number of times the words in the filtered word records occur in a preset Internet page database, to obtain the Internet word frequency.
  • The method further includes: comparing the user cumulative word frequency of the word's new encoded character string with that of its original encoded character string and, according to the comparison result, allocating the Internet word frequency among the two or more encoded strings corresponding to the word.
  • The method further includes: generating a new thesaurus from the filtered word records, or adding the filtered word records to the original thesaurus to obtain a new version of the full thesaurus.
  • The collected information further includes area information of the user; the users are divided into several areas; the filtering step is performed for each area; and a new regional vocabulary or a new version of the regional full vocabulary is generated for each area.
  • The preset Internet page database is obtained by the following steps: assigning a weight to each Internet page; and storing the Internet pages whose weight is greater than or equal to a preset threshold in the Internet page database.
  • The collection is performed as follows: the input method computing device sends the user's word records that have new encoded strings to the collection computing device in real time or at scheduled times.
  • The present invention also provides a method for obtaining a new encoded character string of an input method word, comprising: extracting the word selected by the user during input and the encoded string entered by the user, and storing them in a user vocabulary; collecting the user vocabulary of each user; comparing the collected user vocabularies with the system vocabulary of the input method, wherein the system vocabulary stores words and their corresponding encoded strings; and determining, according to preset rules, the new encoded string corresponding to the word.
  • The user vocabulary further includes a user word frequency, which is frequency information on the user entering the word with the corresponding encoded character string; the method further includes calculating the user cumulative word frequency and removing encoded strings whose user cumulative word frequency is less than or equal to a preset threshold.
  • The preset rule is:
  • If the word selected by the user exists in the system vocabulary but the encoded string entered by the user differs from the encoded string stored for that word, the encoded string entered by the user is a new encoded string corresponding to the word. The rule may further compare the user cumulative word frequency of that encoded string with the system word frequency, the system word frequency being the word frequency information preset for the existing word in the existing thesaurus; if the ratio of the user cumulative word frequency to the system word frequency is greater than or equal to a predetermined threshold, the encoded string entered by the user is determined to be a new encoded string corresponding to the word.
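The preset rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `is_new_encoding`, the layout of `system_vocabulary` (word mapped to a pair of known codes and a system word frequency), the default threshold, and the handling of a zero system word frequency are all hypothetical and are not given in the patent text.

```python
def is_new_encoding(word, code, system_vocabulary, user_cum_freq, ratio_threshold=0.1):
    """Decide whether `code` counts as a new encoded string for `word`.

    system_vocabulary (assumed layout): {word: (set_of_known_codes, system_word_frequency)}
    user_cum_freq: cumulative frequency with which users typed `code` for `word`.
    """
    entry = system_vocabulary.get(word)
    if entry is None:
        return False              # the rule only covers words already in the vocabulary
    known_codes, system_freq = entry
    if code in known_codes:
        return False              # the vocabulary already stores this encoded string
    if system_freq <= 0:
        return False              # assumption: no usable system frequency to compare against
    # Ratio test: cumulative user frequency relative to the system word frequency.
    return user_cum_freq / system_freq >= ratio_threshold
```

With a system word frequency of 1000 and a threshold of 0.1, a code typed 150 times in total would be accepted, while one typed 50 times would not.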
  • The method further includes: counting the number of times the words that have a new encoded string occur in the preset Internet page database, to obtain the Internet word frequency.
  • The method further includes: comparing the user cumulative word frequency of the word's new encoded character string with that of its original encoded character string and, according to the comparison result, allocating the Internet word frequency among the two or more encoded strings corresponding to the word.
  • The invention also provides an input method system, comprising an input interface unit, a display unit, and a system vocabulary, and further comprising: a word extraction unit, connected to the input method system, for extracting the word selected by the user during input and the encoded string entered by the user; and a word matching unit, connected to the word extraction unit, configured to compare the word selected by the user and the encoded string entered by the user with the system vocabulary, wherein the system vocabulary stores words and their corresponding encoded strings, and to determine, according to the preset rule, the new encoded string corresponding to the word.
  • The input interface unit, the display unit, and the system vocabulary of the input method system are located in the same computing device; or the input interface unit and the display unit are located in a first computing device and the system vocabulary is located in a second computing device, in which case the input method system obtains the corresponding information from the second computing device according to the information entered by the user and displays the corresponding characters on the first computing device.
  • The input method system further includes: a communication unit, configured to send word records that have a new encoded string in real time or at scheduled times, each word record including the word and its corresponding new encoded string.
  • The input method system further includes: a word frequency recording unit, connected to the input method system, configured to record the user word frequency during user input, wherein the user word frequency is frequency information on the user entering the word with its corresponding encoded string; and a user vocabulary, used to store the words selected by the user, the encoded strings entered by the user, and the corresponding user word frequencies.
  • The input method system further includes: an application determining unit, configured to determine the application in which the user is currently typing and send the result to the word frequency recording unit; the word frequency recording unit, connected to the input method system, is configured to correct the counted word frequency information during user input with the weight corresponding to that application, to obtain the user word frequency.
  • The present invention also provides a thesaurus generating apparatus, comprising: a word collecting unit, configured to collect each user's word records that have a new encoded character string, each word record including the word and its corresponding new encoded string; a first filtering unit, configured to remove duplicate word records; and a thesaurus generating unit, configured to generate a new thesaurus from the filtered word records, or to add the filtered word records to the original thesaurus to obtain a new version of the full thesaurus.
  • The device further includes: a word frequency collecting unit, configured to collect the user word frequency from users' input behavior, wherein the user word frequency is frequency information on the user entering the word with the corresponding encoded character string;
  • a cumulative word frequency calculation unit, configured to calculate the user cumulative word frequency; and
  • a second filtering unit, configured to remove encoded character strings whose user cumulative word frequency is less than or equal to the preset threshold.
  • The device further includes: a statistical unit, configured to count the number of times the words in the filtered word records occur in the preset Internet page database, to obtain the Internet word frequency.
  • The device further includes: a word frequency allocation unit, configured to compare the user cumulative word frequency of the word's new encoded string with that of its original encoded string and, according to the comparison result, allocate the Internet word frequency among the two or more encoded strings corresponding to the word.
  • The invention also provides a thesaurus generating device, comprising:
  • a collecting unit, configured to collect input behavior information of each user, the input behavior information including the word selected by the user, the encoded string entered by the user, and the corresponding user word frequency, wherein the user word frequency is frequency information on the user entering the word with the corresponding encoded string;
  • a cumulative word frequency calculation unit, configured to apply a weight correction to each user's word frequency for the word and encoded string, and to calculate the overall user cumulative word frequency for the word and encoded string;
  • the thesaurus includes words, encoded strings, and corresponding word frequency information.
  • The device further includes: a comparison unit, configured to compare the generated thesaurus with the existing thesaurus, wherein the existing thesaurus stores words, encoded strings, and their corresponding system word frequencies; and a determining unit, configured to determine the new encoded string corresponding to a word according to a preset rule.
  • The device further includes: a filtering unit, configured to remove encoded character strings whose user cumulative word frequency is less than or equal to a preset threshold.
  • a statistical unit, configured to count the number of times a word that has a new encoded string occurs in a preset Internet page database, to obtain the Internet word frequency;
  • a word frequency allocation unit, configured to compare the user cumulative word frequency of the word's new encoded string with that of its original encoded string and, according to the comparison result, allocate the Internet word frequency among the two or more encoded strings corresponding to the word.
  • The device further includes: a comparison unit, configured to compare the generated thesaurus with the existing thesaurus, wherein the existing thesaurus stores words, encoded strings, and their corresponding system word frequencies; and a determining unit, configured to determine expired words. An expired word is a word that does not exist in the generated thesaurus but exists in the existing thesaurus, or a word in the generated thesaurus whose user cumulative word frequency meets a preset condition.
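A minimal sketch of the expired-word determination follows. The text does not say what the "preset condition" on the user cumulative word frequency is; the sketch assumes it means "below a minimum frequency". The function name `find_expired_words` and the parameter `min_cum_freq` are hypothetical.

```python
def find_expired_words(generated, existing_words, min_cum_freq=1):
    """Return the set of expired words.

    generated: {word: user_cumulative_word_frequency} for the generated thesaurus.
    existing_words: iterable of words stored in the existing thesaurus.
    min_cum_freq: assumed form of the preset condition (frequency below this value).
    """
    # Case 1: present in the existing thesaurus, absent from the generated one.
    missing = {w for w in existing_words if w not in generated}
    # Case 2: present in the generated thesaurus but used too rarely (assumed condition).
    rarely_used = {w for w, freq in generated.items() if freq < min_cum_freq}
    return missing | rarely_used
```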
  • The present invention has the following advantages:
  • The present invention proposes a distributed architecture comprising multiple clients and a collector. On the user side, the words and encoded strings entered by the user are extracted and compared with the existing thesaurus to learn new encoded strings that suit the user's habits; the new encoded strings of each user and their corresponding words are then collected and aggregated, and analysis and filtering yield new encoded strings of general significance.
  • The invention approaches the problem from the perspective of user input and can learn, promptly and comprehensively, the new encoded strings that users employ during input, including new encoded strings that reflect users' dialect habits as well as new encoded strings that could not have been anticipated but are frequently used, which improves the accuracy of preferred words.
  • The present invention looks up the words that have new encoded strings in a selected Internet page database and counts their occurrences to obtain the Internet word frequency. The Internet word frequency is then corrected according to the distribution of user word frequency between the new and old encoded strings of the word and assigned to those strings, giving more scientific word frequency results and preventing the habits of some users from degrading the input efficiency and input experience of other users.
  • The present invention can also collect and count only the new encoded strings of users in a certain area, to learn the language habits or coding habits of users in that area. This makes it possible to provide input method systems with a different pronunciation or coding version for each area, or to let users set, in the input method system, the area in which they are located.
  • FIG. 1 is a flow chart of the steps of a preferred embodiment of the method for obtaining a new encoded character string of an input method word;
  • FIG. 2 is a flow chart of the steps of another method for obtaining a new encoded character string of an input method word;
  • FIG. 3 is a structural block diagram of the input method system;
  • FIG. 4 is a structural block diagram of the thesaurus generating apparatus;
  • FIG. 5 is a structural block diagram of a thesaurus generating apparatus for determining a new encoded character string;
  • FIG. 6 is a structural block diagram of a thesaurus generating apparatus for determining expired words.
  • Referring to FIG. 1, a flow chart of the steps of a preferred embodiment of the method for obtaining a new encoded character string of an input method word according to the present invention is shown, including the following steps:
  • Step 101: During user input, extract the word selected by the user and the encoded string entered by the user.
  • Step 101 can be performed by the input method system, which can extract the word selected by the user and the encoded string entered by the user in any feasible manner during user input.
  • The extracted information can be passed directly to step 102, or stored in the user vocabulary, with the user vocabulary compared against the system vocabulary after a certain time interval.
  • The purpose of step 101 is to record the word the user selected and the encoded string the user entered.
  • The encoded character string may be a pinyin code or a shape-based (font) code; that is, the present invention can be applied to various input methods, and is of course preferably applied to Chinese input methods that rely on encoded input.
  • The words selected by the user will include some words for which the user habitually types a phonetic code based on dialect pronunciation. For example, for the word "Fold", the user enters what they take to be the correct encoded string, "zhele", but the encoded string stored for that word in the original thesaurus of the input method is "shele". The word therefore cannot be shown directly among the candidate words, and the user has to select each character individually to obtain the desired word.
  • There are many such words, for example "turning heads" ("diaotou" versus "tiaotou") and "urinary" ("niaoniao" versus "niaosui"); the cases are too numerous to enumerate. With this invention, such new encoded strings can be found promptly, improving the accuracy of the preferred word during user input.
  • The user can also use the manual word-creation function provided by the input method (for example, the Microsoft Pinyin input method or a double-pinyin input method) to create new encoded strings for words that the original thesaurus does not associate with them, so that the user can select the desired word during input.
  • For example, the word "Traditional" is a place name in Shanxi, and its corresponding encoded string is "fanshi", whereas the character "⁇" in input methods generally corresponds to the encoded strings "shi" and "zhi".
  • Step 102: Compare the word selected by the user and the encoded string entered by the user with the existing thesaurus.
  • This specification uses the system vocabulary to represent the existing thesaurus; the system vocabulary stores the existing words and their corresponding encoded strings.
  • Step 103: If the word selected by the user exists in the system vocabulary, but the encoded string entered by the user differs from the encoded string stored for that word in the system vocabulary, determine that the encoded string entered by the user is a new encoded string corresponding to the word.
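Steps 101 to 103 can be sketched as a small client-side routine. This is an illustrative sketch only: the function name `collect_new_codes`, the event format, and the dictionary layout of the system vocabulary are assumptions, and the placeholder keys such as `WORD_A` stand in for the actual words.

```python
from collections import Counter

def collect_new_codes(input_events, system_vocabulary):
    """Scan (selected_word, typed_code) events and count new encoded strings.

    system_vocabulary (assumed layout): {word: set_of_known_codes}.
    Returns a Counter keyed by (word, code), i.e. a simple user vocabulary
    of new encoded strings together with their user word frequency.
    """
    new_codes = Counter()
    for word, code in input_events:          # Step 101: extracted word and code
        known = system_vocabulary.get(word)   # Step 102: compare with the vocabulary
        if known is None:
            continue                          # word unknown to the system: not covered by Step 103
        if code not in known:                 # Step 103: same word, different code
            new_codes[(word, code)] += 1
    return new_codes
```

For instance, if `WORD_A` is stored with the code "shele" and the user twice selects it after typing "zhele", the routine records ("WORD_A", "zhele") with a count of 2.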
  • In this way, a user's coding habits can be obtained automatically and simply. Then, after the new encoded strings of multiple users and their corresponding words are collected in any of various ways and filtering steps such as removing duplicate word records are applied, new encoded strings of general significance can be obtained.
  • The collection may be as follows: the input method user's computing device sends the user's new encoded strings and their corresponding words to the word collection computing device in real time or at scheduled times; that is, the input method computing device has an automatic sending module.
  • The collection computing device exists in the form of a server.
  • Alternatively, the input method user may send the new encoded strings and their corresponding words to the collecting end periodically or irregularly; that is, the sending is initiated manually by the user. For example, each user sends their own new encoded strings and corresponding words to a unified email address or unified server for collection.
  • The vocabulary storing the user's personal words can also be sent to the collection computing device in real time or periodically. For example, each user's vocabulary can be collected when the user backs it up to a server periodically or irregularly.
  • The collection of the user's new encoded strings and their corresponding words is simpler when the input method system used by the user is itself a server: it can be used by multiple users, and the input behavior information of each user can be collected during use.
  • Any approach that enables information collection is feasible for the present invention; they are not enumerated one by one here.
  • Step 101 further includes: recording the user word frequency in the user vocabulary during user input. The user vocabulary includes a plurality of word records, each including the word, the corresponding new encoded string, and the corresponding user word frequency.
  • The user word frequency in step 101 may be collected as follows: the counted word frequency information is corrected with a weight that corresponds to the application in which the user is currently typing, giving the user word frequency.
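The application-dependent weighting can be sketched as follows. The application names, the weight values in `APP_WEIGHTS`, the default weight, and the function name `record_input` are all hypothetical; the text only states that a weight corresponding to the current application corrects the counted word frequency.

```python
from collections import defaultdict

# Hypothetical weights: input typed in some applications counts more than in others.
APP_WEIGHTS = {"word_processor": 1.0, "chat": 0.5}
DEFAULT_WEIGHT = 0.75  # assumed weight for applications not listed above

def record_input(user_vocabulary, word, code, application):
    """Add one weighted occurrence of (word, code) to the user vocabulary."""
    weight = APP_WEIGHTS.get(application, DEFAULT_WEIGHT)
    user_vocabulary[(word, code)] += weight

# Usage: the user vocabulary maps (word, code) to the weighted user word frequency.
user_vocabulary = defaultdict(float)
record_input(user_vocabulary, "WORD_A", "zhele", "word_processor")
record_input(user_vocabulary, "WORD_A", "zhele", "chat")
```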
  • The method further includes:
  • Step 104: Collect each user's word records that have a new encoded string, each record including the word, the corresponding new encoded string, and the corresponding user word frequency.
  • Step 105: Remove duplicate word records.
  • Step 106: Calculate the user cumulative word frequency corresponding to each encoded string.
  • The user cumulative word frequency can be calculated by simply summing the user word frequencies of all users for the collected and aggregated words.
  • Alternatively, the calculation may apply a weight correction to each user's word frequency for the word before computing the user cumulative word frequency of each word. The weight correction can be performed by analyzing the word frequencies of all users for a given word: for example, the per-user word frequencies for the word are first analyzed to find their distribution trend, and word frequency values that occur with low probability or that deviate from the average range are then corrected.
  • The cumulative word frequency calculated with this correction filters out the accidental or malicious behavior of some users and yields an objective and accurate user cumulative word frequency, thereby ensuring the accuracy of the thesaurus.
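One possible form of the weight correction is sketched below. The patent does not specify how outlying values are corrected; this sketch assumes a simple rule that caps each user's frequency at a multiple of the median, and the names `cumulative_word_frequency` and `cap_factor` are hypothetical.

```python
from statistics import median

def cumulative_word_frequency(per_user_freqs, cap_factor=3.0):
    """Sum per-user word frequencies after capping outliers.

    per_user_freqs: list with one word frequency per user for a given (word, code).
    cap_factor: assumed correction rule; values above cap_factor * median are
    reduced to that cap so a single user cannot dominate the total.
    """
    if not per_user_freqs:
        return 0.0
    cap = cap_factor * median(per_user_freqs)
    return float(sum(min(freq, cap) for freq in per_user_freqs))
```

With per-user frequencies of 2, 3, 4 and 500, the median is 3.5 and the cap is 10.5, so the corrected total is 19.5 rather than the simple sum of 509.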
  • Step 107: Remove encoded character strings whose user cumulative word frequency is less than or equal to the preset threshold. This step removes the idiosyncratic input habits of individual users that have no general significance, and ensures the objectivity and accuracy of the new encoded strings obtained.
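Steps 104 to 107 can be sketched as a single server-side pass. The record layout `(user_id, word, code, user_freq)`, the reading of "duplicate word records" as exact duplicates, and the function name `aggregate_and_filter` are assumptions made for illustration; the cumulative frequency here uses the simple summation variant.

```python
from collections import defaultdict

def aggregate_and_filter(user_records, threshold):
    """Aggregate collected word records and drop rarely used encoded strings.

    user_records: iterable of (user_id, word, code, user_freq) tuples (Step 104).
    Returns {(word, code): user_cumulative_word_frequency} for entries whose
    cumulative frequency is strictly greater than `threshold`.
    """
    seen = set()
    cumulative = defaultdict(int)
    for record in user_records:
        if record in seen:                  # Step 105: skip exact duplicate records
            continue
        seen.add(record)
        _, word, code, freq = record
        cumulative[(word, code)] += freq    # Step 106: simple summation across users
    # Step 107: remove strings whose cumulative frequency is <= threshold.
    return {key: total for key, total in cumulative.items() if total > threshold}
```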
  • Step 108: Count the number of times the words in the filtered word records occur in the preset Internet page database, to obtain the Internet word frequency.
  • Step 108 may further include a weight assignment step: assigning a weight to each Internet page, and storing the Internet pages whose weight is greater than or equal to the preset threshold in the Internet page database.
  • This process is optional; its purpose is to obtain a database of selected Internet pages and so ensure the accuracy of new word screening. Of course, other methods can also be used to build the preset Internet page database.
  • In the weight assignment step, a relatively important case is to assign the weight according to when the web page was created and the type of web page. For Internet word frequency statistics the age of a web page matters a great deal, so page time has the larger influence on the weight: the further the page's time is from the time of the word frequency statistics, the lower the weight. If the time difference exceeds a certain value, the page can be given a lower weight or even excluded from the word frequency statistics. Second, the type of web page also has a large influence on the word frequency statistics.
  • The web page type generally refers to portal websites, forums, or certain other designated web pages. These pages receive a higher weight because they have more participants and more information.
  • A rule base can be set up that stores the URLs of certain web pages, marking the pages at these URLs as more important for word frequency statistics; words appearing on these pages are preferred, and the pages are given a greater weight.
  • The present invention can further exclude duplicate web pages, pornographic web pages, and spam web pages by giving them lower weights, thereby further ensuring the accuracy of new word verification.
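The page weighting and counting described above can be sketched as follows. The specific weights, the three-year cutoff, the linear recency factor, and the page dictionary layout (`date`, `type`, `text`) are all hypothetical choices; the text only states that older pages get lower weights and that portal and forum pages get higher ones.

```python
from datetime import date

TYPE_WEIGHTS = {"portal": 1.0, "forum": 0.8}   # hypothetical type weights
DEFAULT_TYPE_WEIGHT = 0.4                      # assumed weight for other page types
MAX_AGE_DAYS = 365 * 3                         # assumed cutoff: older pages are excluded

def page_weight(page_date, page_type, stats_date):
    """Weight a page by its age relative to the statistics date and by its type."""
    age_days = max((stats_date - page_date).days, 0)
    if age_days > MAX_AGE_DAYS:
        return 0.0                             # too old: excluded from the statistics
    recency = 1.0 - age_days / MAX_AGE_DAYS    # newer pages score closer to 1.0
    return recency * TYPE_WEIGHTS.get(page_type, DEFAULT_TYPE_WEIGHT)

def build_page_database(pages, stats_date, min_weight):
    """Keep only pages whose weight is greater than or equal to min_weight."""
    return [p for p in pages
            if page_weight(p["date"], p["type"], stats_date) >= min_weight]

def internet_word_frequency(word, page_database):
    """Count occurrences of `word` across the selected pages."""
    return sum(p["text"].count(word) for p in page_database)
```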
  • the vocabulary can be set to include the word, the corresponding Internet word frequency of the word, and the corresponding user cumulative word frequency of the word.
  • the word “heavy” has an Internet word frequency and two user cumulative word frequencies in the thesaurus, corresponding to "chongchong""zhongzhong” 0.
  • Internet word frequency can improve the accuracy of word frequency, but because words cannot be found on the Internet. Reflecting the encoded string, the user can accumulate the word frequency to reflect the user's input habits and improve the hit rate of the preferred word.
  • step 108 may not be needed, and the vocabulary may be included to include the word, the original word frequency of the word, and the corresponding cumulative word frequency of the word.
  • the above-mentioned one word corresponds to two word frequency.
  • the use process is complicated, and two types of word frequency data are needed to achieve the best effect.
  • the preferred embodiment shown in FIG. 1 may further include step 109, the above two types.
  • the word frequency data is adjusted to a word frequency data.
  • Step 109: Allocate the Internet word frequency of the word to its two or more corresponding encoded character strings according to the ratio of the user cumulative word frequency of the new encoded character string to the user cumulative word frequency of the original encoded character string. That is, a word appearing on the Internet corresponds to two or more encoded character strings, and the Internet word frequency, which reflects the total word frequency of the word, is divided among those encoded character strings according to the differences in the user cumulative word frequencies of the strings that users input. This objectively and accurately reflects users' input habits and improves the accuracy of the preferred word during input.
  • Step 109 gives only one example of how the Internet word frequency may be allocated.
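  • The proportional allocation of Step 109 can be sketched as follows. This is one possible reading under stated assumptions: the function name is hypothetical, the even split used when no user frequency has been recorded is an assumption of this sketch, and the numbers in the usage note are invented.

```python
def allocate_internet_freq(internet_freq, user_cumulative_freq):
    """Split a word's Internet word frequency across its encoded strings
    in proportion to each string's user cumulative word frequency."""
    if not user_cumulative_freq:
        return {}
    total = sum(user_cumulative_freq.values())
    if total == 0:
        # Assumption of this sketch: with no user data, split evenly.
        share = internet_freq / len(user_cumulative_freq)
        return {code: share for code in user_cumulative_freq}
    return {
        code: internet_freq * freq / total
        for code, freq in user_cumulative_freq.items()
    }
```

  • For example, with an Internet word frequency of 1200 and user cumulative word frequencies of 30 for "chongchong" and 90 for "zhongzhong", the call `allocate_internet_freq(1200, {"chongchong": 30, "zhongzhong": 90})` assigns 300 and 900 respectively.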
  • Step 1010 Generate a new thesaurus according to the filtered word records or add the filtered word records to the original thesaurus to obtain a new thesaurus or a new version of the full thesaurus.
  • the word record includes the word, the corresponding new encoded string, and corresponding word frequency information.
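  • The thesaurus generation step above can be sketched as a merge of filtered word records into a copy of the original thesaurus. The nested-dictionary layout and the function name are assumptions made for illustration; the patent does not prescribe a storage format.

```python
def build_thesaurus(filtered_records, original=None):
    """Return a new thesaurus {word: {encoded_string: word_frequency}}.

    filtered_records is an iterable of (word, encoded_string, frequency).
    If an original thesaurus is given, the records are added to a copy of
    it, yielding a new version of the full thesaurus.
    """
    if original is None:
        thesaurus = {}
    else:
        thesaurus = {word: dict(codes) for word, codes in original.items()}
    for word, code, freq in filtered_records:
        thesaurus.setdefault(word, {})[code] = freq
    return thesaurus
```

  • Copying the original thesaurus rather than modifying it in place keeps the previous version available, which suits the versioned update scenarios described in this document.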
  • The embodiment shown in FIG. 1 can be used to collect new encoded character strings from users nationwide and then derive a new thesaurus or a new version of the full thesaurus suitable for most people, thereby improving the input experience of users in every region.
  • The embodiment shown in FIG. 1 can also be used as follows: users' new encoded character strings are still collected, but the collected information further includes each user's region information, and users are divided into several regions; the filtering step is performed separately for each region; and a new regional vocabulary or a new version of the regional vocabulary is generated for each region. That is, the different pronunciations of people in each region can be counted separately, and an input method system with a different encoding version can be provided for each region, or the user can set his or her region in the input method system, thereby satisfying the encoding habits of users in each region in a more personalized way.
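  • The regional variant can be sketched by grouping the collected records by region before filtering. The record layout, the region labels, and the function name are hypothetical; the filter keeps only strings whose regional cumulative frequency exceeds the threshold, mirroring the filtering step described elsewhere in this document.

```python
from collections import defaultdict


def build_regional_vocabularies(records, threshold):
    """Group (region, word, encoded_string, frequency) records by region,
    accumulate frequencies per (word, encoded_string), and drop pairs whose
    regional cumulative frequency is less than or equal to the threshold."""
    cumulative = defaultdict(lambda: defaultdict(int))
    for region, word, code, freq in records:
        cumulative[region][(word, code)] += freq
    return {
        region: {pair: f for pair, f in pairs.items() if f > threshold}
        for region, pairs in cumulative.items()
    }
```

  • Because each region is filtered against its own totals, a string that is rare nationwide can still survive in the region where it is common.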
  • the new thesaurus obtained in the above steps or the new version of the full thesaurus can be used to update the input method.
  • The second computing device that stores the new vocabulary or the new version of the full vocabulary may exist in the network in the form of a server and provide a vocabulary update service to any client program that needs the new word information. It need not take the form of a fixed server: it can also reside in a local computing device and provide the vocabulary update service to client programs on other terminals through P2P (peer-to-peer) technology.
  • The update may be performed in any of the following ways: the system vocabulary is updated at the same time as the input method system itself; or the system vocabulary is updated online by the server actively pushing data; or the user initiates a request, and the server returns data in response to the request to update the system vocabulary.
  • various data update methods may be used, and the present invention is not limited thereto, and those skilled in the art may select them according to needs.
  • In one arrangement, the unit of the input method system that receives user input and displays the corresponding characters is located in a first computing device; the new thesaurus or the new version of the full thesaurus serves as the system vocabulary of the input method system and is located in a second computing device; the input method system obtains the corresponding information from the system vocabulary in the second computing device according to the information input by the user and displays the corresponding characters on the first computing device, completing the text input.
  • In this way, the new thesaurus or the new version of the full thesaurus obtained according to the present invention can be used directly as the system vocabulary of the input method system, so the online thesaurus can be used without any update operation.
  • The input method system is thus divided into two parts: the receiving and displaying unit is located in the first computing device, and the thesaurus information is located in the second computing device, which allows the input method to be applied online. The encoding matching process required by the input method system can be placed in either computing device as needed.
  • Referring to FIG. 2, a flow chart of another method for obtaining a new encoded character string of an input method word according to the present invention is shown, which includes the following steps:
  • Step 201: Extract the word selected by the user during input and the encoded character string input by the user, and store them in the user vocabulary;
  • Step 202: Collect the user vocabulary of each user;
  • Step 203: Compare the collected user vocabularies with the system vocabulary of the input method, where the system vocabulary stores words and their corresponding encoded character strings;
  • Step 204: Determine, according to the preset rule, the new encoded character string corresponding to the word.
  • The preset rule may be: if the word selected by the user exists in the system vocabulary, but the encoded string input by the user differs from the encoded string stored for that word in the system vocabulary, then the encoded string input by the user is determined to be the new encoded string corresponding to the word.
  • Alternatively, the preset rule may be: if both the word selected by the user and the encoded string input by the user exist in the existing thesaurus, further compare the user cumulative word frequency of that encoded string with the system word frequency, where the system word frequency is the word frequency information preset for the existing word in the existing thesaurus. If the ratio of the user cumulative word frequency to the system word frequency is greater than or equal to a predetermined threshold, the encoded string input by the user is determined to be the new encoded string corresponding to the word.
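  • The two preset rules can be sketched in one comparison function. The dictionary layouts, the function name, and the optional `ratio_threshold` parameter are assumptions of this sketch; words that are absent from the system vocabulary are skipped here because they are new words rather than new encoded strings.

```python
def find_new_codes(user_vocab, system_vocab, ratio_threshold=None):
    """Return (word, encoded_string) pairs judged to be new encoded strings.

    user_vocab:   {(word, encoded_string): user cumulative word frequency}
    system_vocab: {word: {encoded_string: system word frequency}}
    """
    new_codes = []
    for (word, code), user_freq in user_vocab.items():
        system_codes = system_vocab.get(word)
        if system_codes is None:
            continue  # unknown word: outside the scope of these rules
        if code not in system_codes:
            # Rule 1: known word, but the encoded string differs.
            new_codes.append((word, code))
        elif ratio_threshold is not None and system_codes[code] > 0:
            # Rule 2: word and string both exist; compare frequencies.
            if user_freq / system_codes[code] >= ratio_threshold:
                new_codes.append((word, code))
    return new_codes
```

  • Passing `ratio_threshold=None` applies only the first rule; supplying a threshold additionally applies the frequency-ratio rule.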
  • The embodiment shown in FIG. 2 is conceptually similar to the embodiment shown in FIG. 1.
  • The main difference is that the user vocabularies of multiple users are collected first, the comparison is then performed centrally, and the new encoded character strings of the users are obtained from the comparison result.
  • This method reduces the number of comparison calculations and lightens the load on the local input method system, so it can be used directly with an existing input method system; however, the comparison is performed only after a large number of user-selected words have been collected, which increases the load on the server. Those skilled in the art can choose between the two methods as needed.
  • The embodiment shown in FIG. 2 may further include a filtering step: calculate, from the collected user vocabularies, the user cumulative word frequency corresponding to each encoded character string, and remove the encoded character strings whose user cumulative word frequency is less than or equal to the preset threshold.
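  • A minimal sketch of this filtering step follows, assuming each user vocabulary is a mapping from (word, encoded string) to that user's word frequency. The function name and data layout are illustrative assumptions.

```python
from collections import Counter


def filter_by_cumulative_freq(user_vocabs, threshold):
    """Sum each (word, encoded_string) frequency over all users and keep
    only the pairs whose cumulative frequency exceeds the threshold."""
    cumulative = Counter()
    for vocab in user_vocabs:
        cumulative.update(vocab)  # adds the counts of each mapping
    return {pair: f for pair, f in cumulative.items() if f > threshold}
```

  • Pairs at or below the threshold are treated as accidental input, such as a one-off typo by a single user, and are dropped.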
  • The embodiment shown in FIG. 2 may further include a word frequency assignment step: count the number of occurrences of the word having the new encoded character string in the preset Internet page database to obtain the Internet word frequency; compare the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string; and, according to the comparison result, allocate the Internet word frequency to the two or more corresponding encoded character strings of the word.
  • Referring to FIG. 3, a structural block diagram of an input method system according to the present invention is shown; the system includes an input interface unit 301, a display unit 302, and a system vocabulary 303.
  • the input method system further includes:
  • a word extracting unit 304 connected to the input method system, for extracting a word selected by the user and a coded string input by the user during the user input process;
  • a word matching unit 305, connected to the word extracting unit 304, configured to compare the word selected by the user and the encoded string input by the user with the system vocabulary, where the system vocabulary stores words and their corresponding encoded strings, and to determine, according to the preset rule, the new encoded string corresponding to the word.
  • The preset rule may be: if the word selected by the user exists in the system vocabulary, but the encoded string input by the user differs from the encoded string stored for that word in the system vocabulary, the encoded string input by the user is determined to be a new encoded string corresponding to the word.
  • the input method system can be used to extract a new code string of the user in addition to the ordinary word input.
  • The input method system may be an ordinary input method system, in which the input interface unit 301, the display unit 302, and the system vocabulary 303 are located in the same computing device, and the system displays the corresponding characters locally by performing a local query match on the encoded information input by the user.
  • the input method system may also be a network input method system.
  • the input interface unit 301 and the display unit 302 of the input method system are located in a first computing device, and the system vocabulary 303 is located in a second computing device. The system obtains corresponding information from the second computing device according to the information input by the user, and displays the corresponding character in the first computing device.
  • the input method system preferably further comprises: a communication unit
  • Preferably, the input method system further includes: a word frequency recording unit 307, connected to the input method system, for recording the user word frequency during user input, where the user word frequency is the frequency information corresponding to the encoded string input by the user; and a user vocabulary 308, configured to store the word selected by the user, the encoded string input by the user, and the corresponding user word frequency.
  • The input interface unit 301 in the above input method system is mainly used to let the user input information and select words; it can also be used for switching between various modes, for example: input language switching (such as switching between Simplified and Traditional Chinese, or between Chinese and English), input mode switching (such as switching between single-character, word, and sentence input), and input state switching (such as switching between text, punctuation, and special symbols).
  • the display unit 302 and the system vocabulary 303 are all well known to those skilled in the art and will not be described in detail herein.
  • The input method system shown in FIG. 3 may further include: an application determining unit 309, configured to determine the application in which the user is currently inputting and to send the result to the word frequency recording unit 307. The word frequency recording unit 307 then counts the word frequency information separately for each application during user input and applies the corresponding weight to each count to form the user word frequency. That is, the input method system can assign weights and count word frequencies according to the application in which the user is currently inputting. For example, because the preferred method of the present invention can obtain the Internet word frequency by statistics, words that the user inputs in an online community forum can already be counted from the Internet and can therefore be given a relatively low weight value.
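  • The application-dependent weighting can be sketched as a weighted sum of per-application counts. The application labels, the weight values, and the default weight of 1.0 are assumptions chosen for illustration; the patent does not specify concrete weights.

```python
def weighted_user_freq(counts_by_app, weights, default_weight=1.0):
    """Combine per-application input counts for one (word, encoded_string)
    into a single user word frequency using per-application weights."""
    return sum(
        count * weights.get(app, default_weight)
        for app, count in counts_by_app.items()
    )


# Hypothetical weights: forum input is down-weighted because forum text is
# already reflected in the Internet word frequency.
APP_WEIGHTS = {"forum": 0.5}
```

  • With these assumed weights, `weighted_user_freq({"forum": 10, "chat": 4}, APP_WEIGHTS)` counts the ten forum inputs at half weight and the four chat inputs at full weight.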
  • FIG. 4 is a structural block diagram of a thesaurus generating apparatus of the present invention, which includes the following components: a collecting unit 401, configured to collect each user's word records having new encoded character strings, where a word record includes the word and its corresponding new encoded character string.
  • the vocabulary generating means can be implemented by a server, and the collecting can be implemented by various methods as described above.
  • A user's word records having new encoded character strings can be obtained by the input method and sent automatically to the collecting unit 401; or they can be set or organized by the user and sent to the collecting unit 401; or each user's word records having new encoded character strings can be gathered in a fixed network space, from which the collecting unit 401 acquires them. That is, in this embodiment the word records having new encoded character strings are not necessarily obtained from the user's input behavior and may instead be set or organized by the user.
  • a first filtering unit 402 configured to remove duplicate word records
  • the thesaurus generating unit 403 is configured to generate a new thesaurus according to the filtered word records or add the filtered word records to the original thesaurus to obtain a new thesaurus or a new version of the full thesaurus.
  • Preferably, the vocabulary generating device is further configured to collect the user word frequency from the user input behavior, where the user word frequency is the frequency information corresponding to the encoded character string input by the user; the device then further includes a cumulative word frequency calculation unit 404, configured to calculate the user cumulative word frequency corresponding to each encoded character string, and a second filtering unit 405, configured to remove the encoded character strings whose user cumulative word frequency is less than or equal to the preset threshold.
  • Preferably, the user word frequency may also be counted separately according to the application in which the user is currently inputting, with the corresponding weights applied.
  • the vocabulary generating device preferably further includes:
  • the Internet page database generating unit 406 is configured to perform weight assignment on the Internet page; and store the Internet page whose weight value is greater than or equal to the preset threshold to the Internet page database.
  • the statistic unit 407 is configured to count the number of occurrences of the words in the filtered word record in the preset Internet page database, and obtain the Internet word frequency.
  • a word frequency allocation unit 408, configured to compare the user cumulative word frequency of the new encoded character string of the word with the user cumulative word frequency of the original encoded character string and, according to the comparison result, allocate the Internet word frequency to the two or more corresponding encoded character strings of the word.
  • The user cumulative word frequency corresponding to the original encoded character string may be obtained by other means; alternatively, the collecting unit 401 may also collect the original encoded character string of the word and the corresponding user word frequency information, and the user word frequencies of the individual users are then combined to obtain the user cumulative word frequency.
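  • The roles of the Internet page database generating unit 406 and the statistic unit 407 can be sketched as follows. How page weights are computed is not covered here; pages arrive with a weight already assigned. Counting by plain substring match is a simplification assumed for this sketch, and both function names are hypothetical.

```python
def build_page_database(pages, weight_threshold):
    """Keep the text of pages whose weight is greater than or equal to the
    threshold. pages is an iterable of (page_text, weight) tuples."""
    return [text for text, weight in pages if weight >= weight_threshold]


def internet_word_freq(word, page_database):
    """Count non-overlapping occurrences of the word across stored pages."""
    return sum(text.count(word) for text in page_database)
```

  • Low-weight pages such as duplicates or spam never enter the database, so they cannot inflate the Internet word frequency.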
  • the collecting unit 501 is configured to collect input behavior information of each user, where the input behavior information includes the word selected by the user, the encoded character string input by the user, and the corresponding user word frequency, the user word frequency being the frequency with which the user inputs the word with the corresponding encoded character string;
  • the cumulative word frequency calculation unit 502 is configured to apply weight correction to each user's word frequency for a given word and encoded character string and to calculate the overall user cumulative word frequency for that word and encoded character string;
  • the thesaurus generating unit 503 is configured to generate a thesaurus, the thesaurus includes a word, an encoded string, and corresponding word frequency information.
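  • The work of the cumulative word frequency calculation unit 502 and the thesaurus generating unit 503 can be sketched as a weighted accumulation over all users. The per-user weight table, the default weight of 1.0, and the function name are assumptions of this sketch; the patent does not state how the correction weights are chosen.

```python
from collections import defaultdict


def cumulative_word_freq(user_records, user_weights):
    """Return {(word, encoded_string): user cumulative word frequency}.

    user_records is an iterable of (user, word, encoded_string, frequency);
    each frequency is multiplied by that user's correction weight before
    being added to the total (weight 1.0 if the user has no entry).
    """
    totals = defaultdict(float)
    for user, word, code, freq in user_records:
        totals[(word, code)] += freq * user_weights.get(user, 1.0)
    return dict(totals)
```

  • The resulting mapping already has the shape of the generated thesaurus: a word, an encoded character string, and the corresponding word frequency information.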
  • The thesaurus generating apparatus shown in FIG. 5 may further include: a matching unit 504, configured to compare the generated thesaurus with the existing thesaurus, where the existing thesaurus stores words, encoded strings, and the corresponding system word frequencies; and a new code string determining unit 505, configured to determine the new code string corresponding to a word according to the preset rule. The thesaurus generating apparatus can thereby acquire new encoded strings.
  • the vocabulary generating device further includes: a filtering unit 506, configured to remove an encoded character string whose user accumulated word frequency is less than or equal to a preset threshold.
  • the vocabulary generating device preferably further includes:
  • the statistical unit 507 counts the number of occurrences of the word with the new encoded string in the preset Internet page database, and obtains the Internet word frequency.
  • The Internet page database is formed by assigning weights to Internet pages and storing the Internet pages whose weight value is greater than or equal to the preset threshold.
  • the word frequency allocation unit 508 is configured to compare the user cumulative word frequency of the new encoded string of the word with the user cumulative word frequency of the original encoded string and, according to the comparison result, allocate the Internet word frequency to the two or more corresponding encoded strings of the word.
  • Referring to FIG. 6, another thesaurus generating apparatus is shown, which includes the following components:
  • the collecting unit 601 is configured to collect input behavior information of each user, where the input behavior information includes the word selected by the user, the encoded character string input by the user, and the corresponding user word frequency, the user word frequency being the frequency with which the user inputs the word with the corresponding encoded character string;
  • the cumulative word frequency calculation unit 602 is configured to apply weight correction to each user's word frequency for a given word and encoded character string and to calculate the overall user cumulative word frequency for that word and encoded character string;
  • the thesaurus generating unit 603 is configured to generate a thesaurus, the thesaurus includes words, encoded strings, and corresponding word frequency information.
  • the comparison unit 604 is configured to compare the generated thesaurus and the existing thesaurus, wherein the existing thesaurus stores the words, the encoded strings, and the corresponding system word frequencies;
  • the expired word determining unit 605 is configured to determine expired words, where an expired word is a word that does not exist in the generated thesaurus but exists in the existing thesaurus, or a word in the generated thesaurus whose user cumulative word frequency meets a preset condition, for example, a user cumulative word frequency less than or equal to a predetermined threshold.
  • The existing thesaurus can be streamlined according to these expired words to prevent it from growing ever larger; for example, the expired words are filtered out of and deleted from the existing thesaurus, thereby reducing the volume of the thesaurus, improving its utilization, and improving input efficiency.
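  • The expired-word determination and the subsequent pruning can be sketched as below. Both thesauri are simplified to word-to-frequency mappings, and the function names are hypothetical; the threshold condition follows the example given in the text.

```python
def find_expired_words(generated, existing, threshold):
    """Return the set of expired words.

    generated: {word: user cumulative word frequency}
    existing:  {word: system word frequency}
    A word is expired if it is in the existing thesaurus but missing from
    the generated one, or if its cumulative frequency in the generated
    thesaurus is less than or equal to the threshold.
    """
    missing = {word for word in existing if word not in generated}
    rare = {word for word, freq in generated.items() if freq <= threshold}
    return missing | rare


def prune_thesaurus(existing, expired):
    """Return a copy of the existing thesaurus without the expired words."""
    return {word: f for word, f in existing.items() if word not in expired}
```

  • Running the two functions in sequence yields a slimmer thesaurus while leaving the original mapping untouched, so the pruning can be reviewed before it replaces the deployed version.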

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a method for obtaining a new encoded character string and to a word input method, comprising: extracting the word selected by the user during input and the encoded character strings input by the user; comparing the word selected by the user and the encoded character strings input by the user with the existing word base, where the existing word base stores the word and the encoded character string corresponding to the word; and determining the new encoded character string corresponding to the word according to the preset rule.
PCT/CN2007/070518 2006-08-23 2007-08-20 Methods for obtaining a new encoded character string, input system and method, and word base generating device WO2008028421A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNB2006101115629A CN100424703C (zh) 2006-08-23 2006-08-23 获取新编码字符串的方法及输入法系统、词库生成装置
CN200610111562.9 2006-08-23

Publications (1)

Publication Number Publication Date
WO2008028421A1 true WO2008028421A1 (fr) 2008-03-13

Family

ID=37778551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/070518 WO2008028421A1 (fr) 2006-08-23 2007-08-20 Methods for obtaining a new encoded character string, input system and method, and word base generating device

Country Status (2)

Country Link
CN (1) CN100424703C (fr)
WO (1) WO2008028421A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109426356A (zh) * 2017-09-01 2019-03-05 百度在线网络技术(北京)有限公司 信息输入方法和装置
CN112732098A (zh) * 2019-10-12 2021-04-30 北京搜狗科技发展有限公司 一种输入的方法及相关装置

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101286118B (zh) * 2007-04-10 2012-04-18 北京搜狗科技发展有限公司 一种快速调用程序指令的方法、系统
CN100483416C (zh) * 2007-05-22 2009-04-29 北京搜狗科技发展有限公司 一种字符输入的方法、输入法系统及词库更新的方法
CN100483417C (zh) * 2007-05-25 2009-04-29 北京搜狗科技发展有限公司 获取限制词信息的方法、优化输出的方法和输入法系统
CN101727271B (zh) * 2008-10-22 2012-11-14 北京搜狗科技发展有限公司 一种提供纠错提示的方法、装置及输入法系统
CN103064825B (zh) * 2011-10-18 2016-03-02 阿里巴巴集团控股有限公司 模糊音对建立、设置方法和输入法及其装置和系统
CN103870001B (zh) * 2012-12-11 2018-07-10 百度国际科技(深圳)有限公司 一种生成输入法候选项的方法及电子装置
CN103869998B (zh) * 2012-12-11 2018-05-01 百度国际科技(深圳)有限公司 一种对输入法所产生的候选项进行排序的方法及装置
CN103870000B (zh) * 2012-12-11 2018-12-14 百度国际科技(深圳)有限公司 一种对输入法所产生的候选项进行排序的方法及装置
CN105892710B (zh) * 2015-01-20 2021-10-15 张龙哺 基于文本框的汉字输入方法及其装置
CN106249914A (zh) * 2016-08-03 2016-12-21 太仓美宅姬娱乐传媒有限公司 一种文字输入方法及其系统
CN107247519B (zh) * 2017-08-16 2020-09-29 北京搜狗科技发展有限公司 一种输入方法及装置

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1494025A (zh) * 2002-10-31 2004-05-05 英业达股份有限公司 具有分类词库的中文输入方法及其系统
US20050251384A1 (en) * 2004-05-05 2005-11-10 Microsoft Corporation Word extraction method and system for use in word-breaking

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060048055A1 (en) * 2004-08-25 2006-03-02 Jun Wu Fault-tolerant romanized input method for non-roman characters

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109426356A (zh) * 2017-09-01 2019-03-05 百度在线网络技术(北京)有限公司 信息输入方法和装置
CN109426356B (zh) * 2017-09-01 2022-07-15 百度在线网络技术(北京)有限公司 信息输入方法和装置
CN112732098A (zh) * 2019-10-12 2021-04-30 北京搜狗科技发展有限公司 一种输入的方法及相关装置

Also Published As

Publication number Publication date
CN1920827A (zh) 2007-02-28
CN100424703C (zh) 2008-10-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07800994

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 07800994

Country of ref document: EP

Kind code of ref document: A1