CN101013443A - Intelligent word input method and input method system and updating method thereof - Google Patents

Info

Publication number
CN101013443A
Authority
CN
China
Prior art keywords
combined information
words
input method
internet
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007100792674A
Other languages
Chinese (zh)
Other versions
CN100458795C (en
Inventor
郭奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Family has litigation
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CNB2007100792674A (patent CN100458795C)
Publication of CN101013443A
Priority to PCT/CN2008/070270 (patent WO2008098507A1)
Application granted
Publication of CN100458795C
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02: Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023: Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233: Character input methods
    • G06F3/0237: Character input methods using prediction or retrieval techniques

Abstract

The invention discloses an intelligent word input method for an input method system. The method comprises: obtaining, from a preset internet corpus, combined information on the adjacent co-occurrence of at least two basic words, the combined information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; generating a multivariate table from the combined information; receiving a coded string entered by the user and segmenting it; and looking up the corresponding combined information in the multivariate table according to the segmented coded string, then extracting the words having the corresponding collocation relation in that combined information as candidate words. The invention can raise the first-choice hit rate when the user enters multiple words, phrases, short sentences or long sentences, and avoids invalid and repeated computation, thereby improving the user's input efficiency.

Description

Intelligent word input method, input method system, and updating method thereof
Technical field
The present invention relates to the field of data processing in input method systems, and in particular to a method of intelligent word input, an input method system, a device for generating a multivariate table, and a method of updating an input method system.
Background
Current input method systems (for Chinese, Japanese, Korean and so on) inevitably face the problem that one code corresponds to several candidate words. Take pinyin input methods such as Pinyin JiaJia and the Ziguang Huayu pinyin input method as examples: these existing input methods order the candidates offered to the user according to the word frequency (the usage frequency of a word) recorded in their character and word dictionaries, and show the most frequent common word first as the first-choice word. The ordering of candidates largely determines the first-choice hit rate during input. The first-choice hit rate is the rate at which, after the user enters certain key information, the top-ranked character, word or sentence is the one the user actually wants.
For example, when the user enters the pinyin "guan xi tui li" (relation inference), an existing input method retrieves all candidate words in the dictionary for "guan xi", such as "relation", "washing" and "Northwest", and shows the most frequent common word, "relation", as the first choice. Likewise, it takes the most frequent word for "tui li", "inference", as the first choice, and offers the user "relation inference". In this example the first-choice hit rate is 100%, that is, the result fully meets the user's need.
Technically, of course, an input method system cannot itself know which word the user wants. But in the vast Chinese vocabulary, words differ in how often they are used, so ranking more frequent words first greatly improves the first-choice hit rate: in probabilistic terms, it raises the chance that the top-ranked word is the one the user needs.
However, the word the user needs is not always the most frequent one. For example, if the user enters "zi zhu xue xiao" (subsidize school) and the most frequent word the input method finds for "zi zhu" is "autonomous", the first choice is "autonomous school", and the user must pick "subsidize" from the full candidate list to get the intended result. In practice, users of existing input methods obtain the result they need by selecting among candidates far more often than by directly accepting a correct first choice. This shows that the first-choice hit rate of existing input methods is low, which slows input, reduces input efficiency and degrades the user experience.
The prior art offers two solutions to this problem.
First, add more words to the input method's dictionary.
This requires adding a sufficiently large number of words to the dictionary before it has any effect. For example, for the user to enter "intelligent word grouping", the dictionary must store the three words "intelligent", "word grouping" and "intelligent word grouping", and may even have to store "intelligent group", a fragment with no meaning of its own. For phrases or sentences made up of many words, even more entries are needed. The dictionary thus becomes increasingly bloated, takes up more space and wastes more resources.
Second, apply NLP (Natural Language Processing) technology.
Applying NLP in an input method system can raise the first-choice hit rate through part-of-speech and syntactic analysis. The Microsoft Pinyin input method, for example, uses an NLP technique that combines an N-gram statistical language model with linguistic rules to guide the conversion from a pinyin stream to a character stream. It relies mainly on the grammatical and semantic classification systems of the "Modern Chinese Grammatical Information Dictionary: Detailed Explanation" and the "Tongyici Cilin" synonym thesaurus, and encodes the grammatical and semantic relations among parts of speech as manually edited collocation rules between parts of speech together with corresponding attribute vocabularies.
However, building such an input method system requires specialists to analyze and edit on the basis of a fixed corpus, so the implementation is complicated and laborious. The fixed corpus cannot be updated at will, which degrades the user experience. Such a system also takes up a lot of space: the installation package of the Microsoft Pinyin input method, for example, exceeds 70 MB, which raises the barrier to use and wastes the user's system resources.
Therefore, a technical problem that urgently needs to be solved is how to raise an input method system's first-choice hit rate for multiple words, phrases, short sentences or long sentences while conserving resources as far as possible.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method of intelligent word input and an input method system that address the low first-choice hit rate for multiple words, phrases, short sentences or long sentences and the excessive resource usage of the prior art.
Another object of the present invention is to provide a method of generating a multivariate table and a method of updating an input method system that ensure the accuracy, representativeness and comprehensiveness of the output words, thereby raising the first-choice hit rate when the user enters multiple words, phrases, short sentences or long sentences and, in turn, effectively improving the user's input efficiency.
To solve the above technical problem, an embodiment of the invention discloses a method of intelligent word input, comprising:
Obtaining, from a preset internet corpus, combined information on the adjacent co-occurrence of at least two basic words, the combined information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; generating a multivariate table from the combined information; receiving a coded string entered by the user and segmenting the coded string; and obtaining the corresponding combined information from the multivariate table according to the segmented coded string, and extracting the words of the corresponding collocation relation in the combined information as candidate words.
Preferably, the method further comprises: calculating a co-occurrence probability from the adjacent co-occurrence frequency of the candidate words and the word frequencies of existing words in the dictionary of the input method system, sorting by the co-occurrence probability, and outputting the sorted result as candidate items.
Preferably, the multivariate table stores a co-occurrence probability calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequencies of existing words in the dictionary of the input method system, and the method further comprises: calculating a weighted value from the co-occurrence probability of the candidate words and the word frequencies of existing words in the dictionary of the input method system, sorting by the weighted value, and outputting the sorted result as candidate items.
Preferably, the multivariate table stores a connection strength value calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words, and the method further comprises: calculating a weighted value from the connection strength value of the candidate words and the word frequencies of existing words in the dictionary of the input method system, sorting by the weighted value, and outputting the sorted result as candidate items.
Preferably, the method further comprises: selecting basic words that meet a preset condition from the dictionary of the input method system.
Preferably, before generating the multivariate table, the method further comprises:
If the adjacent co-occurrence frequency in an item of combined information is below a certain threshold, removing that combined information;
If the words in an item of combined information are simply a combination of two or more highest-frequency words, removing that combined information;
If an item of combined information is partly or wholly covered by another item of combined information, removing that combined information.
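The three removal rules above can be sketched as follows. This is only an illustrative reading, not the patent's implementation: the function name, the `top_word` and `code_of` lookups, the default threshold, and the interpretation of "covered" as "appears as a contiguous run inside a longer combination" are all assumptions.

```python
def filter_combinations(combos, top_word, code_of, min_freq=2):
    """combos maps a tuple of words to its adjacent co-occurrence frequency.

    top_word maps a code to its highest-frequency word and code_of maps a
    word to its code; both are hypothetical helpers for this sketch.
    """
    kept = {}
    for words, freq in combos.items():
        if freq < min_freq:
            continue  # rule 1: co-occurrence frequency below the threshold
        if all(top_word.get(code_of.get(w)) == w for w in words):
            continue  # rule 2: plain word frequency already yields this result
        kept[words] = freq

    def covered(short, long):
        # True if `short` occurs as a contiguous run inside a longer combination.
        n = len(short)
        return len(long) > n and any(long[i:i + n] == short
                                     for i in range(len(long) - n + 1))

    # rule 3 (one possible reading): drop combinations covered by another kept one
    return {words: freq for words, freq in kept.items()
            if not any(covered(words, other) for other in kept)}
```

Rule 2 reflects the idea that a combination made only of first-choice words adds nothing, because ranking by word frequency alone would already produce it.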
Preferably, the internet corpus is preset by the following steps: obtaining internet pages by web crawler technology; and selecting the page information that meets a preset condition and saving it to form the internet corpus.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
Preferably, before receiving the coded string entered by the user, the method further comprises: loading the multivariate table into a storage device.
Preferably, the method further comprises: optimizing the segmentation of the coded string.
Preferably, the method further comprises: obtaining the corresponding combined information from the multivariate table according to a coded string newly added by the user.
An embodiment of the invention also provides an input method system comprising an input interface unit and a display unit, and further comprising: a multivariate table, generated from combined information on the adjacent co-occurrence of at least two basic words, the combined information being obtained from a preset internet corpus and comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; a segmentation unit, configured to segment the coded string entered by the user; and an extraction unit, configured to obtain the corresponding combined information from the multivariate table according to the segmented coded string and to extract the words of the corresponding collocation relation in the combined information as candidate words.
Preferably, the input method system further comprises a first output unit, configured to calculate a co-occurrence probability from the adjacent co-occurrence frequency of the candidate words and the word frequencies of existing words in the dictionary of the input method system, sort by the co-occurrence probability, and output the sorted result as candidate items.
Preferably, the multivariate table stores a co-occurrence probability calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequencies of existing words in the dictionary of the input method system, and the input method system further comprises a second output unit, configured to calculate a weighted value from the co-occurrence probability of the candidate words and the word frequencies of existing words in the dictionary of the input method system, sort by the weighted value, and output the sorted result as candidate items.
Preferably, the multivariate table stores a connection strength value calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words, and the input method system further comprises a third output unit, configured to calculate a weighted value from the connection strength value of the candidate words and the word frequencies of existing words in the dictionary of the input method system, sort by the weighted value, and output the sorted result as candidate items.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
Preferably, the input method system further comprises a loading unit, configured to load the multivariate table into a storage device.
Preferably, the input method system further comprises a segmentation optimization unit, configured to optimize the segmentation of the coded string.
Preferably, the input method system further comprises an incremental acquisition unit, configured to obtain the corresponding combined information from the multivariate table according to a coded string newly added by the user.
Preferably, the input interface unit, the display unit and the multivariate table of the input method system are located in the same computing device; or the input interface unit and the display unit are located in a first computing device and the multivariate table is located in a second computing device, in which case the input method system obtains the corresponding combined information from the multivariate table in the second computing device according to the information entered by the user and displays the corresponding words on the first computing device.
An embodiment of the invention also provides a device for generating a multivariate table, comprising: an acquisition module, configured to obtain, from a preset internet corpus, combined information on the adjacent co-occurrence of at least two basic words, the combined information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; and a generation module, configured to generate a multivariate table from the combined information.
Preferably, the device further comprises a selection module, configured to select basic words that meet a preset condition from the dictionary of the input method system.
Preferably, the device further comprises a first removal module, configured to remove an item of combined information when its adjacent co-occurrence frequency is below a certain threshold;
And/or a second removal module, configured to remove an item of combined information when its words are simply a combination of two or more highest-frequency words;
And/or a third removal module, configured to remove an item of combined information when it is partly or wholly covered by another item of combined information.
Preferably, the device further comprises: a web page acquisition module, configured to obtain internet pages by web crawler technology; and a corpus generation module, configured to select the page information that meets a preset condition and save it to form the internet corpus.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
An embodiment of the invention also provides a method of updating an input method system, comprising: updating the internet corpus; obtaining, from the preset internet corpus, combined information on the adjacent co-occurrence of at least two basic words, the combined information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; generating a multivariate table from the combined information; and sending the multivariate table to the input method system.
Preferably, the method further comprises: selecting basic words that meet a preset condition from the dictionary of the input method system.
Preferably, before generating the multivariate table, the method further comprises: if the adjacent co-occurrence frequency in an item of combined information is below a certain threshold, removing that combined information; if the words in an item of combined information are simply a combination of two or more highest-frequency words, removing that combined information; and if an item of combined information is partly or wholly covered by another item of combined information, removing that combined information.
Compared with the prior art, the present invention has the following advantages:
First, because the input method system's word output is based on a preset internet corpus, the invention accurately reflects current trends in language use and ensures that the combined information is accurate, representative and comprehensive. This raises the first-choice hit rate when the user enters multiple words, phrases, short sentences or long sentences and, in turn, effectively improves the user's input efficiency.
Second, the invention uses the generated multivariate table as the main source of output words. The technique is simple to implement, involves no special proprietary algorithm, and effectively avoids invalid and repeated computation, which saves resources and improves efficiency.
Third, the internet corpus can be set, updated or replaced at will by those skilled in the art, so different versions of intelligent word input can be produced to meet the needs of different users.
In addition, by setting filtering rules to select only effective combined information for the multivariate table, the invention avoids redundancy in the multivariate table and saves system resources.
Finally, the invention also applies several optimization strategies to avoid invalid and repeated computation and lighten the system load, thereby effectively improving the user's input efficiency.
Description of drawings
Fig. 1 is a flowchart of an embodiment of the method of intelligent word input in an input method system;
Fig. 2 is a flowchart of a preferred embodiment of the intelligent word input method of the invention;
Fig. 3 is a structural block diagram of an embodiment of the input method system of the invention;
Fig. 4 is a structural block diagram of an embodiment of the device for generating a multivariate table of the invention;
Fig. 5 is a flowchart of a preferred embodiment of generating a multivariate table with the device shown in Fig. 4;
Fig. 6 is a flowchart of embodiment 1 of updating an input method system according to the invention;
Fig. 7 is a flowchart of embodiment 2 of updating an input method system according to the invention.
Detailed description
To make the above objects, features and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flowchart of an embodiment of the method of intelligent word input in an input method system, the method comprises the following steps:
Step 101: obtain, from a preset internet corpus, combined information on the adjacent co-occurrence of at least two basic words;
The combined information comprises the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
Step 102: generate a multivariate table from the combined information;
Step 103: receive the coded string entered by the user and segment the coded string;
Step 104: obtain the corresponding combined information from the multivariate table according to the segmented coded string, and extract the words of the corresponding collocation relation in the combined information as candidate words.
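Steps 101 to 104 can be illustrated with a minimal sketch. It assumes the corpus is already split into words, that the dictionary maps each pinyin code to its candidate words, and that segmentation is a simple greedy longest match; the function names and the tiny sample data (English glosses standing in for Chinese words) are hypothetical.

```python
from collections import Counter

def build_table(corpus, basic_words):
    """Steps 101-102: count how often two basic words appear side by side."""
    table = Counter()
    for tokens in corpus:  # each sentence is already a list of words
        for left, right in zip(tokens, tokens[1:]):
            if left in basic_words and right in basic_words:
                table[(left, right)] += 1
    return table

def segment(code, dictionary):
    """Step 103: greedy longest-match split of the coded string."""
    parts, i = [], 0
    while i < len(code):
        for j in range(len(code), i, -1):
            if code[i:j] in dictionary:
                parts.append(code[i:j])
                i = j
                break
        else:
            raise ValueError(f"no dictionary code matches at position {i}")
    return parts

def candidates(table, dictionary, parts):
    """Step 104 for a two-part input: word pairs found in the table, most frequent first."""
    first, second = parts
    found = [((left, right), table[(left, right)])
             for left in dictionary[first]
             for right in dictionary[second]
             if (left, right) in table]
    return sorted(found, key=lambda item: item[1], reverse=True)

dictionary = {"zizhu": ["autonomous", "subsidize"], "xuexiao": ["school"]}
corpus = [["subsidize", "school"], ["subsidize", "school"], ["autonomous", "school"]]
table = build_table(corpus, {"autonomous", "subsidize", "school"})
print(candidates(table, dictionary, segment("zizhuxuexiao", dictionary)))
```

In this toy corpus "subsidize school" co-occurs more often than "autonomous school", so it is ranked first even if "autonomous" were the more frequent single word.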
As the pace of society quickens and cultures continually collide and merge, modern society uses a great deal of vocabulary that existing fixed corpora come nowhere near covering. With the spread of the internet and the rapid growth of information, this problem has become more and more pronounced. A fixed corpus is small, static, compiled long ago and slow to update, so word frequencies derived from it do not reflect active internet usage. For example, common internet words such as "top", "online game" and "financial report" are used very frequently, yet in the prior art they are generally ranked far down the candidate list, which does not match users' need to use them often.
Accordingly, this embodiment obtains combined information on the adjacent co-occurrence of at least two basic words from a preset internet corpus; that is, open and constantly changing internet information is the statistical source of the multivariate table. When the user enters information, words that are used frequently on the internet can become the first-choice word or appear on the first page of candidates, which improves the user's input speed and efficiency.
Those skilled in the art may preset the internet corpus as needed, for example as an internet blog corpus, an internet news corpus or an internet forum corpus. It will be understood that different internet corpora yield different combined information, so the output of the input method system may differ as well. Preferably, the internet corpus can also be replaced to meet the needs of different users.
The basic words may come from various specific closed document collections (for example, traditional news or newspapers), and those skilled in the art may choose freely as needed in practice. Preferably, the basic words are obtained from the dictionary of the input method system. Although that dictionary may contain a very large amount of word information, only part of it consists of effective words, that is, common words with a relatively high usage frequency; the rest are rare words or words with a very low usage frequency. Computing over all the basic words in the dictionary would obviously lead to an excessive amount of computation and too much repeated computation.
It should be noted that the dictionary of the input method system here may be any prior-art dictionary or combination of dictionaries, or any dictionary that those skilled in the art obtain according to preset rules. The invention places no limit on where the dictionary is stored, for example on the server side or the client side. It will be appreciated that the system dictionary, user-defined dictionary, general dictionary, specialized dictionary and so on of prior-art input method systems all fall within the scope of the dictionary of the input method system of the invention.
Therefore, this embodiment may preferably also comprise the step of selecting basic words that meet a preset condition from the dictionary of the input method system, for example the top 60,000 words in the dictionary ranked by word frequency from high to low. Carrying out the subsequent processing on the selected words effectively avoids invalid and repeated computation, which saves resources and improves efficiency.
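The selection of the top 60,000 words by frequency can be sketched as below. The function name and the representation of the dictionary as a word-to-frequency mapping are assumptions made for illustration.

```python
import heapq

def choose_basic_words(word_freq, top_n=60000):
    """Return the top_n words of the dictionary ranked by word frequency."""
    # nlargest avoids sorting the whole dictionary when top_n is small relative to it
    return set(heapq.nlargest(top_n, word_freq, key=word_freq.get))
```

Returning a set keeps the later membership checks (is this corpus word a basic word?) cheap when the corpus is scanned.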
However, the combined information obtained from the selected basic words may still contain redundant or invalid items, for example combined information whose adjacent co-occurrence frequency is too low, combined information that duplicates the meaning of another item, or combined information that is partly or wholly covered by another item. Preferably, therefore, this embodiment also comprises some optimization steps before the multivariate table is generated; these are described in detail below.
It should be noted that one of the core ideas in generating the multivariate table is to remove low-value items from the combined information according to certain rules and keep the high-value items as the content of the multivariate table. A multivariate table generated from combined information is a table whose rows or columns involve two or more variables. The multivariate table may take the form shown in the following table:
Collocation                  Connection parameter
Download-people              0.0065
A lot-wooden horse           1.6596
Sky-be covered with          18.3775
Piece-guided missile         46.1310
Come more-have more          532276
Because of worker-disable    347646.3438
In the table above, the first column shows the collocation relation between several words and the second column shows the connection parameter of that collocation. The connection parameter may be the adjacent co-occurrence frequency, the co-occurrence probability, the connection strength value or the like. The adjacent co-occurrence frequency can be obtained by counting in the preset internet corpus; the co-occurrence probability can be calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequencies of existing words in the dictionary; and the connection strength value can be calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words. Of course, the connection parameter may be any value that expresses the connection between words, and the invention is not limited in this respect. The form of the multivariate table may likewise be set freely as needed, and the invention does not limit this either.
In practice, the multivariate table may also be packaged and stored in the input method system so that users can download it and install it locally. Those skilled in the art may choose any storage scheme as needed or from experience, and the invention is not limited in this respect. For example, the combined information and its weighted values are stored in a file in ascending word order, where the weighted value may be set according to the adjacent co-occurrence frequency: the higher the adjacent co-occurrence frequency, the larger the weighted value. A general-purpose compression algorithm such as RAR or ZIP is then used to pack the file and store it in the input method system.
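A minimal sketch of this storage scheme, using ZIP from Python's standard library. The line layout (words joined by a hyphen, a tab, then the weighted value), the member name `table.txt` and the function names are assumptions; the text fixes only the word ordering and the use of a general-purpose compressor.

```python
import io
import zipfile

def pack_table(table, member="table.txt"):
    """Write entries in ascending word order and compress them into a ZIP archive."""
    lines = [f"{'-'.join(words)}\t{weight}" for words, weight in sorted(table.items())]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member, "\n".join(lines))
    return buffer.getvalue()

def unpack_table(data, member="table.txt"):
    """Read the packed lines back, for example after the user downloads the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.read(member).decode("utf-8").splitlines()
```

Keeping the entries sorted makes the file deterministic for a given table, which also tends to help the compressor because neighbouring lines share prefixes.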
Before the coded string entered by the user is received, this embodiment may preferably also comprise the step of loading the multivariate table into a storage device. If the user starts the input method system on a local computer, the multivariate table can be loaded into memory, which improves the performance of the input method system: once the table is loaded, all subsequent reads are served from memory with no disk access, which effectively improves the user's input speed and efficiency. If the input method system is a network-based input method, the multivariate table can be loaded into a storage device on the server when the user uses it, and all subsequent reads are served from that storage device.
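Loading the table once and serving every later lookup from memory might look like the sketch below. The text format (hyphen-joined words, a tab, then a numeric connection parameter) and the function name are assumptions for illustration only.

```python
def load_table(lines):
    """Parse table lines once into an in-memory dict keyed by word tuples."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, value = line.rsplit("\t", 1)
        table[tuple(key.split("-"))] = float(value)
    return table

# After loading, a lookup is a plain dict access with no further file reads.
table = load_table(["a-b\t0.5\n", "\n", "c-d\t2\n"])
print(table.get(("a", "b")))
```

In a real deployment `lines` would come from an open file at start-up; a list is used here so the example runs without touching the disk.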
When the user used this input method system, this input method system can carry out cutting to the coded string of user's input, and described cutting can realize that the present invention does not need this to limit by adopting arbitrary cutting method of the prior art.
Preferably, present embodiment can also adopt some optimisation strategy that described input method system is optimized.Below be that example describes with several preferred optimisation strategy.
Optimisation strategy A: the cutting method to described coded string is optimized.For example, adopt branch and bound method that cutting method is carried out beta pruning.
The working principle of the branch and bound method is as follows: first determine the bounds of the target value, then cut off some branches of the search tree while searching, which improves search efficiency. Applied to the embodiments of the present invention: a coded string can be cut in many ways, and for each cutting, each code can correspond to many possible words; computing all of them would require an astronomical amount of calculation. In this case, the branch and bound method calculates the probability of each possible cutting and word choice, and if a cutting is found to be very unlikely to be optimal, the current calculation is stopped and the next possibility is tried. Optimisation strategy A effectively reduces the amount of calculation and ensures that the system outputs a result within the specified time, thereby effectively improving the processing efficiency of the system.
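A minimal sketch of branch and bound applied to cutting is given here for illustration. It assumes a hypothetical `lexicon` that maps a code fragment to candidate words with probabilities in (0, 1], and it scores a cutting by the plain product of those probabilities; the source does not fix either choice. Because every factor is at most 1, the partial product is an upper bound on any completed cutting, so a branch whose partial score is no better than the best complete result can be discarded.

```python
def best_cutting(code, lexicon):
    """Return (words, score) of the highest-scoring cutting, or (None, 0.0)."""
    best = {"score": 0.0, "words": None}

    def search(pos, score, words):
        # Bound: the score can only shrink from here, so prune the branch
        # if it already cannot beat the best complete cutting found so far.
        if score <= best["score"]:
            return
        if pos == len(code):
            best["score"], best["words"] = score, list(words)
            return
        # Branch: try longer fragments first so a good bound is found early.
        for end in range(len(code), pos, -1):
            for word, prob in lexicon.get(code[pos:end], []):
                words.append(word)
                search(end, score * prob, words)
                words.pop()

    search(0, 1.0, [])
    return best["words"], best["score"]
```

Trying long fragments first is a heuristic assumption: it tends to produce a strong bound quickly, which lets more of the remaining branches be pruned.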
Of course, those skilled in the art can preset various optimisation strategies as required or from experience; the present invention does not limit this.
Preferably, this embodiment can further comprise the step of: calculating a co-occurrence probability from the adjacent co-occurrence frequency of the prepare word and the word frequencies of existing words in the dictionary of the input method system, sorting according to the co-occurrence probability, and outputting the sorted result as candidate items. Of course, other conditions can be added to the sorting as required; the present invention does not limit this.
A preferred method of calculating the co-occurrence probability is given below as an example:
$$P(w_1, w_2, w_3, \ldots, w_n) = P(w_1) \cdot P(w_2) \cdots P(w_n) \cdot P(w_1, w_2) \cdot P(w_2, w_3) \cdots P(w_{n-1}, w_n);$$
Here, $w_n$ is a basic word, $P(w_n)$ is the probability of that basic word, and $P(w_{n-1}, w_n)$ is the probability of the collocation relation between two adjacent basic words. That is, for two or more basic words, this embodiment considers the collocation relation between every pair of adjacent basic words and then calculates the product of all the probabilities.
For example, for two basic words A and B, the co-occurrence probability is the product of the probability of A, the probability of B, and the probability that A and B occur together (AB); for three basic words A, B and C, the co-occurrence probability is the product of the probabilities of A, AB, B, BC and C.
The above is one statistical algorithm for the co-occurrence probability. Those skilled in the art can also adopt other methods as required or from experience, such as directly storing an N-gram matrix. The above method is given only as an example, and the present invention is not limited to it.
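The product formula above can be written out directly. In this sketch, `unigram` and `bigram` are hypothetical lookup tables holding the probability of each basic word and of each adjacent pair; treating a missing pair as probability 0 is an assumption made here, not something the source specifies.

```python
from math import prod

def cooccurrence_probability(words, unigram, bigram):
    """Product of all single-word probabilities and all adjacent-pair probabilities."""
    p_words = prod(unigram[w] for w in words)
    # zip(words, words[1:]) yields (w1, w2), (w2, w3), ..., (w_{n-1}, w_n).
    p_pairs = prod(bigram.get(pair, 0.0) for pair in zip(words, words[1:]))
    return p_words * p_pairs
```

For three words A, B and C this multiplies exactly the five factors A, B, C, AB and BC named in the example.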
As another embodiment, when the multivariate table stores the co-occurrence probability, this embodiment can comprise the step of: calculating a weighted value from the co-occurrence probability of the prepare word and the word frequencies of existing words in the dictionary of the input method system, sorting according to the weighted value, and outputting the sorted result as candidate items. Preferably, the co-occurrence probability is calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequencies of existing words in the dictionary of the input method system. The co-occurrence probability can be obtained with the method in the example above or with other methods in the prior art; the present invention does not limit this.
As another embodiment, when the multivariate table stores the connection strength value, this embodiment can comprise the step of: calculating a weighted value from the connection strength value of the prepare word and the word frequencies of existing words in the dictionary of the input method system, sorting according to the weighted value, and outputting the sorted result as candidate items. Preferably, the connection strength value is calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words.
Of course, the multivariate table can also store any other numerical value that indicates a connection relation between words; those skilled in the art can choose one as required or from experience, and the present invention does not limit this.
One possible situation is that the user appends new input to a previously entered coded string. For this situation, this embodiment can also apply optimisation strategy B: obtain the corresponding combined information in the multivariate table only according to the coded string newly added by the user, so that the system's calculation is limited to the changed part and repeated computation is avoided. For example, the user has entered the pinyin coded string zhongguorenminjiefang (Chinese people's liberation) and then enters the letter j. With optimisation strategy B, the system obtains the corresponding combined information (such as "army", "monarch", "machine", etc.) in the multivariate table only according to the newly added letter "j", and does not need to obtain again the combined information corresponding to the preceding pinyin coded string "zhongguorenminjiefang".
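The incremental behaviour of optimisation strategy B can be sketched as a small cache. The class name `IncrementalLookup` and the `lookup` callable (code fragment to list of words) are hypothetical. As a simplification, the newly added text is treated as a fragment of its own; a real system would presumably also re-examine the cutting around the boundary.

```python
class IncrementalLookup:
    """Remember results for the input so far and look up only what was appended."""

    def __init__(self, lookup):
        self.lookup = lookup   # hypothetical: code fragment -> list of words
        self.code = ""         # input processed so far
        self.results = []      # one result list per processed fragment
        self.calls = 0         # number of table lookups performed

    def update(self, code):
        if code.startswith(self.code):
            added = code[len(self.code):]   # only the newly typed part
        else:
            self.results, added = [], code  # input was edited: start over
        if added:
            self.calls += 1
            self.results.append(self.lookup(added))
        self.code = code
        return self.results
```

With the example from the text, typing "j" after "zhongguorenminjiefang" triggers one lookup for "j" and reuses the stored result for the earlier string.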
To improve the effective utilisation of the input method system, this embodiment can also apply optimisation strategy C: preset a computation time for the system, such as 100 ms or 50 ms, to require the system to finish its calculation within the preset time; if the preset time is exceeded and the calculation is not yet finished, the results of the part already calculated are displayed on screen. For example, the user enters the pinyin coded string "renshengzigushuiwusi". When 50 ms have elapsed, the input method system has obtained only the prepare words corresponding to "renshengzigushuiwu", such as "life does not have whom from ancient times", "life", "voice", "shy with strangers", etc., and the calculation for "si" is not yet finished. With optimisation strategy C, the input method system displays on screen only the prepare words already obtained. One of the core ideas of this approach is to separate the background processing of the input method system from its foreground control, which ensures that the input method system behaves the same on different machines, or under different loads on the same machine.
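A time budget of this kind can be sketched as a loop that checks a deadline before each unit of work. The function name, the per-fragment `lookup` callable and the injectable `clock` parameter are assumptions made for illustration. The check is cooperative (it happens between fragments), so a single slow lookup can still overrun the budget.

```python
import time

def candidates_within_budget(pieces, lookup, budget_s=0.05, clock=time.monotonic):
    """Process fragments in order and return whatever is finished at the deadline."""
    deadline = clock() + budget_s
    done = []
    for piece in pieces:
        if clock() >= deadline:
            break  # out of time: show only the part already calculated
        done.append(lookup(piece))
    return done
```

Passing the clock as a parameter makes the behaviour reproducible in tests, independent of machine speed or load.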
Preferably, optimisation strategies A, B and C are used in combination in the input method system. Of course, those skilled in the art can adopt only one optimisation strategy or several, in any combination. In addition, those skilled in the art can also set up other optimisation strategies as required; the present invention does not limit this.
To make the input method system easier to transmit over a network, reduce the memory it occupies on the user's machine, and improve processing efficiency, this embodiment can also compare the combined information in the multivariate table with the words in the dictionary of the input method system; if the dictionary contains a word that duplicates a combined information item, that word is removed from the input method dictionary. For example, for the pinyin shangwuhuiyi, the corresponding combined information is: "business meetings", "meeting in the morning", "memory at noon", "memory in the morning", "commercial affairs are understanding", etc. If the dictionary of the input method system contains the corresponding word "business meetings", it duplicates the "business meetings" in the combined information, and in this case "business meetings" can be removed from the dictionary.
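The de-duplication described above amounts to a set-membership filter. In this sketch the dictionary is assumed to be a mapping from word to word frequency and the combined information is assumed to be available as a set of joined words; both representations are hypothetical.

```python
def prune_dictionary(dictionary, combined_words):
    """Drop dictionary entries that the multivariate table already provides."""
    return {word: freq for word, freq in dictionary.items()
            if word not in combined_words}
```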
With reference to Fig. 2, a flow chart of a preferred embodiment of an intelligent word input method of the present invention is shown. It comprises preset steps and input steps, specifically:
One, preset step:
Step 201, obtain internet web pages by web crawler technology;
For example, tens of web crawler servers, working from a list of website domain names, fetch nearly 4 billion of the latest web pages on the internet in real time; these internet web pages can include web content such as internet news, forums, blogs, chat rooms and the like.
Step 202, select the web page information that meets a preset condition, and save it to form the internet corpus;
For example, 40 million internet web pages are selected, and the resulting massive web page corpus, whose raw size exceeds 1 terabyte, serves as the internet corpus.
Because this embodiment uses open, constantly changing internet information as the basis for the output words, the generated multivariate table can accurately reflect trends in how people use language and can ensure the accuracy, representativeness and comprehensiveness of the combined information. This improves the first-choice hit rate when the user enters multiple words, phrases, short sentences or long sentences, and in turn effectively improves the user's input efficiency.
Of course, those skilled in the art can choose any method to preset the internet corpus as required or from experience; the present invention does not limit this. The method of presetting the internet corpus can also be a method of updating it, for example updating the internet corpus to a news corpus, a blog corpus or a forum corpus; the present invention does not limit this either.
Step 203, select the basic words that meet a preset condition from the dictionary of the input method system;
For example, select the top 60,000 words in the dictionary of the input method system, ranked by word frequency from high to low.
Step 204, obtain, from the preset internet corpus, combined information of at least two basic words that co-occur adjacently;
Wherein, the combined information comprises the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
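Step 204 can be illustrated with a short counting sketch for the two-word case. It assumes the corpus has already been split into token lists, and it normalises counts into relative frequencies; the source does not say how the frequency is normalised, so that part is an assumption.

```python
from collections import Counter

def adjacent_cooccurrence(sentences, basic_words):
    """Relative frequency of each adjacent pair made up of two basic words."""
    counts = Counter()
    for tokens in sentences:
        # Consecutive tokens form the adjacent pairs of the sentence.
        for a, b in zip(tokens, tokens[1:]):
            if a in basic_words and b in basic_words:
                counts[(a, b)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}
```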
Step 205, if the adjacent co-occurrence frequency of a combined information item is lower than a certain threshold, remove that combined information;
For example, if the adjacent co-occurrence frequency of a combined information item is lower than 0.001, that combined information is removed. Removing combined information with a low adjacent co-occurrence frequency does not affect the user's ordinary operation, but it saves system resources and lightens the system load, thereby effectively improving the processing efficiency of the system.
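The threshold rule of step 205 is a one-line filter. The sketch assumes the combined information is held as a mapping from word combination to adjacent co-occurrence frequency, and uses the 0.001 value from the example as a default.

```python
def drop_rare(table, threshold=0.001):
    """Keep only combinations whose frequency is not lower than the threshold."""
    return {combo: freq for combo, freq in table.items() if freq >= threshold}
```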
Step 206, if the words in a combined information item are composed of two or more words that each have the highest word frequency, remove that combined information;
For example, for the pinyin qinghuadaxuebiye, the combined information obtained is "Tsinghua University graduation". However, in the dictionary of the input method system, the first-choice word for the pinyin "qinghua" is "Tsinghua", the first-choice word for the pinyin "daxue" is "university", and the first-choice word for the pinyin "biye" is "graduation". In this case, even if this combined information did not exist, the first-choice result would not be affected, so this combined information can be removed.
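The test in step 206 can be sketched as follows, assuming a hypothetical `first_choice` mapping from each pinyin code to the highest-frequency word for that code in the dictionary.

```python
def is_made_of_first_choices(words, codes, first_choice):
    """True if every word is already the first-choice word for its own code."""
    return all(first_choice.get(code) == word
               for word, code in zip(words, codes))
```

A combination for which this returns True adds nothing to the first-choice result and is therefore a candidate for removal.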
Step 207, if a combined information item is partly or wholly covered by another combined information item, remove it;
For example, for the pinyin wohenkaixin, the combined information obtained is "I am very happy", and for the pinyin henkaixin there is an existing combined information item "very happy". Because the first-choice word for the pinyin "wo" in the dictionary of the input method system is already "I", the combined information "very happy" partly covers the combined information "I am very happy". In this case, even if the combined information "I am very happy" did not exist, the first-choice result would not be affected, so this combined information can be removed. It will be understood that combined information that is a complete duplicate can also be removed.
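One possible reading of the coverage rule in step 207 is sketched here; the source does not define coverage precisely, so the criterion is an assumption. A combination is treated as covered when some shorter run of its words is itself in the table and every word outside that run is the first-choice word for its code. The names `table` (a set of word tuples) and `first_choice` are hypothetical.

```python
def is_covered(words, codes, table, first_choice):
    """True if a shorter table entry plus first-choice words reproduce `words`."""
    n = len(words)
    for i in range(n):
        for j in range(i + 2, n + 1):       # runs of at least two words
            if (i, j) == (0, n):
                continue                     # skip the combination itself
            if tuple(words[i:j]) not in table:
                continue
            # Positions not covered by the run must be first-choice words.
            rest = list(range(0, i)) + list(range(j, n))
            if all(first_choice.get(codes[k]) == words[k] for k in rest):
                return True
    return False
```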
Through steps 205 to 207 above, redundant and invalid information in the combined information can be effectively avoided, which helps lighten the system load, save system space and resources, and improve the effective utilisation of the system.
It should be noted that steps 205 to 207 can be applied individually or in any combination as required; that is, those skilled in the art can adopt a single step or several steps, in any combination and in any order. In addition, those skilled in the art can also set up other presetting rules as required; the present invention does not limit this. For example, another possible selection rule is to remove combined information whose string length is less than or equal to a preset threshold (for example, unintentional user input).
Step 208, generate the multivariate table from the combined information that remains after filtering.
Two, input step:
Step 209, load the multivariate table into a memory device;
Step 210, receive the coded string input by the user, and cut the coded string;
Here, a cutting optimisation unit can also optimise the cutting method applied to the coded string, for example by using the branch and bound method to prune the cutting search.
Step 211, obtain the corresponding combined information in the multivariate table according to the cut coded string, and extract the words of the corresponding collocation relation in the combined information as prepare words.
If the user appends new input to a previously entered coded string, this embodiment can also obtain the corresponding combined information in the multivariate table according to only the newly added coded string, so that the system's processing is limited to the changed part and repeated computation is avoided.
Step 212, calculate the co-occurrence probability from the adjacent co-occurrence frequency of the prepare word and the word frequencies of existing words in the dictionary of the input method system;
Step 213, sort according to the co-occurrence probability, and output the sorted result as candidate items.
As another embodiment, when the multivariate table stores the co-occurrence probability, steps 212 and 213 can be: calculate a weighted value from the co-occurrence probability of the prepare word and the word frequencies of existing words in the dictionary of the input method system; sort according to the weighted value, and output the sorted result as candidate items.
As another embodiment, when the multivariate table stores the connection strength value, steps 212 and 213 can be: calculate a weighted value from the connection strength value of the prepare word and the word frequencies of existing words in the dictionary of the input method system; sort according to the weighted value, and output the sorted result as candidate items.
For parts of the method shown in Fig. 2 that are not described in detail, refer to the corresponding parts earlier in this specification.
With reference to Fig. 3, a structural block diagram of an embodiment of an input method system of the present invention is shown, comprising an input interface unit 301 and a display unit 302; the input method system further comprises:
Multivariate table 303: the multivariate table is generated from combined information of at least two basic words that co-occur adjacently; the combined information is obtained from the preset internet corpus and comprises the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
Cutting unit 304: used to cut the coded string input by the user;
Extraction unit 305: used to obtain the corresponding combined information in the multivariate table according to the cut coded string, and to extract the words of the corresponding collocation relation in the combined information as prepare words.
Preferably, the input method system further comprises a first output unit: used to calculate the co-occurrence probability from the adjacent co-occurrence frequency of the prepare word and the word frequencies of existing words in the dictionary of the input method system, sort according to the co-occurrence probability, and output the sorted result as candidate items.
As another embodiment, when the multivariate table stores the co-occurrence probability, the input method system further comprises a second output unit: used to calculate a weighted value from the co-occurrence probability of the prepare word and the word frequencies of existing words in the dictionary of the input method system, sort according to the weighted value, and output the sorted result as candidate items. Here, the co-occurrence probability is calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequencies of existing words in the dictionary of the input method system.
As another embodiment, when the multivariate table stores the connection strength value, the input method system further comprises a third output unit: used to calculate a weighted value from the connection strength value of the prepare word and the word frequencies of existing words in the dictionary of the input method system, sort according to the weighted value, and output the sorted result as candidate items. Here, the connection strength value is calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words.
Of course, the multivariate table can also store any other numerical value that indicates a connection relation between words; those skilled in the art can choose one as required or from experience, and the present invention does not limit this.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
Because this embodiment uses the internet corpus as the basis for the words output by the input method system, the generated combined information can accurately reflect trends in how people use language and can ensure the accuracy, representativeness and comprehensiveness of the combined information. This improves the first-choice hit rate when the user enters multiple words, phrases, short sentences or long sentences, and in turn effectively improves the user's input efficiency.
Preferably, the input method system can further comprise a loading unit: used to load the multivariate table into a memory device. This memory device can be a client-side memory device or a server-side memory device.
To avoid invalid and repeated computation, effectively save system resources and improve the processing efficiency of the system, the input method system of this embodiment can further comprise the following system optimisation units:
A cutting optimisation unit: used to optimise the cutting method applied to the coded string;
And/or, a newly-added acquisition unit: used to obtain the corresponding combined information in the multivariate table according to the newly added coded string.
The above system optimisation units can be used in any combination as required: those skilled in the art can use only one system optimisation unit or several, in any combination. In addition, those skilled in the art can also set up other system optimisation units as required; the present invention does not limit this.
To make the input method system easier to transmit over a network, reduce the memory it occupies on the user's machine, and improve processing efficiency, this embodiment can also compare the combined information in the multivariate table with the words in the dictionary of the input method system; if the dictionary contains a word that duplicates a combined information item, that word is removed from the input method dictionary. As a result, the generated installation package of the input method system is smaller, which greatly lowers the barrier to use, reduces the storage space occupied on the user's machine, and effectively improves the efficiency of the system.
The input method system shown in Fig. 3 can be an ordinary input method system; in this case, the input interface unit, the display unit and the multivariate table of the input method system are located in the same computing device.
The input method system shown in Fig. 3 can also be a network input method system. In this case, the input interface unit and the display unit are located in a first computing device and the multivariate table is located in a second computing device; according to the information input by the user, the input method system obtains the corresponding combined information from the multivariate table in the second computing device and displays the corresponding words on the first computing device.
Because the system shown in Fig. 3 corresponds to the foregoing method embodiments, its description is relatively brief; for parts not described in detail, refer to the corresponding parts earlier in this specification.
With reference to Fig. 4, a structural block diagram of an embodiment of a device for generating a multivariate table of the present invention is shown, comprising the following modules:
Acquisition module 401: used to obtain, from the preset internet corpus, combined information of at least two basic words that co-occur adjacently;
Wherein, the combined information comprises the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
Generation module 402: used to generate the multivariate table from the combined information.
To avoid invalid and repeated computation, the device of this embodiment preferably further comprises a selection module 403: used to select the basic words that meet a preset condition from the dictionary of the input method system.
Based on one of the core ideas of the multivariate table of the present invention, low-value information is removed from the combined information according to certain rules, and the high-value information that remains becomes part of the multivariate table. Preferably, the device of this embodiment can further comprise: a first removal module 404, used to remove a combined information item when its adjacent co-occurrence frequency is lower than a certain threshold; and/or a second removal module 405, used to remove a combined information item when its words are composed of two or more words that each have the highest word frequency; and/or a third removal module 406, used to remove a combined information item when it is partly or wholly covered by another combined information item. The removal modules 404 to 406 can be used individually or in combination as required; the present invention does not limit this.
So that the generated multivariate table can accurately reflect trends in how people use language and ensure the representativeness and comprehensiveness of the combined information, thereby improving the first-choice hit rate when the user enters multiple words, phrases, short sentences or long sentences, the device of this embodiment preferably further comprises a web page acquisition module 407, used to obtain internet web pages by web crawler technology, and a corpus generation module 408, used to select the web page information that meets a preset condition and save it to form the internet corpus. More preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus. It can also be set, updated and changed arbitrarily by those skilled in the art; the present invention does not limit this.
With reference to Fig. 5, a flow chart of a preferred embodiment of generating a multivariate table with the device shown in Fig. 4 is shown, comprising the following steps:
Step 501, the web page acquisition module obtains internet web pages by web crawler technology;
Step 502, the corpus generation module selects the web page information that meets a preset condition, and saves it to form the internet corpus;
Wherein, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus. It can also be set, updated and changed arbitrarily by those skilled in the art.
Step 503, the selection module selects the basic words that meet a preset condition from the dictionary of the input method system;
For example, select the top 60,000 words in the dictionary of the input method system, ranked by word frequency from high to low.
Step 504, the acquisition module obtains, from the preset internet corpus, combined information of at least two basic words that co-occur adjacently;
Wherein, the combined information comprises the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
Step 505, when the adjacent co-occurrence frequency of a combined information item is lower than a certain threshold, the first removal module removes that combined information;
Step 506, when the words in a combined information item are composed of two or more words that each have the highest word frequency, the second removal module removes that combined information;
Step 507, when a combined information item is partly or wholly covered by another combined information item, the third removal module removes that combined information;
Step 508, the generation module generates the multivariate table from the combined information that remains after filtering.
Because the method shown in Fig. 5 corresponds to the foregoing method and system embodiments, its description is relatively brief; for parts not described in detail, refer to the corresponding parts earlier in this specification.
With reference to Fig. 6, a flow chart of embodiment 1 of updating an input method system according to the present invention is shown, comprising the following steps:
Step 601, update the internet corpus;
Those skilled in the art can choose any algorithm for updating the internet corpus as required or from experience; this embodiment does not limit this.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus. It can also be set, updated and changed arbitrarily by those skilled in the art.
Step 602, obtain, from the preset internet corpus, combined information of at least two basic words that co-occur adjacently;
Wherein, the combined information comprises the collocation relation between the at least two basic words and their adjacent co-occurrence frequency.
Step 603, generate the multivariate table from the combined information;
Step 604, send the multivariate table to the input method system.
With reference to Fig. 7, a flow chart of embodiment 2 of updating an input method system according to the present invention is shown, comprising the following steps:
Step 701, update the internet corpus;
Wherein, the internet corpus can be an internet blog corpus, an internet news corpus and/or an internet forum corpus. It can also be set, updated and changed arbitrarily by those skilled in the art.
Step 702, select the basic words that meet a preset condition from the dictionary of the input method system;
For example, select the top 60,000 words in the dictionary of the input method system, ranked by word frequency from high to low.
Step 703, obtain, from the internet corpus, combined information of at least two basic words that co-occur adjacently;
Wherein, the combined information comprises the collocation relation between the at least two basic words and their adjacent co-occurrence frequency.
Step 704, if the adjacent co-occurrence frequency of a combined information item is lower than a certain threshold, remove that combined information;
Step 705, if the words in a combined information item are composed of two or more words that each have the highest word frequency, remove that combined information;
Step 706, if a combined information item is partly or wholly covered by another combined information item, remove it;
Step 707, generate the multivariate table from the combined information that remains after filtering;
Step 708, send the multivariate table to the input method system.
As another embodiment, steps 704 to 706 can be applied individually or in combination as required; the present invention does not limit this.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related description elsewhere. Several embodiments of the present invention have been enumerated above; those skilled in the art can combine and select among them as appropriate to obtain the full technical effect of the present invention. Any combination based on the above embodiments is an embodiment of the present invention, but for reasons of space they are not described one by one in this specification.
Because the methods shown in Fig. 6 and Fig. 7 correspond to the foregoing method and system embodiments, their description is relatively brief; for parts not described in detail, refer to the corresponding parts earlier in this specification.
The method of intelligent word input, the input method system, the device for generating a multivariate table, and the method of updating an input method system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core ideas. At the same time, those of ordinary skill in the art may, following the ideas of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (28)

1. A method for intelligent word-combination input, characterized by comprising:
obtaining, from a preset Internet corpus, combined information on the adjacent co-occurrence of at least two basic words, the combined information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
generating a multivariate table from the combined information;
receiving a coded character string input by a user, and segmenting the coded character string;
obtaining the corresponding combined information from the multivariate table according to the segmented coded character string, and extracting the words corresponding to the collocation relation in that combined information as pre-selected words.
2. The method of claim 1, characterized by further comprising:
calculating a co-occurrence probability from the adjacent co-occurrence frequency of the pre-selected words and the word frequency of existing words in the dictionary of the input method system, sorting according to the co-occurrence probability, and outputting the sorted result as candidate items.
3. The method of claim 1, characterized in that the multivariate table stores a co-occurrence probability, the co-occurrence probability being calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequency of existing words in the dictionary of the input method system, and the method further comprises:
calculating a weighted value from the co-occurrence probability of the pre-selected words and the word frequency of existing words in the dictionary of the input method system, sorting according to the weighted value, and outputting the sorted result as candidate items.
4. The method of claim 1, characterized in that the multivariate table stores a connection strength value, the connection strength value being calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words, and the method further comprises:
calculating a weighted value from the connection strength value of the pre-selected words and the word frequency of existing words in the dictionary of the input method system, sorting according to the weighted value, and outputting the sorted result as candidate items.
5. The method of claim 1, characterized by further comprising:
selecting, from the dictionary of the input method system, basic words that meet a preset condition.
6. The method of any one of the preceding claims, characterized by further comprising, before the multivariate table is generated:
if the adjacent co-occurrence frequency in a piece of combined information is below a certain threshold, removing that combined information;
if the corresponding words in a piece of combined information consist of two or more of the words with the highest word frequency, removing that combined information;
if a piece of combined information is partly or wholly covered by another piece of combined information, removing that combined information.
7. The method of any one of the preceding claims, characterized in that the Internet corpus is preset by the following steps:
obtaining Internet web pages by means of web crawler technology;
selecting web page information that meets a preset condition, and saving it to form the Internet corpus.
8. The method of claim 7, characterized in that the Internet corpus is an Internet blog corpus, an Internet news corpus, and/or an Internet forum corpus.
9. The method of claim 1, characterized by further comprising, before the coded character string input by the user is received, the step of loading the multivariate table into a memory device.
10. The method of claim 1, characterized by further comprising:
optimizing the method used to segment the coded character string.
11. The method of claim 1 or 10, characterized by further comprising:
obtaining the corresponding combined information from the multivariate table according to a coded character string newly added by the user.
12. An input method system comprising an input interface unit and a display unit, characterized in that the input method system further comprises:
a multivariate table, generated from combined information on the adjacent co-occurrence of at least two basic words, the combined information being obtained from a preset Internet corpus and comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
a segmentation unit, configured to segment the coded character string input by the user;
an extraction unit, configured to obtain the corresponding combined information from the multivariate table according to the segmented coded character string, and to extract the words corresponding to the collocation relation in that combined information as pre-selected words.
13. The system of claim 12, characterized in that the input method system further comprises:
a first output unit, configured to calculate a co-occurrence probability from the adjacent co-occurrence frequency of the pre-selected words and the word frequency of existing words in the dictionary of the input method system, to sort according to the co-occurrence probability, and to output the sorted result as candidate items.
14. The system of claim 12, characterized in that the multivariate table stores a co-occurrence probability, the co-occurrence probability being calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequency of existing words in the dictionary of the input method system, and the input method system further comprises:
a second output unit, configured to calculate a weighted value from the co-occurrence probability of the pre-selected words and the word frequency of existing words in the dictionary of the input method system, to sort according to the weighted value, and to output the sorted result as candidate items.
15. The system of claim 12, characterized in that the multivariate table stores a connection strength value, the connection strength value being calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words, and the input method system further comprises:
a third output unit, configured to calculate a weighted value from the connection strength value of the pre-selected words and the word frequency of existing words in the dictionary of the input method system, to sort according to the weighted value, and to output the sorted result as candidate items.
16. The system of claim 12, characterized in that the Internet corpus is an Internet blog corpus, an Internet news corpus, and/or an Internet forum corpus.
17. The system of claim 12, characterized in that the input method system further comprises:
a loading unit, configured to load the multivariate table into a memory device.
18. The system of claim 12, characterized in that the input method system further comprises:
a segmentation optimization unit, configured to optimize the method used to segment the coded character string.
19. The system of claim 12 or 18, characterized in that the input method system further comprises:
a new-input acquiring unit, configured to obtain the corresponding combined information from the multivariate table according to a coded character string newly added by the user.
20. The system of claim 12, characterized in that the input interface unit, the display unit, and the multivariate table of the input method system are located in the same computing device;
or, the input interface unit and the display unit of the input method system are located in a first computing device and the multivariate table is located in a second computing device, and the input method system obtains the corresponding combined information from the multivariate table in the second computing device according to the information input by the user and displays the corresponding words on the first computing device.
21. A device for generating a multivariate table, characterized by comprising:
an acquisition module, configured to obtain, from a preset Internet corpus, combined information on the adjacent co-occurrence of at least two basic words, the combined information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
a generation module, configured to generate the multivariate table from the combined information.
22. The device of claim 21, characterized by further comprising:
a selection module, configured to select, from the dictionary of the input method system, basic words that meet a preset condition.
23. The device of claim 21 or 22, characterized by further comprising:
a first removal module, configured to remove a piece of combined information when its adjacent co-occurrence frequency is below a certain threshold;
and/or a second removal module, configured to remove a piece of combined information when its corresponding words consist of two or more of the words with the highest word frequency;
and/or a third removal module, configured to remove a piece of combined information when it is partly or wholly covered by another piece of combined information.
24. The device of claim 21 or 22, characterized by further comprising:
a web page acquisition module, configured to obtain Internet web pages by means of web crawler technology;
a corpus generation module, configured to select web page information that meets a preset condition and to save it to form the Internet corpus.
25. The device of claim 24, characterized in that the Internet corpus is an Internet blog corpus, an Internet news corpus, and/or an Internet forum corpus.
26. A method for updating an input method system, characterized by comprising:
updating the Internet corpus;
obtaining, from the preset Internet corpus, combined information on the adjacent co-occurrence of at least two basic words, the combined information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
generating a multivariate table from the combined information;
sending the multivariate table to the input method system.
27. The method of claim 26, characterized by further comprising:
selecting, from the dictionary of the input method system, basic words that meet a preset condition.
28. The method of claim 26 or 27, characterized by further comprising, before the multivariate table is generated:
if the adjacent co-occurrence frequency in a piece of combined information is below a certain threshold, removing that combined information;
if the corresponding words in a piece of combined information consist of two or more of the words with the highest word frequency, removing that combined information;
if a piece of combined information is partly or wholly covered by another piece of combined information, removing that combined information.
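
The three removal rules recited in the claims can be read as a single filtering pass, sketched below in Python. The sketch rests on interpretations the claims do not spell out, so treat each as an assumption: "a certain threshold" becomes the hypothetical parameter `min_cooccurrence`; "the words with the highest word frequency" is read as the `top_k` most frequent dictionary words; and "covered by another piece of combined information" is modeled only for the fully covered case, as being a contiguous sub-sequence of a longer entry (partial coverage is not modeled). The function names and sample figures are invented for the example.

```python
def is_contiguous_subsequence(short, long):
    """True if the tuple `short` appears as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def filter_combined_info(combined_info, word_freq, min_cooccurrence=5, top_k=2):
    """Apply the three removal rules to {word_tuple: co-occurrence frequency}.

    word_freq maps each dictionary word to its word frequency.
    """
    # Assumed reading of "highest word frequency": the top_k dictionary words.
    top_words = set(sorted(word_freq, key=word_freq.get, reverse=True)[:top_k])
    kept = {}
    for words, freq in combined_info.items():
        # Rule 1: adjacent co-occurrence frequency below the threshold.
        if freq < min_cooccurrence:
            continue
        # Rule 2: made up entirely of the highest-frequency words.
        if all(word in top_words for word in words):
            continue
        # Rule 3: wholly covered by a longer entry of the original input.
        if any(
            len(other) > len(words) and is_contiguous_subsequence(words, other)
            for other in combined_info
        ):
            continue
        kept[words] = freq
    return kept
```

Because the rules are independent checks, any one of them can be disabled (for example by setting `min_cooccurrence=0` or `top_k=0`) to model applying the rules separately or in combination.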
CNB2007100792674A 2007-02-13 2007-02-13 Intelligent word input method and input method system and updating method thereof Expired - Fee Related CN100458795C (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CNB2007100792674A CN100458795C (en) 2007-02-13 2007-02-13 Intelligent word input method and input method system and updating method thereof
PCT/CN2008/070270 WO2008098507A1 (en) 2007-02-13 2008-02-03 An input method of combining words intelligently, input method system and renewing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007100792674A CN100458795C (en) 2007-02-13 2007-02-13 Intelligent word input method and input method system and updating method thereof

Publications (2)

Publication Number Publication Date
CN101013443A true CN101013443A (en) 2007-08-08
CN100458795C CN100458795C (en) 2009-02-04

Family

ID=38700955

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100792674A Expired - Fee Related CN100458795C (en) 2007-02-13 2007-02-13 Intelligent word input method and input method system and updating method thereof

Country Status (2)

Country Link
CN (1) CN100458795C (en)
WO (1) WO2008098507A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101895631A (en) * 2010-07-09 2010-11-24 深圳市五巨科技有限公司 Method, device and system for intelligently switching input method by mobile terminal
CN102508554A (en) * 2011-10-02 2012-06-20 上海量明科技发展有限公司 Input method with communication association, personal repertoire and system
CN102495679A (en) * 2011-12-01 2012-06-13 上海量明科技发展有限公司 Composite spelling input method, word bank and system thereof
CN104281274A (en) * 2014-09-03 2015-01-14 深圳市金立通信设备有限公司 Input method
CN107122060A (en) * 2017-03-15 2017-09-01 韦柳志 A kind of method that candidate item is handled in input method
CN109213988B (en) * 2017-06-29 2022-06-21 武汉斗鱼网络科技有限公司 Barrage theme extraction method, medium, equipment and system based on N-gram model
CN112199031B (en) * 2020-10-15 2022-08-05 科大讯飞股份有限公司 Input method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1226717C (en) * 2000-08-30 2005-11-09 国际商业机器公司 Automatic new term fetch method and system
US7478033B2 (en) * 2004-03-16 2009-01-13 Google Inc. Systems and methods for translating Chinese pinyin to Chinese characters
CN100401301C (en) * 2006-05-30 2008-07-09 南京大学 Body learning based intelligent subject-type network reptile system configuration method

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101556596B (en) * 2007-08-31 2012-04-18 北京搜狗科技发展有限公司 Input method system and intelligent word making method
CN101661463B (en) * 2009-09-18 2011-04-06 杨盛 Automatic collating method in character input process
CN102193639A (en) * 2010-03-04 2011-09-21 阿里巴巴集团控股有限公司 Method and device of statement generation
CN101882128A (en) * 2010-06-11 2010-11-10 宇龙计算机通信科技(深圳)有限公司 Method for generating commonly used terms of information and mobile terminal
CN102402298A (en) * 2010-09-16 2012-04-04 腾讯科技(深圳)有限公司 Pinyin input method and user word adding method and system of same
CN102455786A (en) * 2010-10-25 2012-05-16 三星电子(中国)研发中心 System and method for optimizing Chinese sentence input method
CN102455786B (en) * 2010-10-25 2014-09-03 三星电子(中国)研发中心 System and method for optimizing Chinese sentence input method
CN102541278A (en) * 2010-12-25 2012-07-04 上海量明科技发展有限公司 Method and system for character selection in word input interface
CN102567365A (en) * 2010-12-26 2012-07-11 上海量明科技发展有限公司 Input method and input system based on labeling specific to a keyword
CN102567365B (en) * 2010-12-26 2016-07-06 上海量明科技发展有限公司 A kind of it is directed to input method and the system that key word is labeled
CN102566775A (en) * 2010-12-31 2012-07-11 上海量明科技发展有限公司 Input method and system for generating character interval
CN103473036A (en) * 2012-06-08 2013-12-25 深圳市世纪光速信息技术有限公司 Input method skin push method and system
CN103473036B (en) * 2012-06-08 2018-04-27 深圳市世纪光速信息技术有限公司 A kind of input method skin method for pushing and system
CN102945086A (en) * 2012-11-22 2013-02-27 黑龙江大学 Super input action capture record system and capture record method
CN102945086B (en) * 2012-11-22 2015-09-09 黑龙江大学 Super input behavior is grabbed recording system and is grabbed recording method
CN103024159B (en) * 2012-11-28 2015-01-21 东莞宇龙通信科技有限公司 Information generation method and information generation system
CN103024159A (en) * 2012-11-28 2013-04-03 东莞宇龙通信科技有限公司 Information generation method and information generation system
CN103869999B (en) * 2012-12-11 2018-10-16 百度国际科技(深圳)有限公司 The method and device that candidate item caused by input method is ranked up
CN103869999A (en) * 2012-12-11 2014-06-18 百度国际科技(深圳)有限公司 Method and device for sorting candidate items generated by input method
CN103076892A (en) * 2012-12-31 2013-05-01 百度在线网络技术(北京)有限公司 Method and equipment for providing input candidate items corresponding to input character string
US20150293972A1 (en) * 2012-12-31 2015-10-15 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device used for providing input candidate items corresponding to an input character string
CN103064967B (en) * 2012-12-31 2018-10-12 百度在线网络技术(北京)有限公司 A kind of method and apparatus for establishing user's binary crelation library
CN103064967A (en) * 2012-12-31 2013-04-24 百度在线网络技术(北京)有限公司 Method and device used for establishing user binary relation bases
CN103076892B (en) * 2012-12-31 2016-09-28 百度在线网络技术(北京)有限公司 A kind of method and apparatus of the input candidate item for providing corresponding to input character string
CN103929448A (en) * 2013-01-14 2014-07-16 百度国际科技(深圳)有限公司 Method, system and device for providing cell word stock in cloud server
CN103929448B (en) * 2013-01-14 2018-06-05 百度国际科技(深圳)有限公司 Server provides the method, system and device of cell dictionary beyond the clouds
CN105095191A (en) * 2014-04-22 2015-11-25 富士通株式会社 Method and device for assisted translation based on multi-word units
CN103927299A (en) * 2014-04-25 2014-07-16 百度在线网络技术(北京)有限公司 Method for providing candidate sentences in input method and method and device for recommending input content
CN104360759B (en) * 2014-11-21 2017-03-08 百度在线网络技术(北京)有限公司 Candidate word sort method, device and character input method, equipment
CN104360759A (en) * 2014-11-21 2015-02-18 百度在线网络技术(北京)有限公司 Candidate character sequencing method and device as well as character input method and equipment
CN106445177A (en) * 2015-08-06 2017-02-22 阿尔派株式会社 Character input device and character input method
CN106445177B (en) * 2015-08-06 2020-06-30 阿尔派株式会社 Character input device and character input method
CN105607753B (en) * 2015-12-15 2018-03-30 上海嵩恒网络科技有限公司 The long sentence input method and long sentence input system of a kind of five
CN105607753A (en) * 2015-12-15 2016-05-25 上海嵩恒网络科技有限公司 Long sentence input method and long sentence input system for five strokes
CN107340881A (en) * 2016-05-03 2017-11-10 北京搜狗科技发展有限公司 A kind of input method and electronic equipment
CN107340881B (en) * 2016-05-03 2021-11-30 北京搜狗科技发展有限公司 Input method and electronic equipment
CN107422872A (en) * 2016-05-24 2017-12-01 北京搜狗科技发展有限公司 A kind of input method, device and the device for input
CN107422872B (en) * 2016-05-24 2021-11-30 北京搜狗科技发展有限公司 Input method, input device and input device
CN107688398A (en) * 2016-08-03 2018-02-13 中国科学院计算技术研究所 Determine the method and apparatus and input reminding method and device of candidate's input
CN107688398B (en) * 2016-08-03 2019-09-17 中国科学院计算技术研究所 It determines the method and apparatus of candidate input and inputs reminding method and device
CN108073293B (en) * 2016-11-11 2022-01-14 北京搜狗科技发展有限公司 Method and device for determining target phrase
CN108073293A (en) * 2016-11-11 2018-05-25 北京搜狗科技发展有限公司 A kind of definite method and apparatus of target phrase
CN108073292A (en) * 2016-11-11 2018-05-25 北京搜狗科技发展有限公司 A kind of intelligent word method and apparatus, a kind of device for intelligent word
CN106557178A (en) * 2016-11-29 2017-04-05 百度国际科技(深圳)有限公司 For updating the method and device of input method entry
CN106557178B (en) * 2016-11-29 2021-03-09 百度国际科技(深圳)有限公司 Method and device for updating entries of input method
CN108241438B (en) * 2016-12-23 2022-02-25 北京搜狗科技发展有限公司 Input method, input device and input device
CN108241438A (en) * 2016-12-23 2018-07-03 北京搜狗科技发展有限公司 A kind of input method, device and the device for input
CN108628906A (en) * 2017-03-24 2018-10-09 北京京东尚科信息技术有限公司 Short text template method for digging, device, electronic equipment and readable storage medium storing program for executing
CN108803890B (en) * 2017-04-28 2024-02-06 北京搜狗科技发展有限公司 Input method, input device and input device
CN108803890A (en) * 2017-04-28 2018-11-13 北京搜狗科技发展有限公司 A kind of input method, input unit and the device for input
CN109144284A (en) * 2017-06-15 2019-01-04 百度在线网络技术(北京)有限公司 information display method and device
CN109144284B (en) * 2017-06-15 2022-07-15 百度在线网络技术(北京)有限公司 Information display method and device
CN109426358A (en) * 2017-09-01 2019-03-05 百度在线网络技术(北京)有限公司 Data inputting method and device
CN109542243A (en) * 2017-09-21 2019-03-29 北京搜狗科技发展有限公司 Phrase composing method and device, for the device of group word
CN109917927A (en) * 2017-12-13 2019-06-21 北京搜狗科技发展有限公司 A kind of candidate item determines method and apparatus
CN109961791B (en) * 2017-12-22 2021-10-22 北京搜狗科技发展有限公司 Voice information processing method and device and electronic equipment
CN109961791A (en) * 2017-12-22 2019-07-02 北京搜狗科技发展有限公司 A kind of voice information processing method, device and electronic equipment
CN110781288A (en) * 2019-10-30 2020-02-11 安阳师范学院 Method and device for composing words by Chinese characters
CN111913591A (en) * 2020-06-23 2020-11-10 杭州电子科技大学 Reply phrase generation method, pinyin input method and intelligent terminal
CN111913591B (en) * 2020-06-23 2023-10-20 杭州电子科技大学 Reply phrase generation method, pinyin input method and intelligent terminal

Also Published As

Publication number Publication date
CN100458795C (en) 2009-02-04
WO2008098507A1 (en) 2008-08-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090204
