CN101013443A - Intelligent word input method and input method system and updating method thereof - Google Patents
Intelligent word input method and input method system and updating method thereof
- Publication number
- CN101013443A; CNA2007100792674A; CN200710079267A
- Authority
- CN
- China
- Prior art keywords
- combined information
- words
- input method
- internet
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/02—Input arrangements using manually operated switches, e.g. using keyboards or dials
- G06F3/023—Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
- G06F3/0233—Character input methods
- G06F3/0237—Character input methods using prediction or retrieval techniques
Abstract
The invention discloses an intelligent word input method for an input method system, comprising: obtaining, from a preset internet corpus, combination information on the adjacent co-occurrence of at least two basic words, the combination information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; generating a multivariate table from the combination information; receiving an encoded character string input by the user and segmenting it; and obtaining the corresponding combination information from the multivariate table according to the segmented string, and extracting the words in the corresponding collocation relation as candidate words. The invention can improve the first-choice hit rate when the user inputs multiple words, phrases, short sentences or long sentences, and avoids invalid and repeated computation, thereby improving the user's input efficiency.
Description
Technical field
The present invention relates to the field of data processing for input method systems, and in particular to an intelligent word input method, an input method system, a device for generating a multivariate table, and a method for updating an input method system.
Background technology
Current input method systems (for Chinese, Japanese, Korean and so on) inevitably face the problem that one code corresponds to multiple candidate words. Take pinyin input methods such as Pinyin JiaJia and Ziguang Huayu Pinyin as examples: these existing input methods all order the candidate words offered to the user according to their dictionary and the word frequency (the usage frequency of a word) recorded in it, and display the most frequent common word first as the first-choice word. The ordering of candidate words is an important indicator of the first-choice hit rate during input. The first-choice hit rate measures whether, after the user types certain key information, the top-ranked character, word or sentence is the one the user wants. For example, when the pinyin "guan xi tui li" (relation inference) is input, an existing input method retrieves all candidate words in the dictionary for "guan xi", such as "relation", "washing" and "Northwest", and displays the most frequent one, "relation", as the first-choice word. Likewise, it takes the most frequent dictionary word for "tui li", "inference", as the first-choice word, and offers "relation inference" to the user. In this example the first-choice hit rate is 100%, that is, the result fully meets the user's need.
Of course, technically, the input method system itself cannot know which word the user wants. However, across the vast Chinese vocabulary, words differ in how often they are used and appear. Ranking higher-frequency words first greatly improves the first-choice hit rate of the input method system, that is, it raises the probability that the top-ranked words satisfy the user.
However, the word the user needs is not always the highest-frequency one. For example, the user types "zi zhu xue xiao" (subsidize school), but the input method returns the highest-frequency match, "autonomous school"; the user must then select "subsidize" from all the candidate words to obtain the desired result. In practice, users of existing input methods obtain the desired result by selecting among candidate words far more often than they obtain a valid first-choice word directly. This shows that the first-choice hit rate of existing input methods is low, which slows the user's input, reduces input efficiency, and degrades the user experience.
To address these problems, the prior art proposes the following two solutions:
First, adding words to the input method's dictionary.
In this case, enough words must be added to the dictionary before the approach has any effect. For example, if the user wants to input "intelligent word formation", the dictionary must store the three words "intelligent", "word formation" and "intelligent word formation", and may even have to store the fragment "intelligent group", which has no concrete meaning. For phrases or sentences made up of several words, even more entries must be added. The dictionary therefore grows ever more bloated, occupying more space and wasting more resources.
Second, applying NLP (natural language processing) technology.
Applying this technology in an input method system can improve the first-choice hit rate through part-of-speech information, syntactic analysis and similar means. For example, the Microsoft Pinyin input method uses an NLP technique that combines an N-gram statistical language model with linguistic rules to jointly guide the conversion of the pinyin stream into a character stream. It mainly follows the grammatical and semantic classification systems of the "Modern Chinese Grammatical Information Dictionary: Detailed Explanation" and the "Synonym Forest" (Tongyici Cilin), summarizes the grammatical and semantic relations among the parts of speech, and embodies them in manually edited collocation rules between those parts of speech and corresponding attribute word sets.
However, implementing such an input method system requires those skilled in the art to analyze and edit on the basis of a fixed corpus, so the implementation is complex and tedious. The fixed corpus cannot be updated at will, which degrades the user experience. In addition, such an input method system takes up considerable space: the installation package of the Microsoft Pinyin input method, for example, exceeds 70 MB, which raises the barrier to use and wastes the user's system resources.
Therefore, the technical problem that those skilled in the art urgently need to solve is how to improve an input method system's first-choice hit rate for multiple words, phrases, short sentences or long sentences while saving as many resources as possible.
Summary of the invention
The technical problem to be solved by the present invention is to provide an intelligent word input method and an input method system, so as to solve the prior-art problems of a low first-choice hit rate for multiple words, phrases, short sentences or long sentences and of excessive resource usage.
Another object of the present invention is to provide a method for generating a multivariate table and a method for updating an input method system, so as to ensure the accuracy, representativeness and comprehensiveness of the output words, improve the first-choice hit rate when the user inputs multiple words, phrases, short sentences or long sentences, and thereby effectively improve the user's input efficiency.
To solve the above technical problem, an embodiment of the invention discloses an intelligent word input method, comprising:
obtaining, from a preset internet corpus, combination information on the adjacent co-occurrence of at least two basic words, the combination information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; generating a multivariate table from the combination information; receiving an encoded character string input by the user and segmenting the encoded character string; and obtaining the corresponding combination information from the multivariate table according to the segmented string, and extracting the words in the corresponding collocation relation in that combination information as candidate words.
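The lookup step above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `lookup_candidates`, the toy `dictionary` (code to words) and the toy `table` (word tuple to connection parameter) are all assumptions made for the example, and the codes are assumed to be already segmented.

```python
from itertools import product


def lookup_candidates(codes, dictionary, table):
    """Return the word combinations for `codes` that appear in the table.

    codes:      segmented codes, e.g. ["zizhu", "xuexiao"] (hypothetical format)
    dictionary: maps each code to the words it can stand for
    table:      maps a tuple of adjacent words to a connection parameter
    """
    options = [dictionary.get(code, []) for code in codes]
    # Try every combination of words and keep those with a collocation entry.
    return [combo for combo in product(*options) if combo in table]


dictionary = {"zizhu": ["autonomous", "subsidize"], "xuexiao": ["school"]}
table = {("subsidize", "school"): 12.0}
candidates = lookup_candidates(["zizhu", "xuexiao"], dictionary, table)
```

With this toy data, the combination found in the table is offered even though "autonomous" might have the higher single-word frequency.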
Preferably, the method further comprises: calculating a co-occurrence probability from the adjacent co-occurrence frequency of the candidate words and the word frequencies of the existing words in the dictionary of the input method system, sorting according to the co-occurrence probability, and outputting the sorted result as candidate items.
Preferably, the multivariate table stores a co-occurrence probability, which is calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequencies of the existing words in the dictionary of the input method system. The method further comprises: calculating a weight from the co-occurrence probability of the candidate words and the word frequencies of the existing words in the dictionary of the input method system, sorting according to the weight, and outputting the sorted result as candidate items.
Preferably, the multivariate table stores a connection strength value, which is calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words. The method further comprises: calculating a weight from the connection strength value of the candidate words and the word frequencies of the existing words in the dictionary of the input method system, sorting according to the weight, and outputting the sorted result as candidate items.
Preferably, the method further comprises: selecting, from the dictionary of the input method system, basic words that meet a preset condition.
Preferably, before generating the multivariate table, the method further comprises:
if the adjacent co-occurrence frequency in a piece of combination information is below a certain threshold, removing that combination information;
if the words in a piece of combination information consist of two or more highest-frequency words, removing that combination information;
if a piece of combination information is partly or wholly covered by another piece of combination information, removing that combination information.
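The three removal rules above can be sketched as a single filter. This is only one possible reading, under stated assumptions: `top_words` is a hypothetical set of words that are already the highest-frequency choice for their code, "covered" is interpreted here as being a contiguous sub-sequence of a longer retained entry, and `min_freq` is an arbitrary illustrative threshold.

```python
def filter_combinations(pairs, top_words, min_freq=5):
    """pairs maps a tuple of words to its adjacent co-occurrence frequency."""
    kept = {}
    for words, freq in pairs.items():
        if freq < min_freq:
            continue  # rule 1: co-occurrence frequency below the threshold
        if all(word in top_words for word in words):
            continue  # rule 2: every word is already a highest-frequency word
        kept[words] = freq

    def covered(short, long_):
        # Assumed meaning of "covered": short is a contiguous run inside long_.
        n = len(short)
        return len(long_) > n and any(
            long_[i:i + n] == short for i in range(len(long_) - n + 1)
        )

    # rule 3: drop entries covered by another retained entry
    return {w: f for w, f in kept.items() if not any(covered(w, o) for o in kept)}


pairs = {
    ("a", "b"): 10,
    ("a", "b", "c"): 8,
    ("x", "y"): 2,
    ("t1", "t2"): 50,
    ("p", "q"): 9,
}
filtered = filter_combinations(pairs, top_words={"t1", "t2"})
```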
Preferably, the internet corpus is preset by the following steps: obtaining internet web pages by web crawler technology; and selecting the web page information that meets a preset condition and saving it to form the internet corpus.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
Preferably, before receiving the encoded character string input by the user, the method further comprises the step of loading the multivariate table into a storage device.
Preferably, the method further comprises: optimizing the segmentation method applied to the encoded character string.
Preferably, the method further comprises: obtaining the corresponding combination information from the multivariate table according to the encoded character string newly added by the user.
An embodiment of the invention further provides an input method system comprising an input interface unit and a display unit, and further comprising: a multivariate table, generated from combination information on the adjacent co-occurrence of at least two basic words, the combination information being obtained from a preset internet corpus and comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; a segmentation unit, configured to segment the encoded character string input by the user; and an extraction unit, configured to obtain the corresponding combination information from the multivariate table according to the segmented string and to extract the words in the corresponding collocation relation in that combination information as candidate words.
Preferably, the input method system further comprises a first output unit, configured to calculate a co-occurrence probability from the adjacent co-occurrence frequency of the candidate words and the word frequencies of the existing words in the dictionary of the input method system, to sort according to the co-occurrence probability, and to output the sorted result as candidate items.
Preferably, the multivariate table stores a co-occurrence probability, which is calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequencies of the existing words in the dictionary of the input method system. The input method system further comprises a second output unit, configured to calculate a weight from the co-occurrence probability of the candidate words and the word frequencies of the existing words in the dictionary of the input method system, to sort according to the weight, and to output the sorted result as candidate items.
Preferably, the multivariate table stores a connection strength value, which is calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words. The input method system further comprises a third output unit, configured to calculate a weight from the connection strength value of the candidate words and the word frequencies of the existing words in the dictionary of the input method system, to sort according to the weight, and to output the sorted result as candidate items.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
Preferably, the input method system further comprises a loading unit, configured to load the multivariate table into a storage device.
Preferably, the input method system further comprises a segmentation optimization unit, configured to optimize the segmentation method applied to the encoded character string.
Preferably, the input method system further comprises an incremental acquisition unit, configured to obtain the corresponding combination information from the multivariate table according to the encoded character string newly added by the user.
Preferably, the input interface unit, the display unit and the multivariate table of the input method system are located in the same computing device. Alternatively, the input interface unit and the display unit are located in a first computing device and the multivariate table is located in a second computing device; according to the information input by the user, the input method system obtains the corresponding combination information from the multivariate table in the second computing device and displays the corresponding words on the first computing device.
An embodiment of the invention further provides a device for generating a multivariate table, comprising: an acquisition module, configured to obtain, from a preset internet corpus, combination information on the adjacent co-occurrence of at least two basic words, the combination information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; and a generation module, configured to generate the multivariate table from the combination information.
Preferably, the device further comprises a selection module, configured to select, from the dictionary of the input method system, basic words that meet a preset condition.
Preferably, the device further comprises a first removal module, configured to remove a piece of combination information when its adjacent co-occurrence frequency is below a certain threshold;
and/or a second removal module, configured to remove a piece of combination information when its words consist of two or more highest-frequency words;
and/or a third removal module, configured to remove a piece of combination information when it is partly or wholly covered by another piece of combination information.
Preferably, the device further comprises: a web page acquisition module, configured to obtain internet web pages by web crawler technology; and a corpus generation module, configured to select the web page information that meets a preset condition and save it to form the internet corpus.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
An embodiment of the invention further provides a method for updating an input method system, comprising: updating the internet corpus; obtaining, from the preset internet corpus, combination information on the adjacent co-occurrence of at least two basic words, the combination information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency; generating a multivariate table from the combination information; and sending the multivariate table to the input method system.
Preferably, the method further comprises: selecting, from the dictionary of the input method system, basic words that meet a preset condition.
Preferably, before generating the multivariate table, the method further comprises: if the adjacent co-occurrence frequency in a piece of combination information is below a certain threshold, removing that combination information; if the words in a piece of combination information consist of two or more highest-frequency words, removing that combination information; and if a piece of combination information is partly or wholly covered by another piece of combination information, removing that combination information.
Compared with the prior art, the present invention has the following advantages:
First, because the invention uses a preset internet corpus as the basis for the words output by the input method system, it can accurately reflect trends in how people use language and can ensure the accuracy, representativeness and comprehensiveness of the combination information. This improves the first-choice hit rate when the user inputs multiple words, phrases, short sentences or long sentences, and thereby effectively improves the user's input efficiency.
Second, the invention uses the generated multivariate table as the main channel for outputting words. The technique is simple to implement, involves no special or obscure algorithm, and effectively avoids invalid and repeated computation, which helps save resources and raise efficiency.
Third, the internet corpus of the invention can be set, updated or replaced at will by those skilled in the art, so that different versions of intelligent word input can be obtained to satisfy the different needs of various users.
In addition, the invention selects valid combination information for the multivariate table by means of filtering rules, which avoids redundancy in the multivariate table and effectively saves system resources.
Finally, the invention also applies several optimization strategies to avoid invalid and repeated computation and to lighten the load on the system, thereby effectively improving the user's input efficiency.
Description of drawings
Fig. 1 is a flowchart of an embodiment of the intelligent word input method of the invention in an input method system;
Fig. 2 is a flowchart of a preferred embodiment of an intelligent word input method of the invention;
Fig. 3 is a structural block diagram of an embodiment of an input method system of the invention;
Fig. 4 is a structural block diagram of an embodiment of a device for generating a multivariate table of the invention;
Fig. 5 is a flowchart of a preferred embodiment of generating a multivariate table using the device shown in Fig. 4;
Fig. 6 is a flowchart of embodiment 1 of updating an input method system according to the invention;
Fig. 7 is a flowchart of embodiment 2 of updating an input method system according to the invention.
Embodiment
To make the above objects, features and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flowchart of an embodiment of the intelligent word input method of the invention in an input method system is shown. As set out in the summary above, the method comprises: obtaining, from a preset internet corpus, combination information on the adjacent co-occurrence of at least two basic words; generating a multivariate table from the combination information; receiving an encoded character string input by the user and segmenting it; and obtaining the corresponding combination information from the multivariate table according to the segmented string and extracting the words in the corresponding collocation relation as candidate words.
The combination information comprises the collocation relation between the at least two basic words and their adjacent co-occurrence frequency.
As the pace of society quickens and cultures continually collide and merge, modern society uses many words that an existing fixed corpus cannot cover. With the spread of the internet and the rapid expansion of information, this problem is increasingly prominent. Because a fixed corpus is small, fixed in content, compiled early and updated slowly, the word frequencies derived from it do not reflect actual usage on the internet. For example, common internet words such as "top", "network game" and "financial report" are used very frequently, yet in the prior art they are generally ranked low, which does not match users' need to type them often.
The present embodiment therefore obtains the combination information on the adjacent co-occurrence of at least two basic words from a preset internet corpus; that is, open internet information that changes in real time serves as the statistical source of the multivariate table. When the user inputs information, words that are used frequently on the internet can become the first-choice word or a first-page candidate word, which improves the user's input speed and efficiency.
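Gathering adjacent co-occurrence frequencies from a corpus can be illustrated with a short sketch. It assumes the corpus has already been split into sentences and each sentence into words; the function name `count_adjacent_pairs` and the toy data are hypothetical, and only two-word combinations are counted here.

```python
from collections import Counter


def count_adjacent_pairs(sentences, basic_words):
    """Count how often two basic words appear next to each other.

    sentences:   iterable of token lists (corpus assumed already segmented)
    basic_words: set of words eligible for the multivariate table
    """
    counts = Counter()
    for tokens in sentences:
        # zip pairs every token with its right-hand neighbour.
        for left, right in zip(tokens, tokens[1:]):
            if left in basic_words and right in basic_words:
                counts[(left, right)] += 1
    return counts


corpus = [["relation", "inference", "works"], ["relation", "inference"]]
pair_counts = count_adjacent_pairs(corpus, {"relation", "inference"})
```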
Those skilled in the art can preset the internet corpus as required, for example as an internet blog corpus, an internet news corpus, an internet forum corpus or the like. Different internet corpora yield different combination information, so the output obtained with the input method system may differ as well. Preferably, the internet corpus can also be replaced to satisfy the different needs of various users.
The basic words can come from various specific closed document collections (for example, traditional news or newspapers), and in practice those skilled in the art may select them as required. Preferably, the basic words are obtained from the dictionary of the input method system. Although that dictionary contains a very large amount of word information, only part of it consists of effective words, that is, common words with a high usage frequency; the rest are rare words or words with a very low usage frequency. Calculating over all the basic words in the dictionary of the input method system would obviously cause problems such as an excessive amount of computation and too much repeated computation.
It should be noted that the dictionary of the input method system here can be any dictionary in the prior art or a combination of such dictionaries, or any dictionary obtained by those skilled in the art according to preset rules. The invention places no limit on where the dictionary is stored, for example on the server side or on the client side. The system dictionary, user-defined dictionary, general dictionary, specialized dictionary and so on of a prior-art input method system all fall within the scope of the dictionary of the input method system of the invention.
Therefore, preferably, the present embodiment further comprises the step of selecting, from the dictionary of the input method system, basic words that meet a preset condition, for example the top 60,000 words in the dictionary ranked by word frequency from high to low. Carrying out the subsequent processing on the selected words effectively avoids invalid and repeated computation and helps save resources and raise efficiency.
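Selecting the highest-frequency words as basic words can be sketched with the standard library. The function name `select_basic_words` and the `word_freq` mapping are assumptions for illustration; the 60,000 default simply mirrors the example figure in the text.

```python
import heapq


def select_basic_words(word_freq, top_n=60000):
    """Return the top_n words of the dictionary ranked by word frequency.

    word_freq: maps each dictionary word to its word frequency
    """
    return set(heapq.nlargest(top_n, word_freq, key=word_freq.get))


basic_words = select_basic_words({"a": 5, "b": 1, "c": 9}, top_n=2)
```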
However, the combination information obtained from the selected basic words may still contain redundant or invalid entries, for example combination information whose adjacent co-occurrence frequency is too low, combination information with a duplicated meaning, or combination information that is partly or wholly covered. Preferably, therefore, the present embodiment further comprises some optimization steps before the multivariate table is generated; these are described in detail below.
It should be noted that one core idea in generating the multivariate table is that, after low-value information has been removed from the combination information according to certain rules, the high-value information is retained as part of the multivariate table. A multivariate table generated from combination information is a table whose row or column variables number two or more. The multivariate table may take the form shown below:
Collocation (word-word) | Connection parameter |
Download-people | 0.0065 |
A lot-wooden horse | 1.6596 |
Sky-be covered with | 18.3775 |
Piece-guided missile | 46.1310 |
Come more-have more | 532276 |
Because of worker-disable | 347646.3438 |
In the table above, the first column shows the collocation relation between the words, and the second column shows the connection parameter of that collocation. The connection parameter may be the adjacent co-occurrence frequency, the co-occurrence probability, the connection strength value or the like. The adjacent co-occurrence frequency can be obtained by counting over the preset internet corpus; the co-occurrence probability can be calculated from the adjacent co-occurrence frequency of the at least two basic words and the word frequencies of the existing words in the dictionary; and the connection strength value can be calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words. The connection parameter can of course be any value that expresses the connection between words, and the invention places no limit on it. The form of the multivariate table can likewise be set as required, and the invention places no limit on that either.
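The text does not give formulas for the connection parameters, so the sketch below uses placeholder formulas purely to show how the three parameters depend on one another. Both formulas are assumptions: the co-occurrence probability is taken as the pair frequency divided by the word frequency of the first word, and the connection strength as the product of frequency and probability. The function name `build_table` is also hypothetical.

```python
def build_table(pair_counts, word_freq):
    """Attach connection parameters to each two-word collocation.

    pair_counts: maps (left, right) to the adjacent co-occurrence frequency
    word_freq:   maps each word to its dictionary word frequency
    """
    table = {}
    for (left, right), freq in pair_counts.items():
        prob = freq / word_freq[left]  # assumed formula, not from the source
        strength = freq * prob         # assumed formula, not from the source
        table[(left, right)] = {"freq": freq, "prob": prob, "strength": strength}
    return table


table = build_table({("a", "b"): 2}, {"a": 4, "b": 10})
```

Any other monotonic combination of the same inputs would fit the description equally well; the point is only that each row pairs a collocation with a numeric parameter.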
In practice, the multivariate table can also be packed and stored in the input method system so that users can download it and install it locally. Those skilled in the art can choose any storage scheme as required or based on experience, and the invention places no limit on this. For example, the combination information and its weights are stored in a file in ascending character order, where the weight can be set according to the adjacent co-occurrence frequency: the larger the adjacent co-occurrence frequency, the larger the weight. A general-purpose compression algorithm, such as RAR or ZIP compression, is then used to pack the file and store it in the input method system.
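The storage scheme described above, sorted entries packed with a general-purpose compressor, can be sketched with Python's standard `zipfile` module. The tab-separated line format, the member name `table.txt` and the function names are assumptions for illustration; the archive is kept in memory so that the example needs no files on disk.

```python
import io
import zipfile


def pack_table(table):
    """Serialize entries in ascending key order and ZIP-compress them."""
    lines = ["\t".join(words) + "\t" + repr(weight)
             for words, weight in sorted(table.items())]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("table.txt", "\n".join(lines))
    return buffer.getvalue()


def unpack_table(blob):
    """Load a packed table back into an in-memory dictionary."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        text = archive.read("table.txt").decode("utf-8")
    table = {}
    for line in text.splitlines():
        fields = line.split("\t")
        table[tuple(fields[:-1])] = float(fields[-1])  # last field is the weight
    return table


original = {("b", "c"): 2.0, ("a", "b"): 1.5}
restored = unpack_table(pack_table(original))
```

Unpacking into a dictionary also corresponds to loading the table into memory once, after which lookups need no further disk access.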
Preferably, before the encoded character string input by the user is received, the present embodiment further comprises the step of loading the multivariate table into a storage device. If the user starts the input method system on a local computer, the multivariate table can be loaded into memory to improve the performance of the input method system. Once it is loaded, all subsequent reads of the data take place in memory without any hard disk operation, which effectively improves the user's input speed and efficiency. If the input method system is a network-based input method, the multivariate table can be loaded into a storage device of the server when the user uses it, and all subsequent reads of the data are served from that storage device.
When the user uses the input method system, the system segments the encoded character string input by the user. The segmentation can be carried out by any segmentation method in the prior art, and the invention places no limit on this.
Preferably, present embodiment can also adopt some optimisation strategy that described input method system is optimized.Below be that example describes with several preferred optimisation strategy.
Optimisation strategy A: the cutting method to described coded string is optimized.For example, adopt branch and bound method that cutting method is carried out beta pruning.
The principle of work of branch and bound method is: at first determine the bound of desired value, cut some branch of search tree while searching for, improve search efficiency.Be applied in the embodiments of the invention, for a coded string, the method for a variety of cuttings arranged, for each cutting method, each coding also has the selection of a variety of possible words, if all calculate, calculated amount will be an astronomical figure.In this case, adopt described branch and bound method that the cutting method of each possible words is carried out probability calculation,, just stop current calculating, select a kind of down possibility if find that the possibility of this cutting method optimum is very little.Can effectively reduce calculated amount by described optimisation strategy A, the assurance system at the appointed time exports the result in the scope, thereby has effectively improved the treatment effeciency of system.
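A minimal sketch of branch-and-bound pruning over segmentations is given below. It is not the implementation of this embodiment. It assumes a hypothetical `lexicon` that maps a code segment to (word, probability) pairs and scores a segmentation by the product of those probabilities. Since every probability is at most 1, the product of a partial path is an upper bound on any completion, so a branch can be cut as soon as that product no longer exceeds the best complete result found so far.

```python
def best_segmentation(code, lexicon):
    """Return (probability, words) for the most probable split of `code`.

    Returns (0.0, None) when the code string cannot be segmented at all.
    """
    best = [0.0, None]  # best probability and word list found so far

    def search(pos, prob, words):
        if prob <= best[0]:
            return  # bound: this branch cannot beat the current best
        if pos == len(code):
            best[0], best[1] = prob, list(words)
            return
        for end in range(len(code), pos, -1):  # try longer segments first
            for word, p in lexicon.get(code[pos:end], ()):
                words.append(word)
                search(end, prob * p, words)
                words.pop()

    search(0, 1.0, [])
    return best[0], best[1]

# Hypothetical lexicon entries with illustrative probabilities.
lexicon = {
    "zhongguo": [("China", 0.6)],
    "zhong": [("middle", 0.5)],
    "guo": [("country", 0.4)],
}
result = best_segmentation("zhongguo", lexicon)
```

In this example the single segment "zhongguo" is found first with probability 0.6, after which the branch starting with "zhong" (0.5) is cut without being expanded.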
Of course, those skilled in the art may preset various optimization strategies as needed or based on experience, and the present invention places no limitation on this.
Preferably, this embodiment can also include the step of calculating the co-occurrence probability from the adjacent co-occurrence frequency of the candidate words and the word frequencies of existing words in the dictionary of the input method system, sorting by the co-occurrence probability, and outputting the sorted result as candidate items. Of course, other conditions may be added to the sorting as needed, and the present invention places no limitation on this.
A preferred method of calculating the co-occurrence probability is given below as an example:
$$P(w_1, w_2, w_3, \ldots, w_n) = P(w_1) \cdot P(w_2) \cdots P(w_n) \cdot P(w_1, w_2) \cdot P(w_2, w_3) \cdots P(w_{n-1}, w_n);$$
where $w_n$ is a base word, $P(w_n)$ is the probability of that base word, and $P(w_{n-1}, w_n)$ is the probability of the collocation relation between two adjacent base words. It follows that, for two or more base words, this embodiment considers the collocation relation between every pair of adjacent base words and then takes the product of all the probabilities.
For example, for two base words A and B, the co-occurrence probability is the product of the probability of A, the probability of B, and the probability that A and B occur together (AB). For three base words A, B and C, the co-occurrence probability is the product of the probabilities of A, AB, B, BC and C.
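The product described above can be written as a short function. This is a sketch: the dictionaries `unigram` and `bigram`, their illustrative values, and the function name are assumptions made for the example, and a missing word or pair simply raises `KeyError` here, since this description does not say how missing entries should be handled.

```python
from math import prod

def cooccurrence_probability(words, unigram, bigram):
    """Product of every word probability and every adjacent-pair probability."""
    singles = prod(unigram[w] for w in words)
    pairs = prod(bigram[(a, b)] for a, b in zip(words, words[1:]))
    return singles * pairs

# Illustrative probabilities only.
unigram = {"A": 0.5, "B": 0.25, "C": 0.5}
bigram = {("A", "B"): 0.5, ("B", "C"): 0.25}

p_ab = cooccurrence_probability(["A", "B"], unigram, bigram)        # P(A) * P(B) * P(A,B)
p_abc = cooccurrence_probability(["A", "B", "C"], unigram, bigram)  # adds P(C) and P(B,C)
```

For a single base word the pair product is empty, so the result reduces to the probability of that word.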
The above algorithm is one statistical algorithm for the co-occurrence probability. Those skilled in the art may also adopt other methods as needed and based on experience, such as directly storing an N-ary matrix. The above method is given only as an example, and the present invention is not limited to it.
As another embodiment, when the multivariate table stores the co-occurrence probability, this embodiment can include the step of calculating a weight value from the co-occurrence probability of the candidate words and the word frequencies of existing words in the dictionary of the input method system, sorting by the weight value, and outputting the sorted result as candidate items. Preferably, the co-occurrence probability is calculated from the adjacent co-occurrence frequency of the at least two base words and the word frequencies of existing words in the dictionary of the input method system. The co-occurrence probability may be obtained with the method in the example above or with other methods in the prior art, and the present invention places no limitation on this.
As another embodiment, when the multivariate table stores the connection strength value, this embodiment can include the step of calculating a weight value from the connection strength value of the candidate words and the word frequencies of existing words in the dictionary of the input method system, sorting by the weight value, and outputting the sorted result as candidate items. Preferably, the connection strength value is calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two base words.
Of course, the multivariate table can also store any other numerical value that expresses the connection relation between words. Those skilled in the art may choose one based on experience or need, and the present invention places no limitation on this.
One possible situation is that the user adds new coded input to a previously input coded string. For this situation, this embodiment can also apply optimization strategy B: obtain the corresponding combined information from the multivariate table only for the coded string newly added by the user. System computation is thus limited to the changed part, which avoids repeated work. For example, the user inputs the pinyin coded string zhongguorenminjiefang (Chinese people's liberation) and then inputs the letter j. With optimization strategy B, the system obtains the corresponding combined information (such as "army", "monarch", "machine" and so on) from the multivariate table only for the newly added letter "j", and does not need to obtain again the combined information corresponding to the preceding pinyin coded string "zhongguorenminjiefang".
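Optimization strategy B can be illustrated with a small stateful lookup. The class below is a hypothetical sketch, not the structure of this embodiment: it assumes the multivariate table is a dict keyed by code segment, keeps the results already obtained for the earlier input, and queries the table only for the newly added suffix.

```python
class IncrementalLookup:
    """Query the table only for the part of the code string that is new."""

    def __init__(self, table):
        self.table = table   # maps a code segment to its combined information
        self.code = ""       # code string processed so far
        self.results = []    # one (segment, combined information) entry per increment
        self.lookups = 0     # number of table queries actually performed

    def update(self, code):
        if code.startswith(self.code):
            added = code[len(self.code):]
        else:
            # Not an extension of the previous input: discard and start over.
            self.results, added = [], code
        self.code = code
        if added:
            self.lookups += 1
            self.results.append((added, self.table.get(added, [])))
        return self.results

# Hypothetical table contents taken from the example in the text.
table = {
    "zhongguorenminjiefang": ["Chinese people's liberation"],
    "j": ["army", "monarch", "machine"],
}
lookup = IncrementalLookup(table)
lookup.update("zhongguorenminjiefang")
results = lookup.update("zhongguorenminjiefangj")
```

After the second call only the segment "j" has been queried again; the entry for the first twenty-one letters is reused.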
To improve the effective utilization of the input method system, this embodiment can also apply optimization strategy C: preset a computation time for the system, such as 100 ms or 50 ms, so that the system finishes its calculation within the preset time. If the preset time has elapsed and the calculation is not yet finished, the partial results that have already been computed are displayed on screen. For example, the user inputs the pinyin coded string "renshengzigushuiwusi". When 50 ms have elapsed, the input method system of the present invention has only obtained the candidate words corresponding to "renshengzigushuiwu", such as "since ancient times, who in life has not", "life", "voice", "shy with strangers" and so on, and the calculation for "si" is not yet finished. With optimization strategy C, the input method system displays only the candidate words already obtained. One of the core ideas of this approach is to handle the background processing of the input method system separately from foreground control, which ensures that the input method system behaves the same when installed on different machines or under different loads on the same machine.
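The time budget of optimization strategy C can be sketched as a deadline check between units of work. This is a simplification with hypothetical names (`lookup_with_deadline`, `segments`, `lookup`): it checks the clock only between segments and does not interrupt a lookup that is already running, whereas the text describes separating background processing from foreground control.

```python
import time

def lookup_with_deadline(segments, lookup, budget_seconds=0.05):
    """Process segments in order and return whatever is finished when time runs out."""
    deadline = time.monotonic() + budget_seconds
    results = []
    for segment in segments:
        if time.monotonic() >= deadline:
            break  # out of time: display the partial results computed so far
        results.append(lookup(segment))
    return results

# With a generous budget every segment is processed; str.upper stands in for a real lookup.
complete = lookup_with_deadline(["rensheng", "zigu", "shuiwu", "si"], str.upper, 5.0)
# With a zero budget the deadline has already passed before the first segment.
partial = lookup_with_deadline(["rensheng", "zigu"], str.upper, 0.0)
```

`time.monotonic()` is used rather than wall-clock time so that system clock adjustments cannot lengthen or shorten the budget.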
Preferably, optimization strategies A, B and C are used in combination in the input method system. Of course, those skilled in the art may adopt only one optimization strategy or several of them, in any combination. In addition, those skilled in the art may define other optimization strategies as needed, and the present invention places no limitation on this.
To make the input method system easier to transmit over a network, reduce the memory resources it occupies on the user's machine, and improve processing efficiency, this embodiment can also compare the combined information in the multivariate table with the words in the dictionary of the input method system. If the dictionary contains a word that duplicates a piece of combined information, that word is removed from the input method dictionary. For example, for the pinyin shangwuhuiyi, the corresponding combined information is "business meeting", "morning meeting", "noon memories", "morning memories", "business is understanding" and so on. If the dictionary of the input method system contains a corresponding word "business meeting", it duplicates "business meeting" in the combined information, and in this case "business meeting" can be removed from the dictionary.
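The comparison between dictionary and multivariate table can be sketched as a set difference. The data layout is an assumption made for illustration: both structures are taken to map a code string to a list of words, and the function name `prune_dictionary` is hypothetical.

```python
def prune_dictionary(dictionary, table):
    """Drop dictionary words that duplicate combined information for the same code."""
    covered = {(code, word) for code, words in table.items() for word in words}
    pruned = {}
    for code, words in dictionary.items():
        kept = [w for w in words if (code, w) not in covered]
        if kept:  # codes left with no words are dropped entirely
            pruned[code] = kept
    return pruned

# Hypothetical contents based on the example in the text.
table = {"shangwuhuiyi": ["business meeting", "morning meeting"]}
dictionary = {"shangwuhuiyi": ["business meeting"], "shangwu": ["business", "morning"]}
smaller = prune_dictionary(dictionary, table)
```

The original dictionary is left unmodified, so the pruned copy can be compared against it before it is packaged.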
Referring to Fig. 2, a flowchart of a preferred embodiment of the intelligent word input method of the present invention is shown. It comprises preset steps and input steps, specifically:
1. Preset steps:
Step 201, obtain internet web pages by web crawler technology;
For example, dozens of web crawler servers crawl nearly 4 billion up-to-date web pages on the internet in real time according to a list of website domain names. These web pages can include web content such as internet news, forums, blogs, chat rooms and so on.
Step 202, select the web page information that meets preset conditions, and save it to form the internet corpus;
For example, 40 million internet web pages are selected, and the resulting massive web page corpus, whose raw size exceeds 1 terabyte, serves as the internet corpus.
Because this embodiment bases the output words on open internet information that changes in real time, the generated multivariate table can accurately reflect trends in how people use language and can ensure that the combined information is accurate, representative and comprehensive. This raises the first-choice hit rate when the user inputs multiple words, phrases, short sentences or long sentences, and in turn effectively improves the user's input efficiency.
Of course, those skilled in the art may preset the internet corpus by any method as needed or based on experience, and the present invention places no limitation on this. The method of presetting the internet corpus can also serve as a method of updating it, for example by updating the internet corpus to a news corpus, a blog corpus or a forum corpus, and the present invention places no limitation on this either.
Step 203, select base words that meet preset conditions from the dictionary of the input method system;
For example, the top 60,000 words in the dictionary of the input method system are selected in descending order of word frequency.
Step 204, obtain, from the preset internet corpus, combined information in which at least two base words co-occur adjacently;
wherein the combined information comprises the collocation relation between the at least two base words and their adjacent co-occurrence frequency;
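Step 204 can be sketched for the two-word case as a count of adjacent pairs. The sketch assumes the corpus has already been segmented into lists of words, which this description does not detail, and the names `count_adjacent_pairs`, `sentences` and `base_words` are hypothetical. It returns raw counts; how counts are normalized into a frequency is left open here.

```python
from collections import Counter

def count_adjacent_pairs(sentences, base_words):
    """Count how often two base words appear next to each other in the corpus."""
    counts = Counter()
    for tokens in sentences:
        for a, b in zip(tokens, tokens[1:]):
            if a in base_words and b in base_words:
                counts[(a, b)] += 1
    return counts

# A toy corpus of pre-segmented sentences.
sentences = [["I", "am", "happy"], ["I", "am", "tired"]]
base_words = {"I", "am", "happy"}
pairs = count_adjacent_pairs(sentences, base_words)
```

Here the pair ("am", "tired") is not counted because "tired" is not among the selected base words.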
Step 205, if the adjacent co-occurrence frequency in a piece of combined information is lower than a certain threshold, remove that combined information;
For example, if the adjacent co-occurrence frequency of a piece of combined information is lower than 0.001, that combined information is removed. Removing combined information whose adjacent co-occurrence frequency is below the threshold does not affect the user's normal operation, but it saves system resources and lightens the system load, thereby effectively improving the processing efficiency of the system.
Step 206, if the words corresponding to a piece of combined information consist of two or more words that each have the highest word frequency, remove that combined information;
For example, for the pinyin qinghuadaxuebiye, the combined information obtained is "Tsinghua University graduation". However, in the dictionary of the input method system the first-choice word for the pinyin "qinghua" is "Tsinghua", the first-choice word for the pinyin "daxue" is "university", and the first-choice word for the pinyin "biye" is "graduation". In this case the first-choice result is unaffected even if this combined information does not exist, so this combined information can be removed.
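The rule of step 206 amounts to checking whether every component of a combination is already the first-choice word for its code. In the sketch below, a combination is modeled as a list of (code, word) pairs and `first_choice` maps each code to its highest-frequency dictionary word; both representations and the function name are assumptions made for illustration.

```python
def is_redundant(combination, first_choice):
    """True when each word is already the top dictionary word for its code."""
    return all(first_choice.get(code) == word for code, word in combination)

first_choice = {"qinghua": "Tsinghua", "daxue": "university", "biye": "graduation"}
combination = [("qinghua", "Tsinghua"), ("daxue", "university"), ("biye", "graduation")]
removable = is_redundant(combination, first_choice)

# If any component is not the first choice, the combination still carries information.
needed = not is_redundant([("qinghua", "Tsinghua"), ("daxue", "snow")], first_choice)
```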
Step 207, if a piece of combined information is partly or entirely covered by another piece of combined information, remove that combined information;
For example, for the pinyin wohenkaixin, the combined information obtained is "I am very happy". Suppose that for the pinyin henkaixin there is already a piece of combined information "very happy". Because the first-choice word for the pinyin "wo" in the dictionary of the input method system is "I", the combined information "very happy" partly covers the combined information "I am very happy". In this case the first-choice result is unaffected even if the combined information "I am very happy" does not exist, so it can be removed. It will be understood that combined information that is an exact duplicate can also be removed.
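One way to read the coverage rule of step 207 is sketched below. It is an interpretation, not the method of this embodiment: a combination is modeled as a tuple of (code, word) pairs, `others` is a set of the other stored combinations in the same form, and a combination counts as covered when some contiguous run of two or more pairs is itself stored and every remaining pair is the first-choice word for its code. An exact duplicate is the special case in which nothing remains.

```python
def is_covered(combination, others, first_choice):
    """True when another stored combination plus first-choice words reproduces this one."""
    combination = tuple(combination)
    n = len(combination)
    for start in range(n):
        for end in range(start + 2, n + 1):
            if combination[start:end] not in others:
                continue
            rest = combination[:start] + combination[end:]
            if all(first_choice.get(code) == word for code, word in rest):
                return True
    return False

# Hypothetical data modeled on the example in the text.
long_combo = (("wo", "I"), ("hen", "very"), ("kaixin", "happy"))
others = {(("hen", "very"), ("kaixin", "happy"))}
covered = is_covered(long_combo, others, {"wo": "I"})
```

If the first-choice word for "wo" were something other than "I", the longer combination would not be covered and would be kept.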
Steps 205 to 207 effectively eliminate redundant and invalid information from the combined information, which helps lighten the system load, save system space and resources, and improve the effective utilization of the system.
It should be noted that steps 205 to 207 can be applied individually or in any combination as needed. That is, those skilled in the art may adopt only one of these steps or several of them, in any combination and in any order. In addition, those skilled in the art may define other preset rules as needed, and the present invention places no limitation on this. For example, another possible selection rule is to remove combined information whose string length is less than or equal to a preset threshold (for example, input the user did not intend).
Step 208, generate the multivariate table from the combined information that remains after filtering.
2. Input steps:
Step 209, load the multivariate table into a storage device;
Step 210, receive the coded string input by the user, and segment the coded string;
Here, the segmentation method for the coded string can also be optimized by a segmentation optimization unit, for example by using the branch and bound method to prune the segmentation search.
Step 211, obtain the corresponding combined information from the multivariate table according to the segmented coded string, and extract the words of the corresponding collocation relation in the combined information as candidate words.
If the user adds new coded input to a previously input coded string, this embodiment can also obtain the corresponding combined information from the multivariate table according to the newly added coded string only. System processing is thus limited to the changed part, which avoids repeated work.
Step 212, calculate the co-occurrence probability from the adjacent co-occurrence frequency of the candidate words and the word frequencies of existing words in the dictionary of the input method system;
Step 213, sort by the co-occurrence probability, and output the sorted result as candidate items.
As another embodiment, when the multivariate table stores the co-occurrence probability, steps 212 and 213 can be: calculate a weight value from the co-occurrence probability of the candidate words and the word frequencies of existing words in the dictionary of the input method system; sort by the weight value, and output the sorted result as candidate items.
As another embodiment, when the multivariate table stores the connection strength value, steps 212 and 213 can be: calculate a weight value from the connection strength value of the candidate words and the word frequencies of existing words in the dictionary of the input method system; sort by the weight value, and output the sorted result as candidate items.
For details not described for the method shown in Fig. 2, refer to the corresponding description earlier in this specification.
Referring to Fig. 3, a structural block diagram of an embodiment of an input method system of the present invention is shown. The system comprises an input interface unit 301 and a display unit 302, and further comprises:
Multivariate table 303: generated from combined information in which at least two base words co-occur adjacently; the combined information is obtained from a preset internet corpus and comprises the collocation relation between the at least two base words and their adjacent co-occurrence frequency;
Segmentation unit 304: used to segment the coded string input by the user;
Extraction unit 305: used to obtain the corresponding combined information from the multivariate table according to the segmented coded string, and to extract the words of the corresponding collocation relation in the combined information as candidate words.
Preferably, the input method system also comprises a first output unit, used to calculate the co-occurrence probability from the adjacent co-occurrence frequency of the candidate words and the word frequencies of existing words in the dictionary of the input method system, to sort by the co-occurrence probability, and to output the sorted result as candidate items.
As another embodiment, when the multivariate table stores the co-occurrence probability, the input method system also comprises a second output unit, used to calculate a weight value from the co-occurrence probability of the candidate words and the word frequencies of existing words in the dictionary of the input method system, to sort by the weight value, and to output the sorted result as candidate items. The co-occurrence probability is calculated from the adjacent co-occurrence frequency of the at least two base words and the word frequencies of existing words in the dictionary of the input method system.
As another embodiment, when the multivariate table stores the connection strength value, the input method system also comprises a third output unit, used to calculate a weight value from the connection strength value of the candidate words and the word frequencies of existing words in the dictionary of the input method system, to sort by the weight value, and to output the sorted result as candidate items. The connection strength value is calculated from the adjacent co-occurrence frequency and the co-occurrence probability of the at least two base words.
Of course, the multivariate table can also store any other numerical value that expresses the connection relation between words. Those skilled in the art may choose one based on experience or need, and the present invention places no limitation on this.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
Because this embodiment uses the internet corpus as the basis for the words output by the input method system, the generated combined information can accurately reflect trends in how people use language and can ensure that the combined information is accurate, representative and comprehensive. This raises the first-choice hit rate when the user inputs multiple words, phrases, short sentences or long sentences, and in turn effectively improves the user's input efficiency.
Preferably, the input method system can also comprise a loading unit, used to load the multivariate table into a storage device. The storage device can be a storage device on the client side or a storage device on the server side.
To avoid invalid and repeated computation in this embodiment, save system resources effectively, and improve the processing efficiency of the system, the input method system can also comprise the following system optimization units:
Segmentation optimization unit: used to optimize the segmentation method for the coded string;
And/or, newly-added acquisition unit: used to obtain the corresponding combined information from the multivariate table according to the newly added coded string.
The above system optimization units can also be used in any combination as needed. Those skilled in the art may use only one system optimization unit or several of them, in any combination. In addition, those skilled in the art may define other system optimization units as needed, and the present invention places no limitation on this.
To make the input method system easier to transmit over a network, reduce the memory resources it occupies on the user's machine, and improve processing efficiency, this embodiment can also compare the combined information in the multivariate table with the words in the dictionary of the input method system. If the dictionary contains a word that duplicates a piece of combined information, that word is removed from the input method dictionary. The generated installation package of the input method system is therefore smaller, which greatly lowers the barrier to use, reduces the storage space occupied on the user's machine, and effectively improves the efficiency of the system.
The input method system shown in Fig. 3 can be an ordinary input method system, in which case the input interface unit, the display unit and the multivariate table of the input method system are located in the same computing device;
The input method system shown in Fig. 3 can also be a network input method system, in which case the input interface unit and the display unit of the input method system are located in a first computing device and the multivariate table is located in a second computing device. According to the information input by the user, the input method system obtains the corresponding combined information from the multivariate table in the second computing device and displays the corresponding words on the first computing device.
Because the system shown in Fig. 3 corresponds to the method embodiments described above, its description is relatively brief; for details not covered here, refer to the corresponding description earlier in this specification.
Referring to Fig. 4, a structural block diagram of an embodiment of a device for generating a multivariate table according to the present invention is shown. The device comprises the following modules:
Acquisition module 401: used to obtain, from the preset internet corpus, combined information in which at least two base words co-occur adjacently;
wherein the combined information comprises the collocation relation between the at least two base words and their adjacent co-occurrence frequency;
Generation module 402: used to generate the multivariate table from the combined information.
To avoid invalid and repeated computation, the device of this embodiment preferably also comprises a selection module 403, used to select base words that meet preset conditions from the dictionary of the input method system.
One of the core ideas of the multivariate table of the present invention is that low-value information is removed from the combined information according to certain rules, and the high-value information that remains forms the multivariate table. Preferably, the device of this embodiment can also comprise: a first removal module 404, used to remove a piece of combined information when its adjacent co-occurrence frequency is lower than a certain threshold; and/or a second removal module 405, used to remove a piece of combined information when its corresponding words consist of two or more words that each have the highest word frequency; and/or a third removal module 406, used to remove a piece of combined information when it is partly or entirely covered by another piece of combined information. The removal modules 404 to 406 can be used individually or in combination as needed, and the present invention places no limitation on this.
So that the generated multivariate table accurately reflects trends in how people use language and the combined information is representative and comprehensive, thereby raising the first-choice hit rate when the user inputs multiple words, phrases, short sentences or long sentences, the device of this embodiment preferably also comprises a web page acquisition module 407, used to obtain internet web pages by web crawler technology, and a corpus generation module 408, used to select the web page information that meets preset conditions and save it to form the internet corpus. More preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus. It can also be set, updated and changed arbitrarily by those skilled in the art, and the present invention places no limitation on this.
Referring to Fig. 5, a flowchart of a preferred embodiment of generating a multivariate table using the device shown in Fig. 4 is shown, comprising the following steps:
The internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus. It can also be set, updated and changed arbitrarily by those skilled in the art.
For example, the top 60,000 words in the dictionary of the input method system are selected in descending order of word frequency.
The combined information comprises the collocation relation between the at least two base words and their adjacent co-occurrence frequency;
Because the method shown in Fig. 5 corresponds to the method and system embodiments described above, its description is relatively brief; for details not covered here, refer to the corresponding description earlier in this specification.
Referring to Fig. 6, a flowchart of embodiment 1 of updating an input method system according to the present invention is shown, comprising the following steps:
Those skilled in the art may choose any algorithm for updating the internet corpus based on experience and need, and this embodiment places no limitation on this.
Preferably, the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus. It can also be set, updated and changed arbitrarily by those skilled in the art.
The combined information comprises the collocation relation between the at least two base words and their adjacent co-occurrence frequency.
Referring to Fig. 7, a flowchart of embodiment 2 of updating an input method system according to the present invention is shown, comprising the following steps:
The internet corpus can be an internet blog corpus, an internet news corpus and/or an internet forum corpus. It can also be set, updated and changed arbitrarily by those skilled in the art.
For example, the top 60,000 words in the dictionary of the input method system are selected in descending order of word frequency.
The combined information comprises the collocation relation between the at least two base words and their adjacent co-occurrence frequency.
Step 704, if the adjacent co-occurrence frequency in a piece of combined information is lower than a certain threshold, remove that combined information;
Step 705, if the words corresponding to a piece of combined information consist of two or more words that each have the highest word frequency, remove that combined information;
Step 706, if a piece of combined information is partly or entirely covered by another piece of combined information, remove that combined information;
Step 707, generate the multivariate table from the combined information that remains after filtering;
As another embodiment, steps 704 to 706 can be applied individually or in combination as needed, and the present invention places no limitation on this.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related description in the preceding parts. Several embodiments of the present invention have been listed above; those skilled in the art can combine and select among them as appropriate to the situation to obtain the full technical effect of the present invention. Any combination based on the above embodiments is an embodiment of the present invention, but for reasons of space this specification does not describe each of them in detail.
Because the methods shown in Fig. 6 and Fig. 7 correspond to the method and system embodiments described above, their description is relatively brief; for details not covered here, refer to the corresponding description earlier in this specification.
The intelligent word input method, the input method system, the device for generating a multivariate table and the method of updating an input method system provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principle and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the present invention.
Claims (28)
1, a kind of method of intelligent word input is characterized in that, comprising:
From the internet corpus that presets, it is adjacent with existing combined information to obtain at least two basic words, and described combined information comprises collocation relation and the adjacent frequency that shows together between described at least two basic words;
Generate multivariate table according to described combined information;
Receive the coded string of user's input, and described coded string is carried out cutting;
In described multivariate table, obtain corresponding combined information according to the coded string after the described cutting, and the corresponding words that extracts corresponding collocation relation in the described combined information is a prepare word.
2. The method according to claim 1, characterized by further comprising:
Calculating a co-occurrence probability according to the adjacent co-occurrence frequency of the preselected words and the word frequency of existing words in the dictionary of the input method system, sorting according to the co-occurrence probability, and outputting the sorted results as candidate items.
3. The method according to claim 1, characterized in that the multivariate table stores a co-occurrence probability, the co-occurrence probability being calculated according to the adjacent co-occurrence frequency of the at least two basic words and the word frequency of existing words in the dictionary of the input method system, and the method further comprises:
Calculating a weighted value according to the co-occurrence probability of the preselected words and the word frequency of existing words in the dictionary of the input method system, sorting according to the weighted value, and outputting the sorted results as candidate items.
4. The method according to claim 1, characterized in that the multivariate table stores a connection strength value, the connection strength value being calculated according to the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words, and the method further comprises:
Calculating a weighted value according to the connection strength value of the preselected words and the word frequency of existing words in the dictionary of the input method system, sorting according to the weighted value, and outputting the sorted results as candidate items.
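Claims 2 to 4 do not give a formula for the co-occurrence probability or the weighted value. As a rough sketch under an explicit assumption, the probability below is taken to be the adjacent co-occurrence frequency of a pair divided by the dictionary word frequency of its first word; this choice and the names `cooccurrence_probability` and `rank_candidates` are hypothetical, not the patent's definitions.

```python
def cooccurrence_probability(pair_freq, left_word_freq):
    """Assumed estimate: pair frequency divided by frequency of the first word."""
    return pair_freq / left_word_freq if left_word_freq else 0.0


def rank_candidates(candidates, word_freq):
    """Sort preselected words by the assumed co-occurrence probability.

    candidates: list of (left_word, right_word, adjacent_cooccurrence_freq).
    word_freq: word frequencies taken from the input method's dictionary.
    Returns a list of (combined_word, probability), best candidate first.
    """
    scored = []
    for left, right, pair_freq in candidates:
        p = cooccurrence_probability(pair_freq, word_freq.get(left, 0))
        scored.append((left + right, p))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

A weighted value as in claims 3 and 4 could be obtained by combining this probability (or a stored connection strength value) with the word frequency before sorting; the claims leave the exact combination open.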
5. The method according to claim 1, characterized by further comprising:
Selecting, from the dictionary of the input method system, basic words that meet a preset condition.
6. The method according to any one of the preceding claims, characterized in that, before the multivariate table is generated, the method further comprises:
If the adjacent co-occurrence frequency in a piece of combined information is lower than a certain threshold, removing that combined information;
If the corresponding words in a piece of combined information consist of two or more highest-frequency words, removing that combined information;
If a piece of combined information is partly or wholly covered by another piece of combined information, removing that combined information.
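The three removal rules of claim 6 can be sketched as a filter that runs before the table is generated. This is one possible reading, not the patent's definition: "highest-frequency words" is interpreted as words that are already the top choice for their own code, and "covered" is interpreted as the combined text being contained in a longer surviving combination. The names `filter_combinations`, `min_freq`, and `top_words` are hypothetical.

```python
def filter_combinations(entries, min_freq, top_words):
    """Apply the three removal rules to a list of combined-information entries.

    entries: list of dicts {"words": tuple of basic words, "freq": int}.
    min_freq: threshold for the adjacent co-occurrence frequency (rule 1).
    top_words: words assumed to be the highest-frequency choice for their code.
    """
    # Rule 1: drop low-frequency combinations.
    # Rule 2: drop combinations made up only of highest-frequency words.
    survivors = [
        e for e in entries
        if e["freq"] >= min_freq
        and not all(w in top_words for w in e["words"])
    ]
    # Rule 3 (assumed reading): drop a combination whose text is contained
    # in the text of another, longer surviving combination.
    texts = ["".join(e["words"]) for e in survivors]
    kept = []
    for entry, text in zip(survivors, texts):
        covered = any(text != other and text in other for other in texts)
        if not covered:
            kept.append(entry)
    return kept
```

Running rule 3 only on the survivors of rules 1 and 2 is a design choice of this sketch; it prevents a combination from being removed because of another one that is itself discarded.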
7. The method according to any one of the preceding claims, characterized in that the internet corpus is preset by the following steps:
Obtaining internet web pages by web crawler technology;
Selecting the web page information that meets a preset condition, and saving it to form the internet corpus.
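As a minimal sketch of the corpus-building steps in claim 7, the function below takes the page-fetching step as a parameter so that no particular crawler is implied. The preset condition is not specified in the claim; a minimum number of Chinese characters per page is assumed here purely for illustration, and `build_corpus`, `fetch`, and `min_cjk_chars` are hypothetical names.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")   # basic CJK unified ideographs
TAG = re.compile(r"<[^>]+>")           # crude tag stripper, for illustration only


def build_corpus(urls, fetch, min_cjk_chars=20):
    """Keep the text of pages that meet the assumed preset condition.

    urls: iterable of page addresses.
    fetch: callable returning the raw HTML of a page (the crawler stand-in).
    min_cjk_chars: assumed condition, the minimum count of Chinese characters.
    """
    corpus = []
    for url in urls:
        text = TAG.sub("", fetch(url))
        if len(CJK.findall(text)) >= min_cjk_chars:
            corpus.append(text.strip())
    return corpus
```

In a real system `fetch` would wrap an HTTP client and the condition might also consider the page source (blog, news, forum), as suggested by claim 8.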
8. The method according to claim 7, characterized in that the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
9. The method according to claim 1, characterized in that, before the coded string input by the user is received, the method further comprises the step of loading the multivariate table into a memory device.
10. The method according to claim 1, characterized by further comprising:
Optimizing the method of segmenting the coded string.
11. The method according to claim 1 or 10, characterized by further comprising:
Obtaining the corresponding combined information from the multivariate table according to a coded string newly added by the user.
12. An input method system, comprising an input interface unit and a display unit, characterized in that the input method system further comprises:
A multivariate table: the multivariate table is generated from combined information on at least two basic words that co-occur adjacently; the combined information is obtained from a preset internet corpus and comprises the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
A segmentation unit: configured to segment the coded string input by a user;
An extraction unit: configured to obtain the corresponding combined information from the multivariate table according to the segmented coded string, and to extract the words corresponding to the collocation relation in the combined information as preselected words.
13. The system according to claim 12, characterized in that the input method system further comprises:
A first output unit: configured to calculate a co-occurrence probability according to the adjacent co-occurrence frequency of the preselected words and the word frequency of existing words in the dictionary of the input method system, to sort according to the co-occurrence probability, and to output the sorted results as candidate items.
14. The system according to claim 12, characterized in that the multivariate table stores a co-occurrence probability, the co-occurrence probability being calculated according to the adjacent co-occurrence frequency of the at least two basic words and the word frequency of existing words in the dictionary of the input method system, and the input method system further comprises:
A second output unit: configured to calculate a weighted value according to the co-occurrence probability of the preselected words and the word frequency of existing words in the dictionary of the input method system, to sort according to the weighted value, and to output the sorted results as candidate items.
15. The system according to claim 12, characterized in that the multivariate table stores a connection strength value, the connection strength value being calculated according to the adjacent co-occurrence frequency and the co-occurrence probability of the at least two basic words, and the input method system further comprises:
A third output unit: configured to calculate a weighted value according to the connection strength value of the preselected words and the word frequency of existing words in the dictionary of the input method system, to sort according to the weighted value, and to output the sorted results as candidate items.
16. The system according to claim 12, characterized in that the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
17. The system according to claim 12, characterized in that the input method system further comprises:
A loading unit: configured to load the multivariate table into a memory device.
18. The system according to claim 12, characterized in that the input method system further comprises:
A segmentation optimization unit: configured to optimize the method of segmenting the coded string.
19. The system according to claim 12 or 18, characterized in that the input method system further comprises:
A new-input acquiring unit: configured to obtain the corresponding combined information from the multivariate table according to a coded string newly added by the user.
20. The system according to claim 12, characterized in that the input interface unit, the display unit and the multivariate table of the input method system are located in the same computing device;
Or, the input interface unit and the display unit of the input method system are located in a first computing device and the multivariate table is located in a second computing device; according to the information input by the user, the input method system obtains the corresponding combined information from the multivariate table located in the second computing device and displays the corresponding words on the first computing device.
21. A device for generating a multivariate table, characterized by comprising:
An acquisition module: configured to obtain, from a preset internet corpus, combined information on at least two basic words that co-occur adjacently, the combined information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
A generation module: configured to generate a multivariate table according to the combined information.
22. The device according to claim 21, characterized by further comprising:
A selection module: configured to select, from the dictionary of the input method system, basic words that meet a preset condition.
23. The device according to claim 21 or 22, characterized by further comprising:
A first removal module: configured to remove a piece of combined information when its adjacent co-occurrence frequency is lower than a certain threshold;
And/or a second removal module: configured to remove a piece of combined information when its corresponding words consist of two or more highest-frequency words;
And/or a third removal module: configured to remove a piece of combined information when it is partly or wholly covered by another piece of combined information.
24. The device according to claim 21 or 22, characterized by further comprising:
A web page acquisition module: configured to obtain internet web pages by web crawler technology;
A corpus generation module: configured to select the web page information that meets a preset condition, and to save it to form the internet corpus.
25. The device according to claim 24, characterized in that the internet corpus is an internet blog corpus, an internet news corpus and/or an internet forum corpus.
26. A method for updating an input method system, characterized by comprising:
Updating the internet corpus;
Obtaining, from the preset internet corpus, combined information on at least two basic words that co-occur adjacently, the combined information comprising the collocation relation between the at least two basic words and their adjacent co-occurrence frequency;
Generating a multivariate table according to the combined information;
Sending the multivariate table to the input method system.
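The update flow of claim 26 can be illustrated with a small sketch. It assumes, beyond what the claim states, that the corpus is pre-split into basic words, that only word pairs are counted, and that the table is sent as a JSON string; `generate_table`, `make_update_payload`, and `apply_update` are hypothetical names, and the transport itself is left out.

```python
import json
from collections import Counter


def generate_table(corpus_sentences):
    """Count adjacent word pairs; keys are 'left right' strings for JSON."""
    counts = Counter()
    for words in corpus_sentences:
        counts.update(zip(words, words[1:]))
    # Words are assumed to contain no spaces, so a space is a safe separator.
    return {f"{a} {b}": n for (a, b), n in counts.items()}


def make_update_payload(old_corpus, new_sentences):
    """Server side: update the corpus, regenerate the table, serialize it."""
    corpus = old_corpus + new_sentences
    table = generate_table(corpus)
    payload = json.dumps(table, ensure_ascii=False, sort_keys=True)
    return corpus, payload


def apply_update(payload):
    """Client side: replace the local multivariate table with the new one."""
    return json.loads(payload)
```

Regenerating the whole table on every update keeps the sketch short; an incremental update that sends only changed entries would be an equally valid design.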
27. The method according to claim 26, characterized by further comprising:
Selecting, from the dictionary of the input method system, basic words that meet a preset condition.
28. The method according to claim 26 or 27, characterized in that, before the multivariate table is generated, the method further comprises:
If the adjacent co-occurrence frequency in a piece of combined information is lower than a certain threshold, removing that combined information;
If the corresponding words in a piece of combined information consist of two or more highest-frequency words, removing that combined information;
If a piece of combined information is partly or wholly covered by another piece of combined information, removing that combined information.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100792674A CN100458795C (en) | 2007-02-13 | 2007-02-13 | Intelligent word input method and input method system and updating method thereof |
PCT/CN2008/070270 WO2008098507A1 (en) | 2007-02-13 | 2008-02-03 | An input method of combining words intelligently, input method system and renewing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100792674A CN100458795C (en) | 2007-02-13 | 2007-02-13 | Intelligent word input method and input method system and updating method thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101013443A (en) | 2007-08-08 |
CN100458795C (en) | 2009-02-04 |
Family ID: 38700955
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100792674A (Expired - Fee Related) CN100458795C (en) | 2007-02-13 | 2007-02-13 | Intelligent word input method and input method system and updating method thereof |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN100458795C (en) |
WO (1) | WO2008098507A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101895631A (en) * | 2010-07-09 | 2010-11-24 | 深圳市五巨科技有限公司 | Method, device and system for intelligently switching input method by mobile terminal |
CN102508554A (en) * | 2011-10-02 | 2012-06-20 | 上海量明科技发展有限公司 | Input method with communication association, personal repertoire and system |
CN102495679A (en) * | 2011-12-01 | 2012-06-13 | 上海量明科技发展有限公司 | Composite spelling input method, word bank and system thereof |
CN104281274A (en) * | 2014-09-03 | 2015-01-14 | 深圳市金立通信设备有限公司 | Input method |
CN107122060A (en) * | 2017-03-15 | 2017-09-01 | 韦柳志 | A kind of method that candidate item is handled in input method |
CN109213988B (en) * | 2017-06-29 | 2022-06-21 | 武汉斗鱼网络科技有限公司 | Barrage theme extraction method, medium, equipment and system based on N-gram model |
CN112199031B (en) * | 2020-10-15 | 2022-08-05 | 科大讯飞股份有限公司 | Input method, device, equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1226717C (en) * | 2000-08-30 | 2005-11-09 | 国际商业机器公司 | Automatic new term fetch method and system |
US7478033B2 (en) * | 2004-03-16 | 2009-01-13 | Google Inc. | Systems and methods for translating Chinese pinyin to Chinese characters |
CN100401301C (en) * | 2006-05-30 | 2008-07-09 | 南京大学 | Body learning based intelligent subject-type network reptile system configuration method |
- 2007-02-13: CN CNB2007100792674A, patent CN100458795C (en), not active, Expired - Fee Related
- 2008-02-03: WO PCT/CN2008/070270, patent WO2008098507A1 (en), active, Application Filing
Cited By (60)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101556596B (en) * | 2007-08-31 | 2012-04-18 | 北京搜狗科技发展有限公司 | Input method system and intelligent word making method |
CN101661463B (en) * | 2009-09-18 | 2011-04-06 | 杨盛 | Automatic collating method in character input process |
CN102193639A (en) * | 2010-03-04 | 2011-09-21 | 阿里巴巴集团控股有限公司 | Method and device of statement generation |
CN101882128A (en) * | 2010-06-11 | 2010-11-10 | 宇龙计算机通信科技(深圳)有限公司 | Method for generating commonly used terms of information and mobile terminal |
CN102402298A (en) * | 2010-09-16 | 2012-04-04 | 腾讯科技(深圳)有限公司 | Pinyin input method and user word adding method and system of same |
CN102455786A (en) * | 2010-10-25 | 2012-05-16 | 三星电子(中国)研发中心 | System and method for optimizing Chinese sentence input method |
CN102455786B (en) * | 2010-10-25 | 2014-09-03 | 三星电子(中国)研发中心 | System and method for optimizing Chinese sentence input method |
CN102541278A (en) * | 2010-12-25 | 2012-07-04 | 上海量明科技发展有限公司 | Method and system for character selection in word input interface |
CN102567365A (en) * | 2010-12-26 | 2012-07-11 | 上海量明科技发展有限公司 | Input method and input system based on labeling specific to a keyword |
CN102567365B (en) * | 2010-12-26 | 2016-07-06 | 上海量明科技发展有限公司 | A kind of it is directed to input method and the system that key word is labeled |
CN102566775A (en) * | 2010-12-31 | 2012-07-11 | 上海量明科技发展有限公司 | Input method and system for generating character interval |
CN103473036A (en) * | 2012-06-08 | 2013-12-25 | 深圳市世纪光速信息技术有限公司 | Input method skin push method and system |
CN103473036B (en) * | 2012-06-08 | 2018-04-27 | 深圳市世纪光速信息技术有限公司 | A kind of input method skin method for pushing and system |
CN102945086A (en) * | 2012-11-22 | 2013-02-27 | 黑龙江大学 | Super input action capture record system and capture record method |
CN102945086B (en) * | 2012-11-22 | 2015-09-09 | 黑龙江大学 | Super input behavior is grabbed recording system and is grabbed recording method |
CN103024159B (en) * | 2012-11-28 | 2015-01-21 | 东莞宇龙通信科技有限公司 | Information generation method and information generation system |
CN103024159A (en) * | 2012-11-28 | 2013-04-03 | 东莞宇龙通信科技有限公司 | Information generation method and information generation system |
CN103869999B (en) * | 2012-12-11 | 2018-10-16 | 百度国际科技(深圳)有限公司 | The method and device that candidate item caused by input method is ranked up |
CN103869999A (en) * | 2012-12-11 | 2014-06-18 | 百度国际科技(深圳)有限公司 | Method and device for sorting candidate items generated by input method |
CN103076892A (en) * | 2012-12-31 | 2013-05-01 | 百度在线网络技术(北京)有限公司 | Method and equipment for providing input candidate items corresponding to input character string |
US20150293972A1 (en) * | 2012-12-31 | 2015-10-15 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device used for providing input candidate items corresponding to an input character string |
CN103064967B (en) * | 2012-12-31 | 2018-10-12 | 百度在线网络技术(北京)有限公司 | A kind of method and apparatus for establishing user's binary crelation library |
CN103064967A (en) * | 2012-12-31 | 2013-04-24 | 百度在线网络技术(北京)有限公司 | Method and device used for establishing user binary relation bases |
CN103076892B (en) * | 2012-12-31 | 2016-09-28 | 百度在线网络技术(北京)有限公司 | A kind of method and apparatus of the input candidate item for providing corresponding to input character string |
CN103929448A (en) * | 2013-01-14 | 2014-07-16 | 百度国际科技(深圳)有限公司 | Method, system and device for providing cell word stock in cloud server |
CN103929448B (en) * | 2013-01-14 | 2018-06-05 | 百度国际科技(深圳)有限公司 | Server provides the method, system and device of cell dictionary beyond the clouds |
CN105095191A (en) * | 2014-04-22 | 2015-11-25 | 富士通株式会社 | Method and device for assisted translation based on multi-word units |
CN103927299A (en) * | 2014-04-25 | 2014-07-16 | 百度在线网络技术(北京)有限公司 | Method for providing candidate sentences in input method and method and device for recommending input content |
CN104360759B (en) * | 2014-11-21 | 2017-03-08 | 百度在线网络技术(北京)有限公司 | Candidate word sort method, device and character input method, equipment |
CN104360759A (en) * | 2014-11-21 | 2015-02-18 | 百度在线网络技术(北京)有限公司 | Candidate character sequencing method and device as well as character input method and equipment |
CN106445177A (en) * | 2015-08-06 | 2017-02-22 | 阿尔派株式会社 | Character input device and character input method |
CN106445177B (en) * | 2015-08-06 | 2020-06-30 | 阿尔派株式会社 | Character input device and character input method |
CN105607753B (en) * | 2015-12-15 | 2018-03-30 | 上海嵩恒网络科技有限公司 | The long sentence input method and long sentence input system of a kind of five |
CN105607753A (en) * | 2015-12-15 | 2016-05-25 | 上海嵩恒网络科技有限公司 | Long sentence input method and long sentence input system for five strokes |
CN107340881A (en) * | 2016-05-03 | 2017-11-10 | 北京搜狗科技发展有限公司 | A kind of input method and electronic equipment |
CN107340881B (en) * | 2016-05-03 | 2021-11-30 | 北京搜狗科技发展有限公司 | Input method and electronic equipment |
CN107422872A (en) * | 2016-05-24 | 2017-12-01 | 北京搜狗科技发展有限公司 | A kind of input method, device and the device for input |
CN107422872B (en) * | 2016-05-24 | 2021-11-30 | 北京搜狗科技发展有限公司 | Input method, input device and input device |
CN107688398A (en) * | 2016-08-03 | 2018-02-13 | 中国科学院计算技术研究所 | Determine the method and apparatus and input reminding method and device of candidate's input |
CN107688398B (en) * | 2016-08-03 | 2019-09-17 | 中国科学院计算技术研究所 | It determines the method and apparatus of candidate input and inputs reminding method and device |
CN108073293B (en) * | 2016-11-11 | 2022-01-14 | 北京搜狗科技发展有限公司 | Method and device for determining target phrase |
CN108073293A (en) * | 2016-11-11 | 2018-05-25 | 北京搜狗科技发展有限公司 | A kind of definite method and apparatus of target phrase |
CN108073292A (en) * | 2016-11-11 | 2018-05-25 | 北京搜狗科技发展有限公司 | A kind of intelligent word method and apparatus, a kind of device for intelligent word |
CN106557178A (en) * | 2016-11-29 | 2017-04-05 | 百度国际科技(深圳)有限公司 | For updating the method and device of input method entry |
CN106557178B (en) * | 2016-11-29 | 2021-03-09 | 百度国际科技(深圳)有限公司 | Method and device for updating entries of input method |
CN108241438B (en) * | 2016-12-23 | 2022-02-25 | 北京搜狗科技发展有限公司 | Input method, input device and input device |
CN108241438A (en) * | 2016-12-23 | 2018-07-03 | 北京搜狗科技发展有限公司 | A kind of input method, device and the device for input |
CN108628906A (en) * | 2017-03-24 | 2018-10-09 | 北京京东尚科信息技术有限公司 | Short text template method for digging, device, electronic equipment and readable storage medium storing program for executing |
CN108803890B (en) * | 2017-04-28 | 2024-02-06 | 北京搜狗科技发展有限公司 | Input method, input device and input device |
CN108803890A (en) * | 2017-04-28 | 2018-11-13 | 北京搜狗科技发展有限公司 | A kind of input method, input unit and the device for input |
CN109144284A (en) * | 2017-06-15 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | information display method and device |
CN109144284B (en) * | 2017-06-15 | 2022-07-15 | 百度在线网络技术(北京)有限公司 | Information display method and device |
CN109426358A (en) * | 2017-09-01 | 2019-03-05 | 百度在线网络技术(北京)有限公司 | Data inputting method and device |
CN109542243A (en) * | 2017-09-21 | 2019-03-29 | 北京搜狗科技发展有限公司 | Phrase composing method and device, for the device of group word |
CN109917927A (en) * | 2017-12-13 | 2019-06-21 | 北京搜狗科技发展有限公司 | A kind of candidate item determines method and apparatus |
CN109961791B (en) * | 2017-12-22 | 2021-10-22 | 北京搜狗科技发展有限公司 | Voice information processing method and device and electronic equipment |
CN109961791A (en) * | 2017-12-22 | 2019-07-02 | 北京搜狗科技发展有限公司 | A kind of voice information processing method, device and electronic equipment |
CN110781288A (en) * | 2019-10-30 | 2020-02-11 | 安阳师范学院 | Method and device for composing words by Chinese characters |
CN111913591A (en) * | 2020-06-23 | 2020-11-10 | 杭州电子科技大学 | Reply phrase generation method, pinyin input method and intelligent terminal |
CN111913591B (en) * | 2020-06-23 | 2023-10-20 | 杭州电子科技大学 | Reply phrase generation method, pinyin input method and intelligent terminal |
Also Published As
Publication number | Publication date |
---|---|
CN100458795C (en) | 2009-02-04 |
WO2008098507A1 (en) | 2008-08-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20090204 |
CF01 | Termination of patent right due to non-payment of annual fee |