CN103870000A - Method and device for sorting candidate items generated by input method

Method and device for sorting candidate items generated by an input method

Info

Publication number
CN103870000A
Authority
CN
China
Prior art keywords
candidate item
user
user type
input message
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210531929.8A
Other languages
Chinese (zh)
Other versions
CN103870000B (en)
Inventor
吴先超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu International Technology Shenzhen Co Ltd
Original Assignee
Baidu International Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu International Technology Shenzhen Co Ltd filed Critical Baidu International Technology Shenzhen Co Ltd
Priority to CN201210531929.8A priority Critical patent/CN103870000B/en
Publication of CN103870000A publication Critical patent/CN103870000A/en
Application granted granted Critical
Publication of CN103870000B publication Critical patent/CN103870000B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a method and a device for sorting candidate items generated by an input method. The method includes: receiving, through the input method, the current input information of a current user; obtaining, according to G pre-established language models each associated with a different user type, the user type to which each candidate item in the candidate item set of the current user's current input information belongs; sorting the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the current user's user type set; and displaying the sorted candidate items. In this way, users of different types who enter the same characters are presented with differently ordered candidate items, which reduces the time users spend selecting candidate items and improves the user experience.

Description

Method and device for sorting candidate items generated by an input method
Technical field
The present invention relates to the field of input methods, and in particular to a method and a device for sorting candidate items generated by an input method.
Background Art
An input method is a coding method used to enter various symbols into a computer or other device (such as a mobile phone).
When typing with an input method, the user generally sends characters to the device, obtains the candidate items corresponding to those characters, and selects the appropriate candidate item to complete the input. For characters with the same pronunciation, the order of the candidate items in the obtained candidate item set is usually identical for everyone. For example, for the kana "かがく" (kagaku), the corresponding Japanese words include numerous candidates such as "価格" (price), "科学" (science) and "化学" (chemistry); the candidates pushed to every user are essentially the same, or are sorted only by the frequency with which each candidate occurs in a large-scale corpus.
However, the present inventors found in long-term research and development that users of different types have different ordering requirements for the candidate items corresponding to the same characters. If candidate items in the same order are pushed to all users, most users waste a large amount of time selecting the candidate item they need, which also degrades the user experience.
Summary of the invention
The main technical problem solved by the present invention is to provide a method and a device for sorting candidate items, so that when users of different types enter the same characters, candidate items in different orders are pushed to them, reducing the time users spend selecting candidates and improving the user experience.
To solve the above technical problem, the technical solution adopted by the present invention is to provide a method for sorting candidate items generated by an input method, comprising: receiving, through the input method, the current input information of a current user; obtaining, according to G pre-established language models each associated with a different user type, the user type to which each candidate item in the candidate item set of the current user's current input information belongs; sorting the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user; and displaying the sorted candidate items.
Before the step of obtaining, according to the G pre-established language models associated with user types, the user type of each candidate item in the candidate item set of the current user's current input information, the method further comprises: classifying the historical input information of multiple users with a text classification technique to obtain G different user types and G classes of corpora, each class associated with one user type; and training, from the G classes of user-type-related corpora, the G language models associated with the user types, each model trained on the corpus of its own user type.
Before the step of sorting the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, the method further comprises: obtaining the historical input information of the current user; and classifying the current user, based on the current user's historical input information and the G pre-established language models associated with user types, to obtain the user type set to which the current user belongs.
Before the step of sorting the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, the method further comprises: obtaining the historical input information of multiple users, the multiple users belonging to the G different user types; selecting part of the historical input information from the obtained historical input information of the multiple users; annotating the selected historical input information to obtain annotated corpora of the multiple users; training, with a supervised machine learning method, a user classifier associated with the user types from the annotated corpora and the G different user types, each according to its own user type; and classifying the current user, based on the current user's historical input information and the user classifier associated with the user types, to obtain the user type set to which the current user belongs.
The historical input information includes any one or more of: historical input information in input method applications, historical input information in instant messaging tools, and historical input information on social networking sites.
The step of sorting the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user comprises: obtaining, according to that correlation, a weight for each candidate item in the candidate item set of the current user's current input information; and sorting the candidate items in the candidate item set according to the magnitude of their weights.
The step of obtaining, according to the correlation between the user type of each candidate item and the user types in the current user's user type set, the weight of each candidate item in the candidate item set of the current user's current input information comprises:
obtaining, for m users u_1, u_2, ..., u_m who have entered the same current input information as the current user, the numbers of times s_1, s_2, ..., s_m that they selected the same candidate item c_i, the m users belonging to the G different user types;
obtaining, within the G different user types, the weight weight(c_i, g) of candidate item c_i in user type g, namely:
weight(c_i, g) = P_g(c_i) / Σ_{g∈G} P_g(c_i)
where P_g(c_i) denotes the probability of candidate item c_i under the language model corresponding to user type g;
obtaining the weight weight(u_m, g) with which user u_m belongs to user type g, namely:
weight(u_m, g) = P_g(log of u_m) / Σ_{g∈G} P_g(log of u_m)
where P_g(log of u_m) denotes the probability, under the language model corresponding to user type g, of the input log text of user u_m;
obtaining, from the weights weight(c_i, g) and weight(u_m, g) and the user type set g_m of user u_m, the weight weight_k(c_i, u_m) of each candidate item in the candidate item set of the current user's current input information, namely:
weight_k(c_i, u_m) = Σ_{g∈g_m} weight(c_i, g) × s_m × weight(u_m, g) / Σ_{g∈g_m} weight(u_m, g) − cost_k(c_i, u_m)
where k denotes the k-th iteration, cost_k(c_i, u_m) is the cost of candidate item c_i for user u_m, and cost_{k+1}(c_i, u_m) = −weight_k(c_i, u_m).
After the step of sorting the candidate items in the candidate item set of the current user's current input information according to the magnitude of their weights, the method further comprises: judging, according to the magnitude of the weights of the candidate items in the candidate item set, whether the candidate item set contains a high-frequency hot word or a new word.
The step of judging, according to the magnitude of the weights of the candidate items, whether the candidate item set contains a high-frequency hot word comprises: if the weights produced for a candidate item by a predetermined number of consecutive iterations are all greater than a preset hot-word threshold, determining that the candidate item is a high-frequency hot word.
The step of judging, according to the magnitude of the weights of the candidate items, whether the candidate item set contains a new word comprises: if the change between the weight produced for a candidate item by the current iteration and the weight produced by the previous iteration is greater than a preset new-word threshold, determining that the candidate item is a new word.
After the step of judging whether the candidate item set contains a high-frequency hot word or a new word, the method further comprises: if the candidate item set contains a high-frequency hot word or a new word, pushing a link corresponding to the high-frequency hot word or new word to the users of the user type to which that word belongs.
The step of displaying the sorted candidate items comprises: displaying the sorted candidate items together with marks indicating which candidate items are new words or hot words.
After the step of displaying the sorted candidate items, the method further comprises: sorting the candidate items, according to a switching instruction of the current user, by the sum of the selection counts s_1, s_2, ..., s_m; and displaying the candidate items sorted by that sum together with the sum.
The language model is an n-gram language model or an n-pos language model.
The step of sorting the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user comprises: sorting the candidate items in descending order of that correlation.
To solve the above technical problem, another technical solution adopted by the present invention is to provide a device for sorting candidate items generated by an input method, comprising a receiving module, a first acquisition module, a first sorting module and a display module, wherein: the receiving module is configured to receive, through the input method, the current input information of the current user and send it to the first acquisition module; the first acquisition module is configured to obtain, according to G pre-established language models each associated with a different user type, the user type of each candidate item in the candidate item set of the current user's current input information, and to send those user types to the first sorting module, G being a natural number; the first sorting module is configured to sort the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, and to send the sorted candidate items to the display module; and the display module is configured to display the sorted candidate items received from the first sorting module.
The device further comprises a web corpus module and a first training module, wherein: the web corpus module is configured to classify the historical input information of multiple users with a text classification technique, to obtain G different user types and G classes of corpora each associated with one user type, and to send them to the first training module; and the first training module is configured to train, from the G classes of user-type-related corpora, the G language models associated with the user types, each model being trained on the corpus of its own user type.
The device further comprises a historical input information module and a second acquisition module, wherein: the historical input information module is configured to obtain the current user's historical input information and output it to the second acquisition module; and the second acquisition module is configured to classify the current user, based on the current user's historical input information and the G pre-established language models associated with user types, to obtain the user type set to which the current user belongs.
The device further comprises a third acquisition module, a selection module, an annotation module, a second training module and a classification module, wherein: the third acquisition module is configured to obtain the historical input information of multiple users and send it to the selection module; the selection module is configured to select part of the obtained historical input information and send the selected part to the annotation module; the annotation module is configured to annotate the selected historical input information, obtain annotated corpora of the multiple users, and send them to the second training module; the second training module is configured to train, with a supervised machine learning method, a user classifier associated with the user types from the annotated corpora and the G different user types, each according to its own user type; and the classification module is configured to classify the current user, based on the current user's historical input information and the user classifier obtained by the second training module, to obtain the user type set to which the current user belongs.
The historical input information includes any one or more of: historical input information in input method applications, historical input information in instant messaging tools, and historical input information on social networking sites.
The first sorting module comprises a weight acquiring unit and a sorting unit, wherein: the weight acquiring unit is configured to obtain, according to the correlation between the user type of each candidate item and the user types in the current user's user type set, the weight of each candidate item in the candidate item set of the current user's current input information, and to send the weights to the sorting unit; and the sorting unit is configured to sort the candidate items in that set according to the magnitude of the weights received from the weight acquiring unit.
The weight acquiring unit comprises a first acquiring subunit, a second acquiring subunit, a third acquiring subunit and a fourth acquiring subunit, wherein: the first acquiring subunit is configured to obtain, for m users u_1, u_2, ..., u_m who have entered the same current input information as the current user, the numbers of times s_1, s_2, ..., s_m that they selected the same candidate item c_i, the m users belonging to the G different user types; the second acquiring subunit is configured to obtain, within the G different user types, the weight weight(c_i, g) of candidate item c_i in user type g, namely:
weight(c_i, g) = P_g(c_i) / Σ_{g∈G} P_g(c_i)
where P_g(c_i) denotes the probability of candidate item c_i under the language model corresponding to user type g;
the third acquiring subunit is configured to obtain the weight weight(u_m, g) with which user u_m belongs to user type g, namely:
weight(u_m, g) = P_g(log of u_m) / Σ_{g∈G} P_g(log of u_m)
where P_g(log of u_m) denotes the probability, under the language model corresponding to user type g, of the input log text of user u_m;
the fourth acquiring subunit is configured to obtain, from the weights weight(c_i, g) and weight(u_m, g) and the user type set g_m of user u_m, the weight weight_k(c_i, u_m) of each candidate item in the candidate item set of the current user's current input information, namely:
weight_k(c_i, u_m) = Σ_{g∈g_m} weight(c_i, g) × s_m × weight(u_m, g) / Σ_{g∈g_m} weight(u_m, g) − cost_k(c_i, u_m)
where k denotes the k-th iteration, cost_k(c_i, u_m) is the cost of candidate item c_i for user u_m, and cost_{k+1}(c_i, u_m) = −weight_k(c_i, u_m).
The device further comprises a judging module configured to determine, according to the magnitude of the weights of the candidate items in the candidate item set, whether the candidate item set contains a high-frequency hot word or a new word.
The judging module is specifically configured to determine that a candidate item is a high-frequency hot word when the weights produced for it by a predetermined number of consecutive iterations are all greater than a preset hot-word threshold; or to determine that a candidate item is a new word when the change between the weight produced for it by the current iteration and the weight produced by the previous iteration is greater than a preset new-word threshold.
The device further comprises a pushing module configured to push a link corresponding to the high-frequency hot word or new word to the users of the user type to which that word belongs.
The display module is specifically configured to display the sorted candidate items together with marks indicating which candidate items are new words or hot words.
The device further comprises a second sorting module configured to sort the candidate items, according to a switching instruction of the current user, by the sum of the selection counts s_1, s_2, ..., s_m, and to send the candidate items sorted by that sum to the display module; the display module is specifically configured to display the candidate items sorted by the sum together with the sum.
The language model is an n-gram language model or an n-pos language model.
The first sorting module is specifically configured to sort the candidate items of the current user's current input information in descending order of the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, and to send the sorted candidate items to the display module.
The beneficial effects of the present invention are as follows. Unlike the prior art, the present invention obtains the user type set of the current user and the user type of each candidate item in the candidate item set of the current user's current input information, and sorts the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types of the current user. Because the user types are derived from users' historical input information, and different types of users pay attention to different candidate items, the user type is taken into account when sorting the candidate items. In this way, differently ordered candidate items are pushed to different types of users, which reduces the time users spend selecting candidates and improves the user experience.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the method for sorting candidate items generated by an input method according to the present invention;
Fig. 2 is a flowchart of establishing the G language models associated with user types in an embodiment of the method for sorting candidate items generated by an input method according to the present invention;
Fig. 3 is a flowchart of obtaining the user type set of the current user in an embodiment of the method for sorting candidate items generated by an input method according to the present invention;
Fig. 4 is a flowchart of obtaining the user type set of the current user in another embodiment of the method for sorting candidate items generated by an input method according to the present invention;
Fig. 5 is a flowchart of sorting the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, in an embodiment of the method for sorting candidate items generated by an input method according to the present invention;
Fig. 6 is a flowchart of determining high-frequency hot words in an embodiment of the method for sorting candidate items generated by an input method according to the present invention;
Fig. 7 is a schematic diagram of a display interface of an input method;
Fig. 8 is another schematic diagram of a display interface of an input method;
Fig. 9 is a structural schematic diagram of an embodiment of the device for sorting candidate items generated by an input method according to the present invention;
Fig. 10 is a structural schematic diagram of another embodiment of the device for sorting candidate items generated by an input method according to the present invention;
Fig. 11 is a structural schematic diagram of yet another embodiment of the device for sorting candidate items generated by an input method according to the present invention;
Fig. 12 is a structural schematic diagram of the first sorting module of the present invention;
Fig. 13 is a structural schematic diagram of the weight acquiring unit of the present invention.
Embodiment
The present invention is described in detail below with reference to the drawings and embodiments.
Referring to Fig. 1, an embodiment of the method for sorting candidate items generated by an input method according to the present invention comprises:
Step S101: receiving, through the input method, the current input information of the current user;
The input method receives the current input information of the current user, for example the pinyin characters, kana, or English words or sentences that the current user is currently entering.
Step S102: obtaining, according to the G pre-established language models each associated with a different user type, the user type of each candidate item in the candidate item set of the current user's current input information;
The purpose of a language model (Language Model, LM) is to build a probability distribution that describes the occurrence of a given word sequence in a language. Using a language model, one can determine which word sequence is more likely, or, given several words, predict the word most likely to occur next. Take pinyin-to-character conversion as an example: for the input pinyin string "nixianzaiganshenme", the output can take several forms, for example "what are you doing now" or "what are you rushing to in Xi'an again"; so which is the correct conversion? Using a language model, we know that the probability of the former is greater than that of the latter, so converting to the former is usually more reasonable. As another example, in machine translation, a given Chinese sentence could be translated as "Li Ming is watching TV at home", "Li Ming at home is watching TV", etc.; again, according to the language model, the probability of the former is greater than that of the latter, so it is more reasonable to translate it as the former.
So how is the probability of a sentence computed? Suppose the given sentence (word sequence) is:
S = W_1, W_2, ..., W_k
Then its probability can be expressed as:
P(S) = P(W_1, W_2, ..., W_k) = P(W_1) P(W_2|W_1) ... P(W_k|W_1, W_2, ..., W_{k-1})
Because the above formula has too many parameters, approximate computation methods are needed. Common methods include the n-gram model, decision trees, maximum entropy models, maximum entropy Markov models, conditional random fields, neural networks, etc.
With the G language models each associated with a different user type, the probability that a sentence, word, phrase or group of words belongs to each of the G user types can be determined; the larger the probability, the more likely the sentence, word, phrase or group of words belongs to that user type.
After the user enters information, multiple corresponding candidate items are generated, and these candidate items form the candidate item set. With the G language models associated with user types, the user type of each candidate item can be determined.
In embodiments of the present invention, the language model may be an n-gram language model or an n-pos language model.
In an n-gram language model, the probability of the current word depends only on the n-1 words to its left. When n is 1, 2 or 3, the n-gram model is called a unigram, bigram or trigram language model respectively. The larger n is, the more accurate the language model, but the more complex the computation and the larger the computational cost. The bigram is the most commonly used, followed by the unigram and the trigram; n greater than or equal to 4 is rarely used. When an n-gram language model is applied to Chinese web pages, a Chinese n-gram language model is obtained; when it is applied to English web pages, an English n-gram language model is obtained. For example, when n is 2, the probability of the current word depends only on its previous word. For example, for the sentence:
S = Chairman Zhang San delivered a speech of four preferential important instructions.
Under a bigram language model, the probability of this sentence (a measure of its correctness), with the word sequence written as "Zhang San / chairman / delivered / 了 / four / preferential / important / instructions / 的 / speech / 。", is:
P(S) = P(Zhang San|<s>) P(chairman|Zhang San) P(delivered|chairman) P(了|delivered) P(four|了) P(preferential|four) P(important|preferential) P(instructions|important) P(的|instructions) P(speech|的) P(。|speech) P(</s>|。)
Here <s> and </s> are two manually constructed tokens that mark the beginning and the end of the sentence respectively (their purpose is to model the probability of "Zhang San" being the first word of the sentence and of the full stop "。" being the last word).
Under a trigram language model, the probability of this sentence is:
P(S) = P(Zhang San|<s>) P(chairman|<s>, Zhang San) P(delivered|Zhang San, chairman) P(了|chairman, delivered) P(four|delivered, 了) P(preferential|了, four) P(important|four, preferential) P(instructions|preferential, important) P(的|important, instructions) P(speech|instructions, 的) P(。|的, speech) P(</s>|speech, 。)
In the bigram model, one of the probabilities is computed as:
P(chairman|Zhang San) = count(Zhang San chairman) / count(Zhang San)
where the numerator is the frequency with which "Zhang San chairman" occurs in the corpus (for example, a large-scale web corpus) and the denominator is the frequency with which "Zhang San" occurs in the corpus.
Correspondingly, in the trigram model, one of the probabilities is computed as:
P(delivered|Zhang San, chairman) = count(Zhang San chairman delivered) / count(Zhang San chairman)
where the numerator is the frequency with which "Zhang San chairman delivered" occurs in the corpus and the denominator is the frequency with which "Zhang San chairman" occurs in the corpus.
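As a rough illustration of how such count-based bigram probabilities can be estimated and multiplied together, the following Python sketch computes P(S) for a whitespace-tokenized sentence over a toy corpus; the corpus, the add-alpha smoothing and the English tokens are simplifying assumptions for the example, not part of the patent.

```python
from collections import Counter

def train_bigram(corpus_sentences):
    """Count unigrams and bigrams over a list of pre-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus_sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_probability(tokens, unigrams, bigrams, alpha=1.0):
    """P(S) = product of P(w_i | w_{i-1}) with simple add-alpha smoothing."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        prob *= (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

if __name__ == "__main__":
    # Toy corpus standing in for a large-scale web corpus (hypothetical data).
    corpus = [
        "Zhang San chairman delivered a speech".split(),
        "Zhang San chairman listened to instructions".split(),
    ]
    uni, bi = train_bigram(corpus)
    print(sentence_probability("Zhang San chairman delivered a speech".split(), uni, bi))
```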
Under an n-pos model, suppose we have a sentence S = w_1 w_2 ... w_K containing K words. P(S) can be written as:
P(S) = Π_{i=1}^{K} P(w_i|c_i) P(c_i|c_{i-1})
Unlike the n-gram model (for example the bigram), which characterizes P(S) directly with the conditional probability P(w_i|w_{i-1}) of w_{i-1} and w_i, the idea of the Hidden Markov Model (HMM) is introduced here, with the part of speech c_i acting as a latent variable. Two kinds of probabilities are used in this formula: P(w_i|c_i) is the "generation probability" (also called the emission probability) from part of speech c_i to word w_i; P(c_i|c_{i-1}) is the part-of-speech bigram model, i.e. the probability that part of speech c_i follows part of speech c_{i-1}.
In a part-of-speech n-gram model, the probability that a part of speech c_i occurs depends on the parts of speech of the previous n-1 words, that is:
P(c_i = c | history) = P(c_i = c | c_{i-n+1}, ..., c_{i-1})
The n-pos model is in fact a word-based approximation of the n-gram model. Suppose we have 10,000 words and 10 parts of speech; for a word bigram model we would need to train 10000×10000 parameters, whereas in the n-pos model we only need to train P(w_i|c_i) and P(c_i|c_{i-1}), with 10000×10 parameters for the former and 10×10 for the latter. The number of parameters to train is thus greatly reduced (from 10000×10000 to 10000×10 + 10×10).
Note that as the number of parts of speech increases, the n-pos model approaches the n-gram model more and more closely. In the extreme where every word has its own part of speech, the n-pos model becomes exactly the n-gram model; in the other extreme, where there is only one part of speech, the n-pos model degenerates into a unigram model.
Therefore, the advantage of the n-pos language model is that it needs far less training data than the n-gram language model and its parameter space is much smaller; the disadvantage is that the probability distribution of a word depends on its part of speech rather than on the word itself, and dividing words by part of speech is obviously less fine-grained than distinguishing the words themselves. For this reason, in practical applications (such as speech recognition) this class of language model generally finds it hard to reach the accuracy of the n-gram language model.
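To make the factorization P(S) = Π P(w_i|c_i) P(c_i|c_{i-1}) concrete, here is a hedged sketch that evaluates it from pre-estimated emission and part-of-speech transition tables; the miniature tables and the tagged sentence are invented placeholders rather than a trained model.

```python
def npos_probability(tagged_sentence, emission, transition):
    """
    tagged_sentence: list of (word, pos) pairs.
    emission[pos][word]   ~ P(word | pos)     (generation/emission probability)
    transition[prev][pos] ~ P(pos | prev_pos) (part-of-speech bigram)
    """
    prob = 1.0
    prev_pos = "<s>"
    for word, pos in tagged_sentence:
        prob *= emission[pos].get(word, 1e-6) * transition[prev_pos].get(pos, 1e-6)
        prev_pos = pos
    return prob

if __name__ == "__main__":
    # Hypothetical miniature model: 2 parts of speech instead of 10, a handful of words.
    emission = {"NOUN": {"chairman": 0.5, "speech": 0.5},
                "VERB": {"delivered": 1.0}}
    transition = {"<s>": {"NOUN": 0.9, "VERB": 0.1},
                  "NOUN": {"VERB": 0.6, "NOUN": 0.4},
                  "VERB": {"NOUN": 0.8, "VERB": 0.2}}
    sent = [("chairman", "NOUN"), ("delivered", "VERB"), ("speech", "NOUN")]
    print(npos_probability(sent, emission, transition))
```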
Step S103: sorting the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user;
If the user type of a candidate item belongs to the user type set of the current user, the correlation between the candidate item's user type and the current user's user types is large; if it does not, the correlation is small. This is a rough judgment. When a more precise judgment of the correlation is needed, the probabilities of the candidate item in the G language models are also considered: a larger probability means a larger correlation, and a smaller probability means a smaller correlation.
A user may belong to more than one user type, possibly to several different ones, so what is obtained is a user type set. The user type set of a user can be obtained in at least two ways: first, the system stores the user type information of the user, obtained by classifying the user with the language models, which determines the user's user type set; second, the user type set of the user is determined with the language models at the time the user enters information.
For example, suppose the user types of the current user are watch enthusiast, literature enthusiast, comic enthusiast and online-shopping enthusiast. If the user type set of a candidate item is {watch enthusiast, online-shopping enthusiast}, the correlation between that candidate item's user types and the current user's user types is large; if the user type set of a candidate item is {hacker, food enthusiast}, the correlation between that candidate item's user types and the current user's user types is small.
The candidate items can be sorted in descending order of correlation, in ascending order, or in some other order. In practice, sorting in descending order of correlation is preferred, because placing the candidate items most correlated with the user's user types first means the current user does not need to spend extra time checking the candidates one by one, which saves the time spent selecting a candidate item.
Step S104: displaying the sorted candidate items;
After the candidate items are sorted, the sorted candidate items are displayed to the user so that the user can select the candidate item he or she needs.
As described in the above embodiment, the method for sorting candidate items generated by an input method obtains the user type set of the current user and the user type of each candidate item in the candidate item set of the current user's current input information, and sorts the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types of the current user. Because the user types are derived from users' historical input information, and different types of users pay attention to different candidate items, the user type is taken into account when sorting the candidate items. In this way, differently ordered candidate items are pushed to different types of users, which reduces the time users spend selecting candidates and improves the user experience.
In embodiments of the method for sorting candidate items generated by an input method according to the present invention, the G language models associated with the user types usually need to be established in advance. They are used to obtain the user type of each candidate item in the candidate item set of the current user's current input information, and to obtain the user type set of the current user.
Referring to Fig. 2, in another embodiment of the method for sorting candidate items generated by an input method according to the present invention, the step of establishing the G language models associated with user types comprises:
Step S201: classifying the historical input information of multiple users with a text classification technique to obtain G different user types and G classes of corpora, each class associated with one user type;
Text classification is the process of letting a computer automatically determine the category of a text according to its content under a given classification scheme. In this embodiment, a text classification technique is used to divide the obtained historical input information of multiple users into several groups, each group corresponding to one user type, so that each group represents a different user type, and the corpora associated with the different user types are collected.
A user's historical input information may include any one or more of: historical input information in input method applications, historical input information in instant messaging tools, and historical input information on social networking sites.
For example, while using a Japanese input method product, the user's historical input information is uploaded to the server; on an instant messaging tool such as Twitter, the user's historical input is collected in chronological order; likewise, on a social networking site such as Facebook, the user's historical input is collected in chronological order.
The user-type-related corpora here are obtained from each user's historical input information through comparison and merging. One user's historical input information may contain corpora of several user types; by comparing it with the historical input information of other users, corpora of similar or identical user types are pooled together to form the corpus associated with a particular user type.
For example, if one user's historical input information contains a large number of sentences about "Rolex" and another user's historical input information contains a large number of sentences about "Vacheron Constantin", the two users can be merged into the user type "watch enthusiast", and their historical input information can be pooled together as the corpus associated with "watch enthusiast".
By classifying the historical input information of multiple users with a text classification technique, G different user types and G classes of corpora, each associated with one user type, can thus be obtained.
Step S202: training, from the G classes of user-type-related corpora, the G language models associated with the user types, each model trained on the corpus of its own user type;
From each class of user-type-related corpus, a language model associated with that user type can be trained, for example a language model for hackers, a language model for comic enthusiasts, a language model for watch enthusiasts, a language model for online-shopping enthusiasts, and so on.
With the G language models associated with the user types, the user type of each candidate item in the candidate item set of the current user's current input information can be obtained. For example, suppose there are the following four language models: one for hackers, one for comic enthusiasts, one for watch enthusiasts and one for online-shopping enthusiasts, and the probabilities of a certain candidate item of the user's current input information under the four language models are 0.6, 0.4, 0.01 and 0.008 respectively; the user type of that candidate item is then the user type with the largest probability, i.e. hacker.
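As a minimal sketch of this step — assigning each candidate item to the user type whose language model gives it the highest probability — the following Python fragment uses placeholder scoring functions in place of the trained language models; the type names and probabilities are illustrative assumptions taken from the example above, not values prescribed by the patent.

```python
def classify_candidate(candidate, type_language_models):
    """Return the user type whose language model assigns the candidate the highest probability."""
    scores = {user_type: lm(candidate) for user_type, lm in type_language_models.items()}
    best_type = max(scores, key=scores.get)
    return best_type, scores

if __name__ == "__main__":
    # Hypothetical per-type models; real ones would be the trained n-gram/n-pos models.
    models = {
        "hacker":                    lambda cand: 0.6,
        "comic enthusiast":          lambda cand: 0.4,
        "watch enthusiast":          lambda cand: 0.01,
        "online-shopping enthusiast": lambda cand: 0.008,
    }
    print(classify_candidate("some candidate", models))  # -> ("hacker", {...})
```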
On the other hand, with the G language models associated with the user types, the user type set of the current user can also be obtained.
Referring to Fig. 3, in another embodiment of the method for sorting candidate items generated by an input method according to the present invention, the step of obtaining the user type set of the current user comprises:
Step S301: obtaining the historical input information of the current user;
The current user's historical input information objectively reflects the information related to the topics the user pays attention to, and one user may pay attention to information related to several types. Moreover, the topics a user pays attention to often change; for example, in one period the current user may pay attention to information about watches and comics, while in another period the current user may pay attention to information about computers and food.
Step S302: classifying the current user, based on the current user's historical input information and the G pre-established language models associated with user types, to obtain the user type set of the current user;
Based on the current user's historical input information, the current user can be classified with the G pre-established language models associated with user types, thereby obtaining the user type set of the current user. The user's historical input information has a corresponding probability in the language model of one or several user types; the larger the probability, the more likely the current user belongs to that user type. Under normal circumstances the user's user types can be determined by the magnitude of the probability: if the probability of the current user's historical input information in the language models of one or several user types is greater than its probability in the language models of the other user types, the current user is determined to belong to that one or those several user types.
Classifying users in this way serves two main purposes: 1. it alleviates the negative impact that a single user's sparse input log history has on the learning algorithm that mines user input behaviour; 2. it automatically identifies and collects the input log information of users of the same type, allowing users of the same user type to "share" the information each of them enters, so as to give users a better input experience.
Referring to Fig. 4, in yet another embodiment of the method for sorting candidate items generated by an input method according to the present invention, the step of obtaining the user type set of the current user comprises:
Step S401: obtaining the historical input information of multiple users;
The historical input information of multiple users is obtained, the multiple users belonging to the G different user types.
Step S402: selecting part of the historical input information from the obtained historical input information of the multiple users;
Step S403: annotating the selected historical input information to obtain annotated corpora of the multiple users;
Annotating the selected historical input information yields more accurate type-related corpora, which makes the classification of users more accurate.
Step S404: training, with a supervised machine learning method, a user classifier associated with the user types from the annotated corpora of the multiple users and the G different user types, each according to its own user type;
Machine learning studies how computers can simulate or implement human learning behaviour in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. It is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications pervade every field of artificial intelligence, and it mainly uses induction and synthesis rather than deduction.
A supervised machine learning method trains a probability model with an annotated corpus and then uses this probability model to automatically classify unknown data (for example, a user's new data).
For instance, suppose we want to build a Chinese word segmentation tool that automatically identifies the "words" in a given Chinese sentence and separates them with spaces.
With a supervised machine learning method, we need a "training corpus" whose form is similar to:
"Zhang San / chairman / delivered / a speech of four preferential important instructions." — tens of thousands of such sentences with the word boundaries marked.
With these sentences, we can use a machine learning method such as a CRF (conditional random field) model to train a model. This model can then take a newly entered sentence such as "Manager Li Si listened to Chairman Zhang San's important instructions." and segment it into words separated by spaces:
"Manager / Li Si / listened to / Chairman / Zhang San / 's / important / instructions / ."
Here, the "supervision" refers precisely to the training corpus.
When training the user classifier associated with the user types, combining the annotated user-type-related corpora with each user's individual corpus makes it possible to train a more accurate user classifier associated with the user types.
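The patent does not prescribe a specific learning algorithm for this classifier, so as a hedged illustration the sketch below trains a tiny hand-rolled bag-of-words Naive Bayes classifier on hypothetical annotated input logs; the token lists and user-type labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_logs):
    """labeled_logs: list of (tokens, user_type). Returns per-type priors, word counts, vocabulary."""
    priors, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, user_type in labeled_logs:
        priors[user_type] += 1
        word_counts[user_type].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab

def classify_user(tokens, priors, word_counts, vocab):
    """Return the user type with the highest log posterior for the user's input log."""
    total = sum(priors.values())
    best_type, best_score = None, float("-inf")
    for user_type, prior in priors.items():
        score = math.log(prior / total)
        denom = sum(word_counts[user_type].values()) + len(vocab)  # add-one smoothing
        for tok in tokens:
            score += math.log((word_counts[user_type][tok] + 1) / denom)
        if score > best_score:
            best_type, best_score = user_type, score
    return best_type

if __name__ == "__main__":
    # Hypothetical annotated corpus: a few tokenized input logs with user-type labels.
    data = [(["rolex", "watch", "dial"], "watch enthusiast"),
            (["exploit", "kernel", "patch"], "hacker"),
            (["manga", "episode", "anime"], "comic enthusiast")]
    priors, counts, vocab = train_naive_bayes(data)
    print(classify_user(["watch", "dial"], priors, counts, vocab))
```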
Step S405: classifying the current user, based on the current user's historical input information and the user classifier associated with the user types, to obtain the user type set of the current user;
Of the two embodiments above for obtaining the current user's user type set, the first yields a rougher user type set, while the second yields a more accurate one. In practice, users may choose either embodiment to obtain the current user's user type set according to their own needs.
On the other hand, referring to Fig. 5, in another embodiment of the method for sorting candidate items generated by an input method according to the present invention, the step of sorting the candidate items of the current user's current input information according to the correlation between the user type of each candidate item and the user types in the obtained user type set of the current user comprises:
Step S501: obtaining, according to the correlation between the user type of each candidate item and the user types in the current user's user type set, the weight of each candidate item in the candidate item set of the current user's current input information;
The correlation between the user type of each candidate item and the user types in the current user's user type set is obtained; then, according to the weight of each user type in the current user's user type set, the weight of each candidate item in the candidate item set of the current user's current input information can be obtained.
For example, suppose the user type of one candidate item is "hacker" and the current user's user type set is {"hacker", "watch enthusiast"}; the correlation between that candidate item's user type and the current user's user types is then 1 (the two overlap). If the weight of "hacker" in the current user's user type set is 0.35, the weight of this candidate item is 0.35. If the user type of another candidate item is "watch enthusiast", the correlation between that candidate item's user type and the current user's user types is also 1; if the weight of "watch enthusiast" in the current user's user type set is 0.2, the weight of that candidate item is 0.2.
The above way of obtaining weights is relatively rough; in practice, more accurate candidate item weights can also be obtained by the following method:
1. Obtain, for m users u_1, u_2, ..., u_m who have entered the same current input information as the current user, the numbers of times s_1, s_2, ..., s_m that they selected the same candidate item c_i, the m users belonging to the G different user types.
2. Obtain, within the G different user types, the weight weight(c_i, g) of candidate item c_i in user type g, namely:
weight(c_i, g) = P_g(c_i) / Σ_{g∈G} P_g(c_i)
where P_g(c_i) denotes the probability of candidate item c_i under the language model corresponding to user type g.
3. Obtain the weight weight(u_m, g) with which user u_m belongs to user type g, namely:
weight(u_m, g) = P_g(log of u_m) / Σ_{g∈G} P_g(log of u_m)
where P_g(log of u_m) denotes the probability, under the language model corresponding to user type g, of the input log text of user u_m.
4. Obtain, from the weights weight(c_i, g) and weight(u_m, g) and the user type set g_m of user u_m, the weight weight_k(c_i, u_m) of each candidate item in the candidate item set of the current user's current input information, namely:
weight_k(c_i, u_m) = Σ_{g∈g_m} weight(c_i, g) × s_m × weight(u_m, g) / Σ_{g∈g_m} weight(u_m, g) − cost_k(c_i, u_m)
where k denotes the k-th iteration, cost_k(c_i, u_m) is the cost of candidate item c_i for user u_m, and cost_{k+1}(c_i, u_m) = −weight_k(c_i, u_m).
It should be noted that obtaining the weight weight(c_i, g) of candidate item c_i in user type g and obtaining the weight weight(u_m, g) with which user u_m belongs to user type g are not strictly ordered; in practice, the user can decide which weight to obtain first according to the actual situation.
With the above formulas, the weights of the candidate items can be continuously updated from users' input log information in an online-learning fashion, so that the updated ordering of the candidate items comes closer to the users' actual needs and improves the users' input experience.
It is worth mentioning that the above weight computation makes use of the historical input information of users of the same user type; it is a technique for sharing user information and data.
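For concreteness, the following Python sketch transcribes the four steps above directly; the per-type probabilities P_g(c_i) and P_g(log of u_m), the selection count s_m and the user type set g_m are placeholder values assumed for the example, and a real system would take them from the trained language models and the input logs.

```python
def candidate_type_weight(P_c, g):
    """weight(c_i, g) = P_g(c_i) / sum over all user types of P_g'(c_i)."""
    return P_c[g] / sum(P_c.values())

def user_type_weight(P_log, g):
    """weight(u_m, g) = P_g(log of u_m) / sum over all user types of P_g'(log of u_m)."""
    return P_log[g] / sum(P_log.values())

def candidate_weight(P_c, P_log, user_types, s_m, cost_k):
    """weight_k(c_i, u_m) per the patent's formula; the next cost is -weight_k."""
    numer = sum(candidate_type_weight(P_c, g) * s_m * user_type_weight(P_log, g)
                for g in user_types)
    denom = sum(user_type_weight(P_log, g) for g in user_types)
    weight_k = numer / denom - cost_k
    next_cost = -weight_k
    return weight_k, next_cost

if __name__ == "__main__":
    # Hypothetical per-type probabilities of one candidate item and of one user's input log.
    P_c = {"hacker": 0.6, "watch enthusiast": 0.1, "comic enthusiast": 0.3}
    P_log = {"hacker": 0.5, "watch enthusiast": 0.4, "comic enthusiast": 0.1}
    g_m = {"hacker", "watch enthusiast"}  # user type set of user u_m
    w, cost = candidate_weight(P_c, P_log, g_m, s_m=7, cost_k=0.0)
    print(w, cost)
```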
Step S502: sorting the candidate items in the candidate item set of the current user's current input information according to the magnitude of their weights;
According to the magnitude of the weights of the candidate items in the candidate item set, the candidate items in the candidate item set of the current user's current input information can be sorted in descending order of weight, in ascending order, or in some other order. The preferred way is of course to sort the candidate items in descending order of weight.
In practice, the candidate item weights obtained in the above embodiment can be used to determine whether the candidate item set of the current input information contains high-frequency hot words or new words.
If the change between the weight produced for a candidate item by the current iteration and the weight produced by the previous iteration is greater than the preset new-word threshold, the candidate item is determined to be a new word.
For example, one can compute the change between the weight weight_k(c_i, u_m) produced for candidate item c_i by the k-th iteration and the weight weight_{k-1}(c_i, u_m) produced by the (k-1)-th iteration; if weight_k(c_i, u_m) − weight_{k-1}(c_i, u_m) > θ, candidate item c_i is a new word. Here θ is the preset new-word threshold, which can be adjusted according to the total number of new words.
For example, suppose we allow at most 1000 new words in total under all kana; we then filter according to this specification and determine the final threshold, i.e. after filtering with this threshold about 1000 new words remain and are pushed to users.
Here, depending on the actual situation and the amount of updated user data, one iteration per week (for example) can be chosen. In this way, "new words of the week" can be released on a weekly basis; similarly, the iteration unit can be set to one month, one quarter, etc., to release "new words of January", "new words of the first quarter", and so on.
If the weights produced for a candidate item by a predetermined number of consecutive iterations are all greater than the preset hot-word threshold, the candidate item is determined to be a high-frequency hot word.
For example, one can compute the weights of candidate item c_i over a consecutive iterations; if weight_{k−a+1}(c_i, u_m) > b, ..., weight_k(c_i, u_m) > b, then candidate item c_i is a high-frequency hot word. Here a and b can be set according to the desired number of high-frequency hot words.
For example, suppose we need to allow at most 2000 hot words in total under all kana; the values of a and b are then defined according to this final hot-word scale. On the other hand, to filter words of everyday frequent use out of the "hot words", we can require that about 80% of the "hot words" come from the "new words", i.e. that there is a conversion process from "new word" to "hot word"; the remaining 20% or so will come from everyday expressions (for example, the daily greeting "お疲れ様です" — "thanks for your hard work"). By choosing the number of iterations, we can define "monthly hot words", "quarterly hot words", "annual hot words", and so on.
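A minimal sketch of the new-word and hot-word tests described above, applied to a candidate item's history of iteration weights, might look as follows; the weight histories and the values of θ, a and b are assumed for illustration.

```python
def is_new_word(weight_history, theta):
    """New word: the weight change between the last two iterations exceeds theta."""
    if len(weight_history) < 2:
        return False
    return weight_history[-1] - weight_history[-2] > theta

def is_hot_word(weight_history, a, b):
    """High-frequency hot word: the weights of the last a consecutive iterations all exceed b."""
    if len(weight_history) < a:
        return False
    return all(w > b for w in weight_history[-a:])

if __name__ == "__main__":
    # Hypothetical weekly iteration weights for two candidate items.
    rising = [0.02, 0.03, 0.05, 0.26]        # sudden jump in the latest iteration
    steady = [0.22, 0.25, 0.24, 0.27, 0.26]  # consistently high weights
    print(is_new_word(rising, theta=0.15))   # True: 0.26 - 0.05 > 0.15
    print(is_hot_word(steady, a=4, b=0.2))   # True: last 4 weights all above 0.2
```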
In fact, hot words and new words change constantly over time. Take the film title "Spider-Man": when the film has just been released, "Spider-Man" may be a new word and remain a new word for some time; after a while, as the film becomes a box-office hit and people type it more and more with the input method, "Spider-Man" may turn into a hot word.
Below, we illustrate with the definite of high frequency heat word:
Referring to Fig. 6, in another embodiment of the method for sorting candidate items produced by an input method according to the present invention, the step of determining high-frequency hot words comprises:
Step S601: judging whether there is a candidate item whose weights produced in a predetermined number of consecutive iterations are all greater than the preset high-frequency hot word threshold;
A high-frequency hot word is a word, phrase or sentence that appears with a high frequency and attracts the attention of many users. The weights produced by a candidate item in a predetermined number of consecutive iterations are obtained, and it is judged whether they are all greater than the preset high-frequency hot word threshold, which can be set as required. If such a candidate item exists, proceed to step S602; otherwise, proceed to step S603.
Step S602: determining that the candidate item is a high-frequency hot word;
When the weights produced by a candidate item in the predetermined number of consecutive iterations are all greater than the preset high-frequency hot word threshold, the candidate item is determined to be a high-frequency hot word. The obtained high-frequency hot words can be used in many ways, for example added to prediction dictionaries to improve prediction coverage and accuracy, or used to update language models.
Preferably, based on the obtained high-frequency hot words or new words, related web page links or search links can be pushed to the users of the user type corresponding to the hot word or new word. For example, suppose the sentence "Yuan Fang, what do you think" is a high-frequency hot word; when it appears among the candidate items of a user's input information, a link to episodes or a plot introduction of "Amazing Detective Di Renjie", or a link introducing the relevant background, can be pushed to the user. This improves the click-through rate of the related web pages and also attracts users to further explore information related to the hot word.
Step S603: determining that there is no high-frequency hot word;
When no candidate item has weights produced in the predetermined number of consecutive iterations that are all greater than the preset high-frequency hot word threshold, it is judged that there is no high-frequency hot word among the current candidate items.
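As a concrete illustration of steps S601 to S603, the following minimal Python sketch checks whether a candidate item's weights in the last a consecutive iterations all exceed the hot-word threshold b; the function and variable names and the sample data are illustrative only and not part of the present embodiments.

```python
def is_hot_word(weight_history, a, b):
    """Return True if the candidate's weights in each of the last `a`
    iterations are all greater than the hot-word threshold `b`."""
    if len(weight_history) < a:
        return False  # not enough iterations observed yet
    return all(w > b for w in weight_history[-a:])

# Example: weights produced by iterations k-4 ... k for one candidate item
history = [0.08, 0.31, 0.34, 0.36, 0.33]
print(is_hot_word(history, a=3, b=0.3))  # True: the last 3 weights all exceed 0.3
```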
On the other hand, beyond obtaining candidate weights as in the above embodiments, in practical applications the selection counts of each candidate item in the current candidate item set can further be summed over all users (i.e. the number of times each candidate item has historically been selected) in response to a user's switching instruction, and the candidate items in the candidate item set can be sorted by the summed selection counts.
For example, the selection counts s_1, s_2, ..., s_m of the same candidate item c_i by m users u_1, u_2, ..., u_m when inputting the current user's current input information can be obtained; the summed selection count over the m users is E = s_1 + s_2 + ... + s_m, and the current candidate items are sorted by this sum and shown to the user.
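A minimal sketch of this summation and re-sorting, under the assumption that the per-user selection counts are already available in a simple mapping; all names and data below are illustrative.

```python
# per-user selection counts s_1..s_m for each candidate item (illustrative data)
selection_counts = {
    "candidate_a": {"u1": 12, "u2": 3, "u3": 7},   # E = 22
    "candidate_b": {"u1": 2,  "u2": 9, "u3": 1},   # E = 12
    "candidate_c": {"u1": 5,  "u2": 5, "u3": 5},   # E = 15
}

# E = s_1 + s_2 + ... + s_m for every candidate item
summed = {c: sum(per_user.values()) for c, per_user in selection_counts.items()}

# re-sort the candidate items by the summed selection count, descending
ranked = sorted(summed, key=summed.get, reverse=True)
print(ranked)  # ['candidate_a', 'candidate_c', 'candidate_b']
```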
Referring to Fig. 7, part A shows the result of sorting by candidate weight under a certain input method, together with the historical selection count of each candidate item and its new-word or hot-word mark; if the user issues the switching instruction for sorting by selection count, the candidate items are re-sorted by selection count. Part B of Fig. 7 shows the result of sorting by selection count in descending order.
In practical applications, sorting can be in ascending or descending order of selection count. For example, pressing the switching instruction once sorts and displays in ascending order of selection count, pressing it a second time sorts and displays in descending order, and pressing it a third time restores the original sorting and display by candidate weight. Of course, the sorting rules bound to the switching instruction above are only an example and can be configured as required in practice.
According to the above embodiments, when the candidate items are displayed to the user, any one or more of the historical selection count of each candidate item and its hot-word or new-word mark can be displayed at the same time.
Referring to Fig. 8, which shows the candidate display interface of a certain input method: in part A, the candidate items are shown together with their historical selection counts and their new-word or hot-word marks. The candidate item "Hua Yue" is marked as a new word; when the user is interested in the new word "Hua Yue" and moves the selection focus onto it, the picture shown in part B is displayed, namely the well-known scenic spot "Hua Yue Temple" corresponding to the place name, with an arrow below "Hua Yue Temple" indicating a hyperlink. When the user's focus reaches the arrow, the search link address for "Hua Yue Temple" shown in part C is presented; when the user clicks the arrow, the search results are displayed in a browser.
Of course, the above way of displaying candidate items is only an example; practical applications are not limited to it. For instance, the hyperlink associated with the new-word mark need not be indicated by an arrow; it could be a pointing-finger icon or something else. Likewise, the way of opening the link is not limited to clicking the hyperlink mark; any other existing way of opening a hyperlink, such as a shortcut key, can be used.
The language model mentioned in any of the above embodiments is an n-gram language model or an n-pos language model.
It should be noted that the user type classification involved in the above embodiments is described around individual users. The embodiments of the present invention are equally applicable to enterprise users. For simplicity, only the distinctive features of enterprise users are described here:
1. Each branch or department of an enterprise corresponds to a user type, and the enterprise as a whole also corresponds to a larger user type; the historical input information of each user type is collected accordingly by category, and the language models related to the user types are built and trained;
2. According to the business content of the enterprise, cell dictionaries of related types or high-frequency hot word links of related types are pushed.
Referring to Fig. 9, an embodiment of the device for sorting candidate items produced by an input method according to the present invention comprises a receiving module 11, a first acquisition module 12, a first sorting module 13 and a display module 14, wherein:
The receiving module 11 is configured to receive the current user's current input information via the input method and send the current user's current input information to the first acquisition module 12;
The receiving module 11 receives the current user's current input information via the input method, for example the pinyin characters, kana, or English words or sentences currently entered by the current user, and sends the received current input information to the first acquisition module 12.
The first acquisition module 12 is configured to obtain, according to the G established language models related to user types, the user type of each candidate item in the candidate item set of the current user's current input information, and to send these user types to the first sorting module 13, wherein G is a natural number;
The purpose of a language model (Language Model, LM) is to build a distribution describing the probability of occurrence of a given word sequence in the language. With a language model, one can determine which word sequence is more likely, or, given several words, predict the word most likely to appear next.
With the G different language models related to user types, the probability that a sentence, word, phrase or group of words belongs to each of the G user types can be determined; the larger the probability, the more likely the sentence, word, phrase or group of words belongs to that user type.
After the user inputs information, multiple corresponding candidate items are produced, forming a candidate item set; according to the G language models related to user types, the user type of each candidate item can be determined.
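A minimal sketch of this assignment, assuming each user-type language model is available as a function returning P_g(candidate); the stand-in models and names below are illustrative only.

```python
def candidate_user_type(candidate, type_language_models):
    """Assign a candidate item to the user type whose language model
    gives it the highest probability P_g(candidate)."""
    scores = {g: lm(candidate) for g, lm in type_language_models.items()}
    return max(scores, key=scores.get)

# illustrative stand-ins for trained per-type language models
type_language_models = {
    "hacker":            lambda c: 0.60 if c == "kernel" else 0.01,
    "comics enthusiast": lambda c: 0.40 if c == "manga"  else 0.02,
    "watch enthusiast":  lambda c: 0.05,
}
print(candidate_user_type("kernel", type_language_models))  # 'hacker'
```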
In embodiments of the present invention, the language model includes but is not limited to an n-gram language model or an n-pos language model.
The first sorting module 13 is configured to sort the candidate items of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, and to send the sorted candidate items to the display module 14;
If the user type of a candidate item belongs to the user type set of the current user, the correlation between the candidate item's user type and the current user's user types is large; if it does not, the correlation is small. This is only a rough judgement; when a more precise judgement of correlation is needed, the probability of the candidate item under the G different language models is also considered, where a larger probability means a larger correlation and a smaller probability means a smaller correlation.
A user may belong to more than one user type, i.e. to several different user types, hence a user type set. The user type set of a user can be obtained in at least two ways: first, the user type information of the user is stored in the system, where this information classifies the user according to the language models and determines the user's user type set; second, the user type set is determined from the language models when the user inputs information.
For example, suppose the user types of the current user are "watch enthusiast", "literature and art enthusiast", "comics enthusiast" and "online shopping enthusiast". If the user type set of a candidate item is {"watch enthusiast", "online shopping enthusiast"}, the correlation between the candidate item's user types and the current user's user types is large; if the user type set of a candidate item is {"hacker", "gourmet enthusiast"}, the correlation is small.
When sorting the candidate items, the first sorting module 13 can sort by correlation in descending order, ascending order or in other ways. In practical applications, descending order of correlation is preferred: with the candidate items most correlated with the user's user types placed first, the current user does not need to spend extra time checking the candidates one by one, which saves the time of selecting a candidate item.
The display module 14 is configured to display the candidate items sorted by the first sorting module.
After the candidate items are sorted, the sorted candidate items are displayed to the user by the display module 14, so that the user can select the desired candidate item.
Referring to Fig. 10, another embodiment of the device for sorting candidate items produced by an input method according to the present invention comprises a web corpus module 21, a first training module 22, a historical input information module 23, a second acquisition module 24, a receiving module 25, a first acquisition module 26, a first sorting module 27 and a display module 28, wherein:
The web corpus module 21 is configured to use text classification technology to classify and organize the historical input information of multiple users, obtaining G different user types and G classes of corpora related to the user types, and to send the obtained G different user types and G classes of corpora related to the user types to the first training module 22;
Text classification is the process of having a computer automatically determine the category of a text according to its content, under a given taxonomy. In the present embodiment, text classification technology is used to divide the obtained historical input information of multiple users into several groups, each group corresponding to a user type, so that each group represents a different user type, and the corpora related to the different user types are collected.
The historical input information of a user can include any one or more of the historical input information in input method applications, the historical input information in instant messaging tools, and the historical input information in social networking sites.
For example, when using a Japanese input method product, the user's historical input information is uploaded to a server; on an instant messaging tool such as Twitter, the user's historical input is collected in chronological order; on a social networking site such as Facebook, the user's historical input is likewise collected in chronological order.
The corpora related to user types are obtained from the historical input information of individual users through comparison and merging. A single user's historical input information may contain corpora of multiple user types; by comparison with other users' historical input information, corpora of similar or identical user types are merged together to form the corpus related to a given user type.
For example, if one user's historical input information contains a large number of statements about "Rolex" and another user's historical input information contains a large number of statements about "Vacheron Constantin", the two users can be merged into the user type "watch enthusiast", and their historical input information can then be merged as the corpus related to "watch enthusiast".
Using text classification technology, the web corpus module 21 classifies and organizes the historical input information of multiple users to obtain G different user types and G classes of corpora related to user types, and sends them to the first training module 22.
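The present embodiments do not prescribe a particular text-classification algorithm; as one hedged illustration, the grouping of user histories into per-type corpora could look like the following keyword-based sketch, where the keyword lists, names and data are assumptions made purely for the example.

```python
# Illustrative only: each user history is assigned to the user type whose
# keyword list it matches most often, and per-type corpora are accumulated.
TYPE_KEYWORDS = {
    "watch enthusiast": {"rolex", "vacheron", "movement", "chronograph"},
    "hacker":           {"kernel", "exploit", "compiler", "shell"},
}

def build_type_corpora(user_histories):
    corpora = {g: [] for g in TYPE_KEYWORDS}
    for history in user_histories:            # history: list of sentences for one user
        tokens = {w.lower() for s in history for w in s.split()}
        hits = {g: len(tokens & kw) for g, kw in TYPE_KEYWORDS.items()}
        best = max(hits, key=hits.get)
        if hits[best] > 0:
            corpora[best].extend(history)      # merge this user's history into the type corpus
    return corpora

corpora = build_type_corpora([
    ["my rolex movement needs service", "looking at a vacheron"],
    ["patched the kernel exploit", "rebuilt the compiler"],
])
print({g: len(c) for g, c in corpora.items()})  # {'watch enthusiast': 2, 'hacker': 2}
```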
The first training module 22 is configured to train, from the G classes of corpora related to user types, G different language models related to user types, each trained according to its corresponding user type;
From each class of corpus related to a user type, a language model related to that user type can be trained, for example a language model for hackers, a language model for comics enthusiasts, a language model for watch enthusiasts, a language model for online shopping enthusiasts, and so on.
The first training module 22 trains, from the G classes of corpora related to user types, G different language models related to user types, each according to the user type corresponding to its corpus.
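As one possible, simplified realization of this training step, each user type's corpus could be turned into a small bigram language model with add-one smoothing; the corpora and names below are illustrative, and in practice the n-gram or n-pos models could be trained with any standard toolkit.

```python
from collections import Counter

def train_bigram_lm(corpus_sentences):
    """Train a tiny bigram language model with add-one smoothing
    from one user type's corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)

    def prob(sentence):
        """P_g(sentence) under this user type's model."""
        toks = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for w1, w2 in zip(toks, toks[1:]):
            p *= (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)
        return p
    return prob

# one language model per user type, trained on that type's (illustrative) corpus
type_corpora = {
    "watch enthusiast": ["the rolex movement is smooth", "new vacheron release"],
    "hacker":           ["patched the kernel exploit", "rebuilt the compiler"],
}
type_language_models = {g: train_bigram_lm(c) for g, c in type_corpora.items()}
print(type_language_models["hacker"]("kernel exploit"))
```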
With the G language models related to user types, the user type of each candidate item in the candidate item set of the current user's current input information can be obtained. For example, given the four language models for hackers, comics enthusiasts, watch enthusiasts and online shopping enthusiasts, if the probabilities of a certain candidate item under the four language models are 0.6, 0.4, 0.01 and 0.008 respectively, the user type of that candidate item is the one with the largest probability, i.e. hacker.
On the other hand, with the G language models related to user types, the user type set of the current user can also be obtained.
The historical input information module 23 is configured to obtain the current user's historical input information and output the current user's historical input information to the second acquisition module 24;
The current user's historical input information objectively reflects the information the user pays attention to, and a user may pay attention to information of multiple types. Moreover, the types of information a user pays attention to change over time; for example, in one period the current user may pay attention to watch-related and comics-related information, while in another period the current user may pay attention to computer-related and gourmet-related information. The historical input information module 23 obtains the current user's historical input information and outputs it to the second acquisition module 24.
The second acquisition module 24 is configured to classify the current user according to the current user's historical input information using the G established language models related to user types, obtaining the user type set of the current user;
The second acquisition module 24 classifies the current user according to the current user's historical input information using the G established language models related to user types, thereby obtaining the current user's user type set. The user's historical input information has a corresponding probability under the language model of one or several user types; the larger the probability, the more likely the current user belongs to that user type. Normally the user's user types can be determined by the size of these probabilities: if the probability of the current user's historical input information under the language models of one or several user types is larger than its probability under the language models of the other user types, the current user is determined to belong to that one or those several user types.
Classifying users through the second acquisition module 24 serves two main purposes: 1. mitigating the negative influence of a single user's sparse historical input log on the learning algorithm that mines the user's input behaviour; 2. automatically identifying and collecting the input log information of users of the same type, so that users of the same user type implicitly "share" each other's input information, giving users a better input experience.
The receiving module 25 is configured to receive the current user's current input information via the input method and send the current user's current input information to the first acquisition module 26;
The first acquisition module 26 is configured to obtain, according to the G established language models related to user types, the user type of each candidate item in the candidate item set of the current user's current input information, and to send these user types to the first sorting module 27, wherein G is a natural number;
The first sorting module 27 is configured to sort the candidate items of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, and to send the sorted candidate items to the display module 28;
The display module 28 is configured to display the candidate items sorted by the first sorting module.
Referring to Fig. 11, a further embodiment of the device for sorting candidate items produced by an input method according to the present invention comprises a web corpus module 31, a first training module 32, a third acquisition module 33, a selection module 34, a labeling module 35, a second training module 36, a classification module 37, a receiving module 38, a first acquisition module 39, a first sorting module 40, a display module 41, a judging module 42, a pushing module 43 and a second sorting module 44, wherein:
The web corpus module 31 is configured to use text classification technology to classify and organize the historical input information of multiple users, obtaining G different user types and G classes of corpora related to the user types, and to send them to the first training module 32;
The first training module 32 is configured to train, from the G classes of corpora related to user types, G different language models related to user types, each according to its corresponding user type;
The third acquisition module 33 is configured to obtain the historical input information of multiple users and send the historical input information of the multiple users to the selection module 34;
The third acquisition module 33 obtains the historical input information of multiple users, who belong to the G different user types, and sends the obtained historical input information to the selection module 34.
The selection module 34 is configured to select part of the historical input information from the obtained historical input information of the multiple users and send the selected part of the historical input information to the labeling module 35;
The labeling module 35 is configured to manually label the selected part of the historical input information, obtaining the manually labeled corpora of the multiple users, and to send the obtained labeled corpora to the second training module 36;
By labeling the selected part of the historical input information, the labeling module 35 obtains more accurate corpora related to the user types, which makes the classification of users more accurate.
The second training module 36 is configured to train, from the manually labeled corpora of the multiple users and the G different user types, a user classifier related to user types using a supervised machine learning method, according to the respective user types;
Machine learning (Machine Learning) studies how computers can simulate or realize human learning behaviour in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. It is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications pervade all fields of artificial intelligence, and it mainly uses induction and synthesis rather than deduction. A supervised machine learning method here means using the labeled corpus to automatically classify the remaining unlabeled corpus, so that more accurate classification results can be obtained with less labeling effort.
When training the user classifier related to user types, combining the labeled corpora related to user types with each user's individual corpus allows a more accurate user classifier related to user types to be trained.
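One way such a supervised training step could be realized is sketched below with scikit-learn; the patent does not name a specific learning algorithm, and the labeled data, pipeline and names are illustrative assumptions.

```python
# A possible realization of the supervised training step using scikit-learn
# (illustrative only; any supervised text classifier could play this role).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# manually labeled historical input: (text, user type)
labeled = [
    ("my rolex movement needs a service", "watch enthusiast"),
    ("looking at a vacheron constantin",  "watch enthusiast"),
    ("patched the kernel exploit today",  "hacker"),
    ("rebuilt the compiler from source",  "hacker"),
]
texts, labels = zip(*labeled)

user_classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
user_classifier.fit(texts, labels)

# the trained classifier can then label the remaining, unlabeled histories
print(user_classifier.predict(["my vacheron needs a new movement"]))
# expected to favor 'watch enthusiast' given the overlapping vocabulary
```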
The classification module 37 is configured to classify the current user according to the current user's historical input information using the user classifier related to user types obtained by the second training module 36, obtaining the user type set of the current user;
The receiving module 38 is configured to receive the current user's current input information via the input method and send the current user's current input information to the first acquisition module 39;
The first acquisition module 39 is configured to obtain, according to the G established language models related to user types, the user type of each candidate item in the candidate item set of the current user's current input information, and to send these user types to the first sorting module 40, wherein G is a natural number;
The first sorting module 40 is configured to sort the candidate items of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, and to send the sorted candidate items to the display module 41;
Referring to Fig. 12, in another embodiment of the device for sorting candidate items produced by an input method according to the present invention, the first sorting module further comprises a weight acquisition unit 111 and a sorting unit 112:
The weight acquisition unit 111 is configured to obtain the weight of each candidate item in the candidate item set of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the user type set of the current user, and to send the weights of the candidate items in the candidate item set to the sorting unit 112;
The degree of correlation between the user type of each candidate item and the user types in the user type set of the current user is obtained; then, according to the weight of each user type in the current user's user type set, the weight of each candidate item in the candidate item set of the current user's current input information can be obtained.
For example, if the user type of one candidate item is "hacker" and the user type set of the current user is {"hacker", "watch enthusiast"}, the correlation between the candidate item's user type and the user types in the current user's user type set is 1 (they overlap). If the weight of "hacker" in the current user's user type set is 0.35, the weight of this candidate item is 0.35. If the user type of another candidate item is "watch enthusiast", the correlation is likewise 1; if the weight of "watch enthusiast" in the current user's user type set is 0.2, the weight of that candidate item is 0.2.
Referring to Fig. 13, in an embodiment of the device for sorting candidate items produced by an input method according to the present invention, the weight acquisition unit of the first sorting module comprises a first acquisition subunit 211, a second acquisition subunit 212, a third acquisition subunit 213 and a fourth acquisition subunit 214, wherein:
The first acquisition subunit 211 is configured to obtain the selection counts s_1, s_2, ..., s_m of the same candidate item c_i by m users u_1, u_2, ..., u_m when inputting the current user's current input information, wherein the m users belong to G different user types;
The second acquisition subunit 212 is configured to obtain, for the G different user types, the weight weight(c_i, g) of candidate item c_i in user type g, namely:
weight(c_i, g) = P_g(c_i) / Σ_{g∈G} P_g(c_i)
wherein P_g(c_i) denotes the probability of candidate item c_i under the language model corresponding to user type g;
The third acquisition subunit 213 is configured to obtain the weight weight(u_m, g) with which user u_m belongs to user type g, namely:
weight(u_m, g) = P_g(log of u_m) / Σ_{g∈G} P_g(log of u_m)
wherein P_g(log of u_m) denotes the probability of user u_m's input log text under the language model corresponding to user type g;
The fourth acquisition subunit 214 is configured to obtain, from the weight weight(c_i, g), the weight weight(u_m, g) and the user type set g_m of user u_m, the weight weight_k(c_i, u_m) of each candidate item in the candidate item set of the current user's current input information, namely:
weight_k(c_i, u_m) = [ Σ_{g∈g_m} weight(c_i, g) × s_m × weight(u_m, g) ] / [ Σ_{g∈g_m} weight(u_m, g) ] − cost_k(c_i, u_m)
wherein k denotes the k-th iteration, cost_k(c_i, u_m) is the cost of candidate item c_i for user u_m, and cost_{k+1}(c_i, u_m) = −weight_k(c_i, u_m).
It should be noted that obtaining the weight weight(c_i, g) of candidate item c_i in user type g and obtaining the weight weight(u_m, g) with which user u_m belongs to user type g do not follow a strict order; in actual processing, either weight can be obtained first according to the actual situation.
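For concreteness, the three formulas above could be implemented as in the following sketch, where the per-type language-model probabilities P_g(·) are stand-in lambdas and all names are illustrative.

```python
def candidate_type_weight(c_i, g, P):
    """weight(c_i, g) = P_g(c_i) / sum over g' in G of P_g'(c_i)."""
    total = sum(P[g2](c_i) for g2 in P)
    return P[g](c_i) / total

def user_type_weight(user_log, g, P):
    """weight(u_m, g) = P_g(log of u_m) / sum over g' in G of P_g'(log of u_m)."""
    total = sum(P[g2](user_log) for g2 in P)
    return P[g](user_log) / total

def candidate_weight(c_i, user_log, user_types, s_m, P, prev_cost=0.0):
    """weight_k(c_i, u_m): weighted combination over the user's type set g_m,
    minus the cost carried over from the previous iteration."""
    num = sum(candidate_type_weight(c_i, g, P) * s_m * user_type_weight(user_log, g, P)
              for g in user_types)
    den = sum(user_type_weight(user_log, g, P) for g in user_types)
    return num / den - prev_cost

# illustrative stand-ins for the per-type language-model probabilities P_g(.)
P = {
    "hacker":           lambda text: 0.6 if "kernel" in text else 0.1,
    "watch enthusiast": lambda text: 0.5 if "rolex"  in text else 0.1,
}
w = candidate_weight("kernel", user_log="kernel rolex logs",
                     user_types=["hacker", "watch enthusiast"],
                     s_m=3, P=P, prev_cost=0.0)
print(round(w, 3))
next_cost = -w  # cost_{k+1}(c_i, u_m) = -weight_k(c_i, u_m) feeds the next iteration
```

In a real system the probabilities would come from the trained n-gram or n-pos models rather than stand-in lambdas.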
The sorting unit 112 is configured to sort the candidate items in the candidate item set of the current user's current input information according to the weights of the candidate items received from the weight acquisition unit 111.
According to the weights of the candidate items in the candidate item set, the sorting unit 112 can sort the candidate items of the current user's current input information in descending order of weight, ascending order or in other ways; the preferred way is to sort the candidate items in descending order of weight.
The display module 41 is configured to display the candidate items sorted by the first sorting module.
In practical applications, the candidate weights obtained through the above embodiments can also be used to determine high-frequency hot words or new words.
Still referring to Fig. 11, the device for sorting candidate items produced by an input method of the present embodiment further comprises a judging module 42 and a pushing module 43, wherein:
The judging module 42 is configured to determine, according to the weights of the candidate items in the candidate item set, whether the candidate item set contains high-frequency hot words or new words;
The judging module 42 can determine that a candidate item is a new word when the change of the weight produced by the candidate item's current iteration relative to the weight produced by the previous iteration is greater than the preset new word threshold.
For example, the change between the weights produced by the (k−1)-th and k-th iterations of candidate item c_i can be computed: if weight_k(c_i, u_m) − weight_{k−1}(c_i, u_m) > θ, where θ is the preset new word threshold, the candidate item is a new word; the threshold θ can be adjusted according to the overall number of new words.
For example, suppose we allow at most 1000 new words in total across all kana; the threshold is then chosen by filtering down to this scale, i.e. after filtering with this threshold roughly 1000 new words remain and are pushed to users.
Here, the iteration period can be chosen according to actual conditions and the volume of updated user data, for example one iteration per week, so that "new words of the week" can be released on a weekly basis; similarly, the iteration unit can be set to one month, one quarter, and so on, yielding "new words of the month", "new words of the quarter", etc.
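A minimal sketch of this new-word test, comparing the weights of the current and previous iterations against the threshold θ; the data and names are illustrative only.

```python
def detect_new_words(weights_k, weights_k_minus_1, theta):
    """A candidate is a new word when the weight produced by the current
    iteration exceeds the previous iteration's weight by more than theta."""
    return [c for c, w in weights_k.items()
            if w - weights_k_minus_1.get(c, 0.0) > theta]

prev = {"spider-man": 0.02, "hello": 0.40}
curr = {"spider-man": 0.35, "hello": 0.41}
# theta would in practice be tuned so that roughly the allowed number of
# new words (e.g. about 1000 across all kana) survives the filter
print(detect_new_words(curr, prev, theta=0.2))  # ['spider-man']
```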
On the other hand, the judging module 42 can also be configured to determine that a candidate item is a high-frequency hot word when the weights produced by the candidate item in a predetermined number of consecutive iterations are all greater than the preset high-frequency hot word threshold.
For example, the weights of candidate item c_i over a consecutive iterations can be computed: if weight_{k−a+1}(c_i, u_m) > b, ..., weight_k(c_i, u_m) > b, then candidate item c_i is a high-frequency hot word; here a and b can be set according to the desired number of high-frequency hot words.
For example, if we allow at most 2000 hot words in total across all kana, the values of a and b are determined from that final hot-word scale. Note also that, in order to filter everyday expressions out of the "hot words", we can require that roughly 80% of the "hot words" come from "new words", i.e. that there is a conversion process from "new word" to "hot word", while the remaining roughly 20% come from everyday expressions (for example the daily greeting "お疲れ様です", meaning "you have had a long day"). Depending on the number of iterations, "monthly hot words", "seasonal hot words", "annual hot words" and so on can be defined.
The pushing module 43 is configured to push the link corresponding to a high-frequency hot word or new word to the users of the user type to which the high-frequency hot word or new word belongs.
Based on the obtained high-frequency hot words or new words, the pushing module 43 can push related web page links or search links to the users of the user type corresponding to the hot word or new word.
For example, suppose the sentence "Yuan Fang, what do you think" is a high-frequency hot word; when it appears among the candidate items of a user's input information, a link to episodes or a plot introduction of "Amazing Detective Di Renjie", or a link introducing the relevant background, can be pushed to the user. This improves the click-through rate of the related web pages and also attracts users to further explore information related to the hot word.
Further, still referring to Fig. 11, the device of the present embodiment can also comprise a second sorting module 44, configured to sort the candidate items according to the summed result of the selection counts s_1, s_2, ..., s_m in response to the current user's switching instruction, and to send the candidate items sorted by the summed result to the display module 41.
Beyond obtaining candidate weights as in the above embodiments, in practical applications the second sorting module 44 can further, in response to a user's switching instruction, sum the selection counts of each candidate item in the current candidate item set over all users (i.e. the number of times each candidate item has historically been selected) and sort the candidate items in the candidate item set by the summed selection counts.
For example, the selection counts s_1, s_2, ..., s_m of the same candidate item c_i by m users u_1, u_2, ..., u_m when inputting the current user's current input information can be obtained; the summed selection count over the m users is E = s_1 + s_2 + ... + s_m, and the current candidate items are sorted by this sum and shown to the user.
When displaying the candidate items to the user, the display module 41 can simultaneously display any one or more of the historical selection count of each candidate item and its hot-word or new-word mark.
The language model mentioned in the above embodiments can be an n-gram language model or an n-pos language model.
From the description of the above embodiments it can be understood that, in the method and device for sorting candidate items produced by an input method according to the present invention, the user type set of the current user and the user type of each candidate item in the candidate item set of the current user's current input information are obtained, and the candidate items of the current user's current input information are sorted according to the correlation between the user type of each candidate item and the user types in the current user's user type set. Because the user types are derived from users' historical input information and different types of users pay attention to different candidate items, the user type is taken into account when sorting the candidate items. In this way, different sequences of candidate items are pushed to different types of users when they input the same characters, which reduces the time users spend selecting candidates and improves the user experience.
On the other hand, high-frequency hot words are determined from the candidate weights. The obtained high-frequency hot words have many applications, for example being added to prediction dictionaries to improve prediction coverage and accuracy, or being used to update language models; related web page links or search links can also be pushed to the users of the user type corresponding to a high-frequency hot word, which improves the click-through rate of the related web pages and attracts users to further explore information related to the hot word.
Further, the candidate items can be sorted by their historical selection counts in response to a user's switching instruction. At the same time, when displaying the candidate items, any one or more of the historical selection count and the hot-word or new-word mark of each candidate item can be displayed as required, giving the user a richer input experience.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device and method can be implemented in other ways. For example, the device embodiments described above are only illustrative; the division into modules or units is only a logical functional division, and other divisions are possible in actual implementation, for example multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be realized via interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they can be located in one place or distributed over multiple network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional units in the embodiments of the present invention can be integrated in one processing unit, or each unit can exist physically on its own, or two or more units can be integrated in one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the method described in each embodiment of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc.
The foregoing are only embodiments of the present invention and do not thereby limit the scope of the claims of the present invention; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect use in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (27)

1. A method for sorting candidate items produced by an input method, characterized by comprising:
receiving a current user's current input information via the input method;
obtaining, according to G established language models related to user types, the user type of each candidate item in the candidate item set of the current user's current input information;
sorting the candidate items of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the obtained user type set of the current user;
displaying the sorted candidate items.
2. The method according to claim 1, characterized in that, before the step of obtaining, according to the G established language models related to user types, the user type of each candidate item in the candidate item set of the current user's current input information, the method further comprises:
using text classification technology to classify and organize the historical input information of multiple users, obtaining G different user types and G classes of corpora related to the user types;
training, from the G classes of corpora related to the user types, G different language models related to user types, each according to its corresponding user type.
3. The method according to claim 1, characterized in that, before the step of sorting the candidate items of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, the method further comprises:
obtaining the current user's historical input information;
classifying the current user according to the current user's historical input information using the G established language models related to user types, obtaining the user type set of the current user.
4. The method according to claim 2, characterized in that, before the step of sorting the candidate items of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, the method further comprises:
obtaining the historical input information of multiple users, the multiple users belonging to the G different user types;
selecting part of the historical input information from the obtained historical input information of the multiple users;
labeling the selected part of the historical input information to obtain labeled corpora of the multiple users;
training, from the labeled corpora of the multiple users and the G different user types, a user classifier related to user types using a supervised machine learning method, according to the respective user types;
classifying the current user according to the current user's historical input information using the user classifier related to user types, obtaining the user type set of the current user.
5. The method according to any one of claims 3-5, characterized in that the historical input information includes any one or more of the historical input information in an input method application, the historical input information in an instant messaging tool and the historical input information in a social networking site.
6. The method according to claim 1, characterized in that the step of sorting the candidate items of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the obtained user type set of the current user comprises:
obtaining the weight of each candidate item in the candidate item set of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the user type set of the current user;
sorting the candidate items in the candidate item set of the current user's current input information according to the weights of the candidate items in the candidate item set.
7. The method according to claim 6, characterized in that the step of obtaining the weight of each candidate item in the candidate item set of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the user type set of the current user comprises:
obtaining the selection counts s_1, s_2, ..., s_m of the same candidate item c_i by m users u_1, u_2, ..., u_m when inputting the current user's current input information, wherein the m users belong to G different user types;
obtaining, for the G different user types, the weight weight(c_i, g) of the candidate item c_i in user type g, namely:
weight(c_i, g) = P_g(c_i) / Σ_{g∈G} P_g(c_i)
wherein P_g(c_i) denotes the probability of candidate item c_i under the language model corresponding to user type g;
obtaining the weight weight(u_m, g) with which user u_m belongs to user type g, namely:
weight(u_m, g) = P_g(log of u_m) / Σ_{g∈G} P_g(log of u_m)
wherein P_g(log of u_m) denotes the probability of user u_m's input log text under the language model corresponding to user type g;
obtaining, from the weight weight(c_i, g), the weight weight(u_m, g) and the user type set g_m of user u_m, the weight weight_k(c_i, u_m) of each candidate item in the candidate item set of the current user's current input information, namely:
weight_k(c_i, u_m) = [ Σ_{g∈g_m} weight(c_i, g) × s_m × weight(u_m, g) ] / [ Σ_{g∈g_m} weight(u_m, g) ] − cost_k(c_i, u_m)
wherein k denotes the k-th iteration, cost_k(c_i, u_m) is the cost of candidate item c_i for user u_m, and cost_{k+1}(c_i, u_m) = −weight_k(c_i, u_m).
8. The method according to claim 7, characterized in that, after the step of sorting the candidate items in the candidate item set of the current user's current input information according to the weights of the candidate items in the candidate item set, the method further comprises:
judging, according to the weights of the candidate items in the candidate item set, whether the candidate item set contains high-frequency hot words or new words.
9. The method according to claim 8, characterized in that the step of judging, according to the weights of the candidate items in the candidate item set, whether the candidate item set contains high-frequency hot words comprises:
if the weights of a candidate item in the candidate item set produced in a predetermined number of consecutive iterations are all greater than a preset high-frequency hot word threshold, determining that the candidate item is a high-frequency hot word.
10. The method according to claim 8, characterized in that the step of judging, according to the weights of the candidate items in the candidate item set, whether the candidate item set contains new words comprises:
if the change of the weight produced by the current iteration of a candidate item in the candidate item set relative to the weight produced by the previous iteration is greater than a preset new word threshold, determining that the candidate item is a new word.
11. The method according to claim 8, characterized in that, after the step of judging whether the candidate item set contains high-frequency hot words or new words, the method further comprises:
if the candidate item set contains a high-frequency hot word or new word, pushing the link corresponding to the high-frequency hot word or new word to the users of the user type to which the high-frequency hot word or new word belongs.
12. The method according to claim 8, characterized in that the step of displaying the sorted candidate items comprises:
displaying the sorted candidate items together with their new-word or high-frequency hot word marks.
13. The method according to claim 7, characterized in that, after the step of displaying the sorted candidate items, the method further comprises:
sorting the candidate items according to the summed result of the selection counts s_1, s_2, ..., s_m in response to the current user's switching instruction;
displaying the candidate items sorted by the summed result together with the summed result.
14. The method according to any one of claims 1-4, characterized in that the language model is an n-gram language model or an n-pos language model.
15. A device for sorting candidate items produced by an input method, characterized by comprising a receiving module, a first acquisition module, a first sorting module and a display module, wherein:
the receiving module is configured to receive a current user's current input information via the input method and send the current user's current input information to the first acquisition module;
the first acquisition module is configured to obtain, according to G established language models related to user types, the user type of each candidate item in the candidate item set of the current user's current input information, and to send the user type of each candidate item in the candidate item set of the current user's current input information to the first sorting module, wherein G is a natural number;
the first sorting module is configured to sort the candidate items of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the obtained user type set of the current user, and to send the sorted candidate items to the display module;
the display module is configured to display the candidate items sorted by the first sorting module.
16. The device according to claim 15, characterized in that the device further comprises a web corpus module and a first training module, wherein:
the web corpus module is configured to use text classification technology to classify and organize the historical input information of multiple users, obtaining G different user types and G classes of corpora related to the user types, and to send the obtained G different user types and G classes of corpora related to the user types to the first training module;
the first training module is configured to train, from the G classes of corpora related to the user types, G different language models related to user types, each according to its corresponding user type.
17. The device according to claim 15, characterized in that the device further comprises a historical input information module and a second acquisition module, wherein:
the historical input information module is configured to obtain the current user's historical input information and output the current user's historical input information to the second acquisition module;
the second acquisition module is configured to classify the current user according to the current user's historical input information using the G established language models related to user types, obtaining the user type set of the current user.
18. The device according to claim 17, characterized in that the device further comprises a third acquisition module, a selection module, a labeling module, a second training module and a classification module, wherein:
the third acquisition module is configured to obtain the historical input information of multiple users and send the historical input information of the multiple users to the selection module;
the selection module is configured to select part of the historical input information from the obtained historical input information of the multiple users and send the selected part of the historical input information to the labeling module;
the labeling module is configured to label the selected part of the historical input information, obtaining labeled corpora of the multiple users, and to send the obtained labeled corpora of the multiple users to the second training module;
the second training module is configured to train, from the labeled corpora of the multiple users and the G different user types, a user classifier related to user types using a supervised machine learning method, according to the respective user types;
the classification module is configured to classify the current user according to the current user's historical input information using the user classifier related to user types obtained by the second training module, obtaining the user type set of the current user.
19. The device according to any one of claims 16-18, characterized in that the historical input information includes any one or more of the historical input information in an input method application, the historical input information in an instant messaging tool and the historical input information in a social networking site.
20. The device according to claim 15, characterized in that the first sorting module comprises a weight acquisition unit and a sorting unit:
the weight acquisition unit is configured to obtain the weight of each candidate item in the candidate item set of the current user's current input information according to the degree of correlation between the user type of each candidate item and the user types in the user type set of the current user, and to send the weights of the candidate items in the candidate item set to the sorting unit;
the sorting unit is configured to sort the candidate items in the candidate item set of the current user's current input information according to the weights of the candidate items in the candidate item set received from the weight acquisition unit.
21. The device according to claim 20, characterized in that the weight acquisition unit comprises a first acquisition subunit, a second acquisition subunit, a third acquisition subunit and a fourth acquisition subunit, wherein:
the first acquisition subunit is configured to obtain the selection counts s_1, s_2, ..., s_m of the same candidate item c_i by m users u_1, u_2, ..., u_m when inputting the current user's current input information, wherein the m users belong to G different user types;
the second acquisition subunit is configured to obtain, for the G different user types, the weight weight(c_i, g) of the candidate item c_i in user type g, namely:
weight(c_i, g) = P_g(c_i) / Σ_{g∈G} P_g(c_i)
wherein P_g(c_i) denotes the probability of candidate item c_i under the language model corresponding to user type g;
the third acquisition subunit is configured to obtain the weight weight(u_m, g) with which user u_m belongs to user type g, namely:
weight(u_m, g) = P_g(log of u_m) / Σ_{g∈G} P_g(log of u_m)
wherein P_g(log of u_m) denotes the probability of user u_m's input log text under the language model corresponding to user type g;
the fourth acquisition subunit is configured to obtain, from the weight weight(c_i, g), the weight weight(u_m, g) and the user type set g_m of user u_m, the weight weight_k(c_i, u_m) of each candidate item in the candidate item set of the current user's current input information, namely:
weight_k(c_i, u_m) = [ Σ_{g∈g_m} weight(c_i, g) × s_m × weight(u_m, g) ] / [ Σ_{g∈g_m} weight(u_m, g) ] − cost_k(c_i, u_m)
wherein k denotes the k-th iteration, cost_k(c_i, u_m) is the cost of candidate item c_i for user u_m, and cost_{k+1}(c_i, u_m) = −weight_k(c_i, u_m).
22. devices according to claim 21, is characterized in that, also comprise judge module, for according to the size of the weights of the each candidate item of described candidate item set, determine in candidate item set, whether have high frequency heat word or neologisms.
23. devices according to claim 22, it is characterized in that, described judge module, specifically in the time that the described weights that predetermined time iteration produces continuously of candidate item in candidate item set are all greater than the threshold value of default high frequency heat word, determines that described candidate item is high frequency heat word; Or variable quantity compared with the described weights that produce with a front iteration specifically for the described weights that produce when candidate item current iteration in candidate item set is while being greater than the threshold value of default neologisms, determines that described candidate item is neologisms.
24. The device according to claim 22, characterized in that the device further comprises a pushing module configured to push a link corresponding to the high-frequency hot word or the new word to users of the user type to which the high-frequency hot word or new word belongs.
25. The device according to claim 22, characterized in that the display module is specifically configured to display the sorted candidate items together with the new word or high-frequency hot word marks of the candidate items.
26. The device according to claim 21, characterized in that the device further comprises a second sorting module configured to, according to a switching instruction of the current user, sort the candidate items according to the summed result of the selection counts $s_1, s_2, \ldots, s_m$ and send the candidate items sorted by the summed result to the display module;
the display module is specifically configured to display the candidate items sorted by the summed result together with the summed result.
27. The device according to claim 15 or 16, characterized in that the language model is an n-gram language model or an n-pos language model.
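Claims 20, 21 and 27 define the candidate ranking in terms of two normalised affinities, $\mathrm{weight}(c_i, g)$ and $\mathrm{weight}(u_m, g)$, an iterative cost term, and user-group language models that may be n-gram or n-pos models. The Python sketch below is only one possible reading of those formulas, not the claimed device itself: the probability-function dictionaries `P` and `P_log`, the `types_of_user` mapping, the zero initial cost and the iteration count are illustrative assumptions.

```python
from collections import defaultdict

def candidate_type_weights(P, c_i, user_types):
    # weight(c_i, g): affinity of candidate c_i to each user type g,
    # normalised over all G user types (claim 21, second acquiring subunit).
    total = sum(P[g](c_i) for g in user_types)
    return {g: P[g](c_i) / total for g in user_types}

def user_type_weights(P_log, u_m, user_types):
    # weight(u_m, g): degree to which user u_m belongs to each user type g,
    # from the probability of u_m's input log under each type's language
    # model (claim 21, third acquiring subunit).
    total = sum(P_log[g](u_m) for g in user_types)
    return {g: P_log[g](u_m) / total for g in user_types}

def iterate_candidate_weights(c_i, users, selections, types_of_user,
                              P, P_log, user_types, iterations=3):
    # weight_k(c_i, u_m) of claim 21, fourth acquiring subunit, iterated with
    # cost_{k+1}(c_i, u_m) = -weight_k(c_i, u_m); cost_1 = 0 and the number
    # of iterations are assumptions made for this sketch.
    w_c = candidate_type_weights(P, c_i, user_types)
    w_u = {u: user_type_weights(P_log, u, user_types) for u in users}
    cost = defaultdict(float)
    history = []
    for _ in range(iterations):
        weight = {}
        for u_m, s_m in zip(users, selections):
            g_m = types_of_user[u_m]             # user type set of user u_m
            norm = sum(w_u[u_m][g] for g in g_m)
            weight[u_m] = (sum(w_c[g] * s_m * w_u[u_m][g] for g in g_m) / norm
                           - cost[u_m])
            cost[u_m] = -weight[u_m]             # feeds the next iteration
        history.append(weight)
    return history                                # one weight dict per iteration
```

Under this reading, the sorting unit of claim 20 simply orders the candidate set by the latest weights, and the per-iteration history is the quantity the judging module of claims 22 and 23 inspects.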
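Claim 23 specifies two threshold tests over the weights produced by successive iterations. A minimal sketch of that logic, assuming a hypothetical per-candidate weight history and freely chosen threshold and window values:

```python
def classify_candidate(weight_history, hot_threshold, new_threshold, window=5):
    # High-frequency hot word: the weights from `window` consecutive
    # iterations all exceed the preset hot-word threshold (claim 23).
    if len(weight_history) >= window and \
            all(w > hot_threshold for w in weight_history[-window:]):
        return "high-frequency hot word"
    # New word: the change from the previous iteration's weight to the
    # current one exceeds the preset new-word threshold (claim 23).
    if len(weight_history) >= 2 and \
            weight_history[-1] - weight_history[-2] > new_threshold:
        return "new word"
    return None

# e.g. classify_candidate([0.81, 0.84, 0.83, 0.86, 0.88],
#                         hot_threshold=0.8, new_threshold=0.1)
# -> "high-frequency hot word"
```

A non-None label here corresponds to the mark displayed per claim 25 and would trigger the pushing module of claim 24.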
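Claim 26 falls back, on a switching instruction from the user, to ranking by the selection counts summed over all m users. A small illustrative sketch; the candidate labels and counts are invented:

```python
def sort_by_total_selections(selection_counts):
    # Rank candidates by s_1 + s_2 + ... + s_m, highest total first, keeping
    # the total so the display module can show it next to each candidate.
    totals = {c: sum(counts) for c, counts in selection_counts.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# e.g. sort_by_total_selections({"c1": [3, 1, 0], "c2": [5, 2, 4], "c3": [1, 0, 1]})
# -> [("c2", 11), ("c1", 4), ("c3", 2)]
```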
CN201210531929.8A 2012-12-11 2012-12-11 Method and device for sorting candidate items generated by an input method Active CN103870000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210531929.8A CN103870000B (en) 2012-12-11 2012-12-11 Method and device for sorting candidate items generated by an input method

Publications (2)

Publication Number Publication Date
CN103870000A true CN103870000A (en) 2014-06-18
CN103870000B CN103870000B (en) 2018-12-14

Family

ID=50908620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210531929.8A Active CN103870000B (en) Method and device for sorting candidate items generated by an input method

Country Status (1)

Country Link
CN (1) CN103870000B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1920827A (en) * 2006-08-23 2007-02-28 北京搜狗科技发展有限公司 Method for obtaining newly encoded character string, input method system and word stock generation device
CN101271459A (en) * 2007-03-22 2008-09-24 北京搜狗科技发展有限公司 Word library generation method, input method and input method system
CN101520784A (en) * 2008-02-29 2009-09-02 富士通株式会社 Information issuing system and information issuing method
CN101645088A (en) * 2008-08-05 2010-02-10 北京搜狗科技发展有限公司 Method for ensuring determining accessorial word stock needs to be loaded, device thereof and input method system thereof
US20100153324A1 (en) * 2008-12-12 2010-06-17 Downs Oliver B Providing recommendations using information determined for domains of interest
CN102314440A (en) * 2010-06-30 2012-01-11 百度在线网络技术(北京)有限公司 Method for maintaining language model base by using network and system
CN102426591A (en) * 2011-10-31 2012-04-25 北京百度网讯科技有限公司 Method and device for operating corpus used for inputting contents

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399906B (en) * 2013-07-29 2015-07-29 百度在线网络技术(北京)有限公司 The method and apparatus of candidate word is provided based on social relationships when inputting
CN103399906A (en) * 2013-07-29 2013-11-20 百度在线网络技术(北京)有限公司 Method and device for providing candidate words on the basis of social relationships during input
CN104360759A (en) * 2014-11-21 2015-02-18 百度在线网络技术(北京)有限公司 Candidate character sequencing method and device as well as character input method and equipment
WO2016078408A1 (en) * 2014-11-21 2016-05-26 百度在线网络技术(北京)有限公司 Candidate character sequencing method, device and character inputting method, apparatus
WO2016154829A1 (en) * 2015-03-27 2016-10-06 海天科技控股公司 Data processing method and system
CN106575148A (en) * 2015-03-27 2017-04-19 海天科技控股公司 Data processing method and system
US11797822B2 (en) 2015-07-07 2023-10-24 Microsoft Technology Licensing, Llc Neural network having input and hidden layers of equal units
CN107850992A (en) * 2015-10-13 2018-03-27 谷歌有限责任公司 Automatic batch voice command
CN106598265A (en) * 2015-10-15 2017-04-26 阿尔派株式会社 Character input apparatus and candidate character sequence control method therefor
CN106933880A (en) * 2015-12-31 2017-07-07 阿里巴巴集团控股有限公司 A kind of label data leaks channel detection method and device
US11080427B2 (en) 2015-12-31 2021-08-03 Alibaba Group Holding Limited Method and apparatus for detecting label data leakage channel
US10678946B2 (en) 2015-12-31 2020-06-09 Alibaba Group Holding Limited Method and apparatus for detecting label data leakage channel
CN106933880B (en) * 2015-12-31 2020-08-11 阿里巴巴集团控股有限公司 Label data leakage channel detection method and device
CN105955502A (en) * 2016-04-22 2016-09-21 北京指尖乐动科技有限公司 Display method and apparatus according to character association keyboard
CN109863488A (en) * 2016-10-24 2019-06-07 微软技术许可有限责任公司 The device/server of Neural Network Data input system is disposed
CN109863488B (en) * 2016-10-24 2023-08-29 微软技术许可有限责任公司 Device/server deployment of neural network data input systems
CN109426354A (en) * 2017-08-25 2019-03-05 北京搜狗科技发展有限公司 A kind of input method, device and the device for input
CN109426354B (en) * 2017-08-25 2022-07-12 北京搜狗科技发展有限公司 Input method, device and device for input
CN109032375A (en) * 2018-06-29 2018-12-18 北京百度网讯科技有限公司 Candidate text sort method, device, equipment and storage medium
CN109710087A (en) * 2018-12-28 2019-05-03 北京金山安全软件有限公司 Input method model generation method and device
CN110456921A (en) * 2019-08-01 2019-11-15 吉旗(成都)科技有限公司 Predict the method and device of user's keyboard operation behavior
CN111984131B (en) * 2020-07-07 2021-05-14 北京语言大学 Method and system for inputting information based on dynamic weight
CN111984131A (en) * 2020-07-07 2020-11-24 北京语言大学 Method and system for inputting information based on dynamic weight
CN112698736A (en) * 2020-12-31 2021-04-23 上海臣星软件技术有限公司 Information output method, information output device, electronic equipment and computer storage medium

Also Published As

Publication number Publication date
CN103870000B (en) 2018-12-14

Similar Documents

Publication Publication Date Title
CN103870000A (en) Method and device for sorting candidate items generated by input method
CN111324728B (en) Text event abstract generation method and device, electronic equipment and storage medium
CN106649818B (en) Application search intention identification method and device, application search method and server
CN106682192B (en) Method and device for training answer intention classification model based on search keywords
US10217058B2 (en) Predicting interesting things and concepts in content
CN103514299B (en) Information search method and device
CN109408622B (en) Statement processing method, device, equipment and storage medium
CN103870001A (en) Input method candidate item generating method and electronic device
US9846836B2 (en) Modeling interestingness with deep neural networks
CN110674271B (en) Question and answer processing method and device
CN105354183A (en) Analytic method, apparatus and system for internet comments of household electrical appliance products
CN103869998A (en) Method and device for sorting candidate items generated by input method
CN104915446A (en) Automatic extracting method and system of event evolving relationship based on news
CN103123633A (en) Generation method of evaluation parameters and information searching method based on evaluation parameters
Sharma et al. NIRMAL: Automatic identification of software relevant tweets leveraging language model
CN103491205A (en) Related resource address push method and device based on video retrieval
CN104102721A (en) Method and device for recommending information
CN106570180A (en) Artificial intelligence based voice searching method and device
CN104331449A (en) Method and device for determining similarity between inquiry sentence and webpage, terminal and server
CN102609424B (en) Method and equipment for extracting assessment information
WO2009026850A1 (en) Domain dictionary creation
CN111324771B (en) Video tag determination method and device, electronic equipment and storage medium
CN111144120A (en) Training sentence acquisition method and device, storage medium and electronic equipment
Arumugam et al. Hands-On Natural Language Processing with Python: A practical guide to applying deep learning architectures to your NLP applications
KR101541306B1 (en) Computer enabled method of important keyword extraction, server performing the same and storage media storing the same

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant