CN104182390A - Method and system for personalizing user information - Google Patents

Info

Publication number
CN104182390A
Authority
CN
China
Prior art keywords
word
participle
user
model
compound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410400307.0A
Other languages
Chinese (zh)
Other versions
CN104182390B (en)
Inventor
吴先超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201410400307.0A
Publication of CN104182390A
Application granted
Publication of CN104182390B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a method and system for personalizing user information. The method comprises the following steps: acquiring a compound word which does not exist in a word segmentation model of a user; segmenting the compound word to determine the part of speech of each segmented word according to an existing first corpus which is a training corpus in the word segmentation model and is determined according to the historical information of the user; determining the part of speech of the compound word according to the part of speech of each segmented word and a pre-established mapping table which is used for indicating the corresponding relation between the series of the part of speech of the segmented words and the part of speech of the integral compound word; correspondingly saving the compound word and the part of speech of the compound word in the word segmentation model, so as to obtain the personalized word segmentation model of the user. According to the method provided by the embodiment of the invention, the part of speech of the compound word can be determined through the part of speech of each segmented word and the mapping table to obtain the personalized word segmentation model of the user, so that the word segmentation efficiency can be improved, and various use requirements of different users can be met.

Description

Method and system for personalizing user information
Technical field
The present invention relates to the technical field of input methods, and in particular to a method and system for personalizing user information.
Background
Typing habits vary from user to user, and different users segment their input into words differently. Users of languages such as Chinese and Japanese generally confirm their intended input with a conversion key such as the space bar.
Taking Japanese as an example, some users like to type a long run of kana in one go, covering a compound word together with the particles before and after it. Other users are more conservative: they first type the kana of a single compound word, press the conversion key, and only then type the particles that follow.
Existing input methods are built mainly around the portmanteau words and compound words that general users use frequently. They do not take each user's individual needs and input habits into account, so they cannot meet the needs of different users, which degrades the user experience.
Summary of the invention
The present invention aims to solve at least one of the above technical defects.
To this end, one object of the present invention is to provide a method for personalizing user information.
Another object of the present invention is to provide a system for personalizing user information.
To achieve the above objects, an embodiment of one aspect of the present invention provides a method for personalizing user information, comprising the following steps: acquiring a compound word that does not exist in the user's word segmentation model; segmenting the compound word according to an existing first corpus and determining the part of speech of each segmented word, where the first corpus is the training corpus of the word segmentation model and is determined according to the user's historical information; determining the part of speech of the compound word according to the part of speech of each segmented word and a pre-established mapping table, where the mapping table indicates the correspondence between a part-of-speech sequence and an overall part of speech; and saving the compound word together with its part of speech in the word segmentation model, thereby obtaining the user's personalized word segmentation model.
According to the method of this embodiment, the part of speech of a compound word is determined from the part of speech of each segmented word and the mapping table, yielding the user's personalized word segmentation model. This improves segmentation efficiency and meets the varied needs of different users.
In one embodiment of the invention, after the user's personalized word segmentation model is obtained, the method further comprises: acquiring a second corpus, the second corpus being the training corpus of an input method model; and re-segmenting the second corpus according to the personalized word segmentation model to obtain the user's personalized input method model.
In one embodiment of the invention, after the user's personalized input method model is obtained, the method further comprises: receiving a character input by the user; and, according to the input character and the personalized input method model, showing the user the words corresponding to the character, the words comprising at least one word.
In one embodiment of the invention, the method further comprises: collecting compound words and annotating each collected compound word with its overall part of speech; obtaining the segmented words of each collected compound word and their parts of speech, which gives a part-of-speech sequence made up of the parts of speech of the segmented words; and establishing the correspondence between the part-of-speech sequence and the overall part of speech to obtain the mapping table.
In one embodiment of the invention, acquiring the second corpus comprises: obtaining, from the user log file, compound words whose frequency of use is greater than a preset threshold as the second corpus.
In one embodiment of the invention, re-segmenting the second corpus according to the personalized word segmentation model comprises: if the second corpus contains a first part, the first part consists of at least two words of a preset granularity, and the first part is in the personalized word segmentation model, treating the first part as one compound word.
In one embodiment of the invention, showing the user the words corresponding to the character according to the input character and the personalized input method model comprises: if, according to a preset probabilistic algorithm, the output probability of the first part is the largest when the input is the character, showing the first part to the user as a single compound word, where the first part and its corresponding part of speech take part in the probabilistic algorithm as a whole.
An embodiment of another aspect of the present invention provides a system for personalizing user information, comprising: a first acquisition module, configured to acquire a compound word that does not exist in the user's word segmentation model; a part-of-speech determination module, configured to segment the compound word according to an existing first corpus and determine the part of speech of each segmented word, where the first corpus is the training corpus of the word segmentation model and is determined according to the user's historical information; a mapping table establishing module, configured to determine the part of speech of the compound word according to the part of speech of each segmented word and a pre-established mapping table, where the mapping table indicates the correspondence between a part-of-speech sequence and an overall part of speech; and a segmentation model establishing module, configured to save the compound word together with its part of speech in the word segmentation model, thereby obtaining the user's personalized word segmentation model.
According to the system of this embodiment, the part of speech of a compound word is determined from the part of speech of each segmented word and the mapping table, yielding the user's personalized word segmentation model. This improves segmentation efficiency and meets the varied needs of different users.
In one embodiment of the invention, the system further comprises: a second acquisition module, configured to acquire a second corpus, the second corpus being the training corpus of an input method model; and an input method model generation module, configured to re-segment the second corpus according to the personalized word segmentation model to obtain the user's personalized input method model.
In one embodiment of the invention, the input method model generation module is further configured to show the user, according to a character input by the user and the personalized input method model, the words corresponding to the character, the words comprising at least one word.
In one embodiment of the invention, the mapping table establishing module is configured to annotate each collected compound word with its overall part of speech, obtain the segmented words of each collected compound word and their parts of speech to form a part-of-speech sequence, and establish the correspondence between the part-of-speech sequence and the overall part of speech to obtain the mapping table.
In one embodiment of the invention, the second acquisition module obtains, from the user log file, compound words whose frequency of use is greater than a preset threshold as the second corpus.
In one embodiment of the invention, when the second corpus contains a first part, the first part consists of at least two words of a preset granularity, and the first part is in the personalized word segmentation model, the input method model generation module treats the first part as one compound word.
In one embodiment of the invention, when the output probability of the first part is the largest for the input character, the input method model generation module shows the first part to the user as a single compound word, where the first part and its corresponding part of speech take part in the probabilistic algorithm as a whole.
Additional aspects and advantages of the present invention will be given in part in the following description; in part they will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for personalizing user information according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the mapping between the elements of a compound word and the compound word according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the mapping between the elements of a compound word and the compound word according to another embodiment of the invention;
Fig. 4 is a schematic diagram of the calculation of transition probabilities between parts of speech according to an embodiment of the invention;
Fig. 5 is a schematic diagram of the process of adding a new compound word to the segmentation corpus according to an embodiment of the invention;
Fig. 6 is a schematic diagram of predicting kana input for users with different input habits;
Fig. 7 is a schematic diagram of a Japanese input method on a mobile terminal according to an embodiment of the invention;
Fig. 8 is a structural block diagram of a system for personalizing user information according to an embodiment of the invention; and
Fig. 9 is a structural block diagram of a system for personalizing user information according to another embodiment of the invention.
Detailed description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and shall not be construed as limiting it.
In the description of the invention, it should be understood that orientation or position terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "up", "down", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inside", "outside", "clockwise" and "counterclockwise" are based on the orientations or positions shown in the drawings. They are used only to simplify the description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, so they shall not be construed as limiting the invention.
In addition, the terms "first" and "second" are used for description only and shall not be understood as indicating or implying relative importance or the number of the technical features referred to. A feature qualified by "first" or "second" may thus explicitly or implicitly include one or more such features. In the description of the invention, "multiple" means two or more, unless specifically limited otherwise.
In the present invention, unless otherwise clearly specified and limited, terms such as "installed", "linked", "connected" and "fixed" should be interpreted broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or internal between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific situation.
In the present invention, unless otherwise clearly specified and limited, a first feature being "on" or "under" a second feature may mean that the two features are in direct contact, or that they are not in direct contact but touch through another feature between them. Moreover, a first feature being "on", "above" or "over" a second feature covers the first feature being directly above or obliquely above the second feature, or merely means that the first feature is at a higher level than the second feature. A first feature being "under", "below" or "beneath" a second feature covers the first feature being directly below or obliquely below the second feature, or merely means that the first feature is at a lower level than the second feature.
Fig. 1 is a flowchart of a method for personalizing user information according to an embodiment of the invention. As shown in Fig. 1, the method comprises the following steps:
Step S101: acquire a compound word that does not exist in the user's word segmentation model.
Step S102: segment the compound word according to an existing first corpus and determine the part of speech of each segmented word. The first corpus is the training corpus of the word segmentation model and is determined according to the user's historical information.
Step S103: determine the part of speech of the compound word according to the part of speech of each segmented word and a pre-established mapping table. The mapping table indicates the correspondence between a part-of-speech sequence and an overall part of speech.
Specifically, compound words are collected and each collected compound word is annotated with its overall part of speech. The segmented words of each collected compound word and their parts of speech are obtained, which gives a part-of-speech sequence made up of the parts of speech of the segmented words. The correspondence between the part-of-speech sequence and the overall part of speech is then established to obtain the mapping table.
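The construction of the mapping table described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_mapping_table`, the hyphen-joined tag strings and the majority-vote rule for resolving conflicting annotations are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_mapping_table(annotated_compounds):
    """Map each part-of-speech sequence to the overall part of speech
    most often annotated for it (majority vote is an assumption)."""
    votes = defaultdict(Counter)
    for segment_pos, overall_pos in annotated_compounds:
        votes[tuple(segment_pos)][overall_pos] += 1
    return {seq: counter.most_common(1)[0][0] for seq, counter in votes.items()}


# Hypothetical annotations: (part of speech of each segment, overall part of speech).
annotated = [
    (["名詞-サ変接続", "名詞-一般", "名詞-サ変接続"], "名詞-固有名詞-一般"),
    (["名詞-一般", "名詞-サ変接続"], "名詞-一般"),
    (["名詞-一般", "名詞-サ変接続"], "名詞-一般"),
]
mapping_table = build_mapping_table(annotated)
```

Keying the table on a tuple of tags keeps the lookup in step S103 a single dictionary access.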
Specifically, consider a character string input by the user, "求人情報検索,1288,1288,3273,名詞,固有名詞,一般,*,*,*,求人情報検索,キュウジンジョウホウケンサク,キュウジンジョウホウケンサク". The compound word "求人情報検索" (job information search) consists of the three elements shown in Table 1:
求人    名詞,サ変接続,*,*,*,*,求人,キュウジン,キュージン
情報    名詞,一般,*,*,*,*,情報,ジョウホウ,ジョーホー
検索    名詞,サ変接続,*,*,*,*,検索,ケンサク,ケンサク
Table 1
For the compound word "求人情報検索", the mapping between its elements and the compound word is established as shown in Fig. 2.
For a character string input by the user, "情報検索,1285,1285,2209,名詞,一般,*,*,*,*,情報検索,ジョウホウケンサク,ジョウホウケンサク", the compound word "情報検索" (information search) consists of the two elements shown in Table 2:
情報    名詞,一般,*,*,*,*,情報,ジョウホウ,ジョーホー
検索    名詞,サ変接続,*,*,*,*,検索,ケンサク,ケンサク
Table 2
For the compound word "情報検索", a mapping is established from the part-of-speech sequence of the elements "情報" and "検索" to the compound word. The mapping between the elements and the compound word is shown in Fig. 3.
Because the part of speech of a compound word is associated with the parts of speech of its elements through the mapping, the compound word no longer needs to be subdivided. This simplifies the calculation of the transition probabilities in the segmentation model and improves both the accuracy of converting input kana into the best candidate compound word and the accuracy of prediction. Here, a transition probability is the probability of moving from an element to a compound word related to that element. For the element "情報", the transition probability between "情報" and the compound word "情報検索" is the probability that the user commits "情報検索" after inputting "情報".
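One simple way to estimate such an element-to-compound transition probability is a relative frequency over the words the user has committed. The sketch below assumes this estimator; the source does not state how the probability is computed, and `transition_probabilities` and the sample history are hypothetical.

```python
from collections import Counter


def transition_probabilities(element, committed_words):
    """Relative frequency with which each committed word that starts
    with `element` was chosen (an assumed estimator)."""
    counts = Counter(w for w in committed_words if w.startswith(element))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


# Hypothetical commit history taken from a user log.
history = ["情報検索", "情報検索", "情報", "情報検索", "検索"]
probs = transition_probabilities("情報", history)
```

With this history, three of the four commits that begin with "情報" are "情報検索", so its estimated transition probability is 0.75.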
Step S104: save the compound word together with its part of speech in the word segmentation model to obtain the user's personalized word segmentation model.
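Steps S101 to S104 can be put together in a short sketch. It makes several assumptions that are not in the source: the segmentation model is reduced to a plain dictionary `lexicon` from word to part of speech, segmentation uses greedy longest match as a stand-in for the trained segmenter, and the names `segment` and `add_compound` are hypothetical.

```python
def segment(word, lexicon):
    """Greedy longest-match segmentation using the lexicon entries
    (a stand-in for the trained segmentation model)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return pieces


def add_compound(word, lexicon, mapping_table):
    """Segment a new compound, map its part-of-speech sequence to an
    overall part of speech, and save the result in the lexicon."""
    if word in lexicon:                                # S101: only unknown compounds qualify
        return lexicon[word]
    pieces = segment(word, lexicon)                    # S102: segment the compound
    pos_sequence = tuple(lexicon[p] for p in pieces)   # S102: part of speech per segment
    overall_pos = mapping_table[pos_sequence]          # S103: look up the overall part of speech
    lexicon[word] = overall_pos                        # S104: save into the model
    return overall_pos


# Hypothetical model contents.
lexicon = {"求人": "名詞-サ変接続", "情報": "名詞-一般", "検索": "名詞-サ変接続"}
mapping_table = {("名詞-サ変接続", "名詞-一般", "名詞-サ変接続"): "名詞-固有名詞-一般"}
result = add_compound("求人情報検索", lexicon, mapping_table)
```

After the call, the compound is itself a lexicon entry, so later segmentation keeps it whole.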
After the user's personalized word segmentation model is obtained, compound words whose frequency of use is greater than a preset threshold are obtained from the user log file as the training corpus of the input method model, that is, the second corpus. The second corpus is then re-segmented according to the personalized word segmentation model to obtain the user's personalized input method model. If the second corpus contains a first part that consists of at least two words of a preset granularity and that is in the personalized word segmentation model, the first part is treated as one compound word.
In one embodiment of the invention, multiple compound words whose frequency of use is greater than the threshold can be obtained from the user log file. The threshold can be set according to the user's needs. In Japanese, compound words include compound nouns, compound verbs and so on. For example, when the usage probability of the compound noun "求人情報検索" is 12% and the threshold is 10%, "求人情報検索" is selected from the user log file.
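The threshold-based selection from the user log can be sketched as below. The log is assumed to be a flat list of committed words and the usage probability is assumed to be a simple share of all log entries; `frequent_compounds` is a hypothetical helper name.

```python
from collections import Counter


def frequent_compounds(log_entries, candidates, threshold):
    """Return the candidate compounds whose share of all log entries
    is greater than `threshold`."""
    counts = Counter(log_entries)
    total = len(log_entries)
    if total == 0:
        return []
    return [w for w in candidates if counts[w] / total > threshold]


# Hypothetical log: 3 of 25 entries (12%) are the compound noun.
log = ["求人情報検索"] * 3 + ["情報"] * 22
selected = frequent_compounds(log, ["求人情報検索", "検索"], threshold=0.10)
```

Here the compound noun reaches 12% and passes the 10% threshold, while a candidate that never appears in the log is dropped.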
In one embodiment of the invention, a character input by the user is received, and the words corresponding to the character are shown to the user according to the input character and the personalized input method model, the words comprising at least one word. If, according to a preset probabilistic algorithm, the output probability of the first part is the largest when the input is the character, the first part is shown to the user as a single compound word, where the first part and its corresponding part of speech take part in the probabilistic algorithm as a whole.
The segmentation model includes the transition probabilities between the elements of a compound word and the compound word itself. According to these transition probabilities, the kana or kanji of the corresponding compound word is added to the candidate list for the user to choose from.
In one embodiment of the invention, the user's personalized input method model includes transition probabilities, and the compound word whose transition probability is greater than a preset value is given by

$$\hat{y} = \arg\max_{y} P(y \mid x) = \arg\max_{y} P(y)\,P(x \mid y)$$

where $x$ is the kana sequence input by the user, $y$ is a compound word returned to the user, $\hat{y}$ is the compound word for which the transition probability $P(y \mid x)$ is largest and greater than the preset value, $P(y)$ is the probability that this compound word is $y$, and $P(x \mid y)$ is the probability that the pronunciation of the compound word $y$ is $x$.
Suppose $y$ contains $n$ kana or kanji words, that is, $y = w_1, w_2, \ldots, w_n$. Then $P(y)$ and $P(x \mid y)$ can be decomposed as

$$P(y) = \prod_{i=1}^{n} P(w_i \mid c_i)\,P(c_i \mid c_{i-1}), \qquad P(x \mid y) = \prod_{i=1}^{n} P(r_i \mid w_i)$$

where $w_i$ is a kanji or kana word of the compound word, $c_i$ is the part of speech of $w_i$, $r_i$ is the kana pronunciation of $w_i$, $P(w_i \mid c_i)$ is the transition probability from the part of speech to the word, $P(c_i \mid c_{i-1})$ is the transition probability from part of speech $c_{i-1}$ to $c_i$, and $P(r_i \mid w_i)$ is the probability that the pronunciation of $w_i$ is $r_i$.
For example, consider several compound words that share the element "求人", such as "求人情報" and "求人情報検索". The segmentation model and the user's input habits are used to judge the probability of moving from the element "求人" to committing "求人情報" and the probability of moving to committing "求人情報検索". When the probability of committing "求人情報検索" is larger, "求人情報検索" is placed above "求人情報" in the candidate list for the user's convenience.
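The factors P(w_i|c_i), P(c_i|c_{i-1}) and P(r_i|w_i) described above can be combined into a score for each candidate, as in the following sketch. The probability tables, the single tag "N", the sentence-start tag "BOS" standing in for c_0, and the function names are all hypothetical; log probabilities are used only to avoid underflow.

```python
import math


def log_score(candidate, emit, trans, read, start="BOS"):
    """Sum of log P(c_i|c_{i-1}) + log P(w_i|c_i) + log P(r_i|w_i)
    over the (word, part of speech, reading) triples of a candidate."""
    total, prev = 0.0, start
    for word, pos, reading in candidate:
        total += math.log(trans[(prev, pos)])
        total += math.log(emit[(word, pos)])
        total += math.log(read[(reading, word)])
        prev = pos
    return total


def best_candidate(kana, candidates, emit, trans, read):
    """Pick the highest-scoring candidate whose readings spell `kana`."""
    matching = [c for c in candidates if "".join(r for _, _, r in c) == kana]
    return max(matching, key=lambda c: log_score(c, emit, trans, read))


# Hypothetical probability tables.
emit = {("情報検索", "N"): 0.02, ("情報", "N"): 0.05, ("検索", "N"): 0.04}
trans = {("BOS", "N"): 0.5, ("N", "N"): 0.3}
read = {("じょうほうけんさく", "情報検索"): 1.0,
        ("じょうほう", "情報"): 1.0,
        ("けんさく", "検索"): 1.0}
candidates = [
    [("情報検索", "N", "じょうほうけんさく")],
    [("情報", "N", "じょうほう"), ("検索", "N", "けんさく")],
]
best = best_candidate("じょうほうけんさく", candidates, emit, trans, read)
```

With these numbers the single-word reading scores 0.5 × 0.02 = 0.01 against 0.5 × 0.05 × 0.3 × 0.04 = 0.0003 for the two-word reading, so the compound is preferred as a whole.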
After several elements are combined into one compound word, the compound word corresponds to a sequence of several parts of speech. The transition probabilities between parts of speech therefore need to be recalculated, so that the transition probability of the compound word can be obtained from the part-of-speech sequence formed by those parts of speech.
Fig. 4 is a schematic diagram of the calculation of transition probabilities between parts of speech according to an embodiment of the invention. As shown in Fig. 4, a compound word "w1w2w3" consists of three elements "w1", "w2" and "w3", whose parts of speech are "c1", "c2" and "c3" respectively. After compound word extraction, the training corpus used by the segmentation model is as shown in Table 3:
w0 c0
w1w2w3 new.c123
w4 c4
Table 3
After this processing, when w1w2w3 appears again, the probability that it is treated as one compound word rather than three words increases. A new segmentation model is built in this way (the segmentation model is trained with the existing conditional random field method).
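The corpus rewrite that leads to Table 3 can be sketched as a simple merge over a tagged token list. The representation of the corpus as (word, part of speech) pairs and the helper name `merge_compound` are assumptions for illustration.

```python
def merge_compound(tokens, parts, new_pos):
    """Replace each run of `parts` in a tagged token list with one
    compound token carrying the part of speech `new_pos`."""
    n, out, i = len(parts), [], 0
    while i < len(tokens):
        if [w for w, _ in tokens[i:i + n]] == list(parts):
            out.append(("".join(parts), new_pos))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


tagged = [("w0", "c0"), ("w1", "c1"), ("w2", "c2"), ("w3", "c3"), ("w4", "c4")]
merged = merge_compound(tagged, ["w1", "w2", "w3"], "new.c123")
```

Retraining the segmenter on the merged corpus is what raises the probability of keeping w1w2w3 together.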
Because w1, w2 and w3 may also occur as independent words in the training corpus, w1w2w3 may be segmented either as one compound word, "w1w2w3", or as three words, "w1 w2 w3". Which segmentation is chosen depends on how often w1, w2 and w3 each occur independently and how often w1w2w3 occurs as a single word. In embodiments of the invention, the segmentation in which w1w2w3 is one compound word is used for training.
Thus, after the above processing, when w1w2w3 is encountered again, the probability of treating it as one word rather than three increases. A new segmentation model is built from the corpus processed in this way, and it is trained with the conditional random field method.
Since w1, w2 and w3 can still occur as independent words in the training corpus, w1w2w3 has some probability of being one word, of being two words (for example "w1w2 w3"), and of being three words ("w1 w2 w3"). Which segmentation is chosen depends on how often w1, w2 and w3 each occur independently and how often w1w2w3 occurs as a single word.
Specifically, a new input method model is trained on the re-segmented result (with w1w2w3 as one word), as shown in Fig. 4. Conditional random fields (CRFs) are used for this. A CRF is an undirected graphical model optimized over a global probability model. In a linear-chain CRF applied to word segmentation, the whole observation sequence acts on every state, and the transitions between states are undirected.
Like a Markov random field, a conditional random field is an undirected graphical model: the vertices of the graph represent random variables and the edges between vertices represent dependencies between them. In a CRF, the distribution of the random variable Y is a conditional probability given the observed random variable X. In principle the graph layout of a CRF can be arbitrary, but the commonly used layout is a chain, for which efficient algorithms exist for training, inference and decoding. A random field can be regarded as a set of random variables defined over the same sample space. When every position is randomly assigned a value according to some distribution, the whole is called a random field.
A Markov random field (MRF) corresponds to an undirected graph. Each node of the graph corresponds to a random variable, and an edge between two nodes indicates a probabilistic dependency between the corresponding variables. The structure of an MRF therefore essentially reflects prior knowledge: which variables have dependencies that must be considered and which can be ignored.
An MRF has the Markov property: factors far from the current factor have little influence on it and can be ignored, where the distance is determined according to the specific situation.
Suppose that, in a given MRF, each random variable also has an observed value. The distribution of the MRF under the given set of observations, that is, the conditional distribution, then needs to be determined, and such an MRF is called a CRF. Its conditional distribution has a form very similar to that of an MRF, with only the observation set x added.
Viewed globally, a CRF is in essence an MRF given a set of observations.
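For reference, a linear-chain CRF of the kind used for word segmentation is commonly written in the following textbook form. The feature functions $f_k$, the weights $\lambda_k$ and the sequence length $T$ are standard notation introduced here for illustration and are not defined in the source.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Each feature function may inspect the whole observation sequence $x$, which matches the statement that the entire observation sequence acts on every state, and the normalizer $Z(x)$ is computed over whole label sequences, which is what makes the model globally optimized.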
In one embodiment of the invention, compound words include one or more of portmanteau words and user-specific terms. Portmanteau words are similar to Chinese expressions such as "喜大普奔" and "高大上", idioms and two-part allegorical sayings. User-specific terms reflect how a user habitually commits text: some users type many kana before committing, while others commit several times, each time after converting a few kana. In one embodiment of the invention, user-specific terms are words that the user habitually types together, or abbreviated input words. In embodiments of the invention, compound words, portmanteau words and user-specific terms are collectively referred to as compound words. When the user inputs an element, the input forms a kana sequence made up of several kana; for example, for "求人" the input kana sequence is "きゅうじん".
In one embodiment of the invention, Japanese is briefly introduced as follows. Japanese writing includes kana (divided into hiragana and katakana), kanji and romaji. Kana and kanji are illustrated with the following example: これは日本語のテキストです。 (Translation: This is a Japanese textbook.) In this sentence, "これは", "の" and "です" are hiragana. Hiragana is a very important part of Japanese and can form words directly. "日本語" is kanji, and its hiragana reading is "にほんご". Kana in Japanese are similar to Chinese pinyin (although they are not truly pinyin): the kana show how a word is read. When "にほんご" is typed, the space bar converts the hiragana "にほんご" into "日本語". Katakana generally represents loanwords or foreign names; in the example above, "テキスト" is katakana. Therefore, to input kanji, the corresponding kana is typed first and then converted into the kanji.
Fig. 5 is a schematic diagram of the process of adding a new compound word to the segmentation corpus according to an embodiment of the invention. As shown in Fig. 5, the user log file is scanned periodically and statistics are collected to determine new compound words. A segmentation tool performs segmentation, part-of-speech tagging and kana annotation on each new compound word to obtain the mapping table (that is, the mapping relations) between the new compound word and its elements. The segmentation model is updated according to the mapping table. After the update, when the user inputs the kana of an element of the new compound word, the kana or kanji of the new compound word is shown in the candidate list according to the updated segmentation model for the user to choose from.
Fig. 6 is a schematic diagram of kana prediction for users with different input habits. As shown in Fig. 6, for the more aggressive user 1 (who, relative to user 2, types longer kana sequences before converting), conversion and prediction by the input method are both more aggressive: longer compound-word candidates are ranked first. For example, when user 1 types "じょうほうけんさく", the corresponding compound word "情報検索" is returned to user 1 after conversion. For the more conservative user 2 (who, relative to user 1, types shorter kana sequences before converting), conversion and prediction are both more conservative: shorter words are ranked nearer the front. For example, when user 2 types "じょうほう", the corresponding words "情報" and "情報検索" are returned to user 2 after conversion, with "情報" ranked ahead of "情報検索".
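The ordering behaviour of Fig. 6 can be illustrated with a short sketch. It assumes, purely for illustration, that a user's style is reduced to a single boolean derived from the average number of kana typed before each conversion; the names `is_aggressive`, `rank_candidates`, and the threshold value are hypothetical. The candidates are "情報" (read じょうほう) and "情報検索" (read じょうほうけんさく).

```python
def is_aggressive(kana_lengths, threshold=6.0):
    """Guess the user's style from the kana counts typed before each conversion."""
    if not kana_lengths:
        return False  # no history: fall back to conservative ordering
    return sum(kana_lengths) / len(kana_lengths) >= threshold

def rank_candidates(candidates, prefers_long):
    """Longest candidates first for aggressive users, shortest first otherwise."""
    return sorted(candidates, key=len, reverse=prefers_long)

candidates = ["情報", "情報検索"]

# User 1 types long kana runs before converting, user 2 short ones.
user1_order = rank_candidates(candidates, is_aggressive([9, 8, 10]))
user2_order = rank_candidates(candidates, is_aggressive([4, 3, 5]))
```

A real input method would combine length preference with model probabilities rather than sorting by length alone; the sketch isolates only the habit-dependent ordering.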
Fig. 7 is a schematic diagram of a Japanese input method on a mobile terminal according to an embodiment of the invention. As shown in Fig. 7, for Japanese input on a nine-grid keypad, when the user presses and holds "あ", the pop-up shown on the right of the figure appears, with the other four kana of the "あ" row placed in the four positions above, below, left, and right of "あ". When "あ->->" is entered in sequence, the kana "あ" is matched and its corresponding candidates are listed.
According to the method for the embodiment of the present invention, the product word by each participle and mapping table are determined the product word of compound word, to obtain user's personalized participle model, thereby can improve participle efficiency, meet the various user demands of different user.
Fig. 8 is the structured flowchart that according to an embodiment of the invention user profile is carried out the system of personalisation process.As shown in Figure 8, comprise according to the system that user profile is carried out to personalisation process of the embodiment of the present invention: the first acquisition module 100, product word determination module 200, mapping table sets up module 300 and participle model is set up module 400.
Particularly, the compound word that the first acquisition module 100 does not have for obtaining user's participle model.Product word determination module 200, for according to existing the first language material, carries out the product word of participle definite each participle to compound word, the first language material is the corpus in participle model, and existing the first language material is to determine according to user's historical information.Mapping table is set up module 300 for according to the product word of each participle and the mapping table of setting up in advance, determines the product word of compound word, and mapping table is for showing the corresponding relation between product word string and overall product word.Participle model is set up module 400 for the corresponding participle model that is kept at of product word with compound word by compound word, obtains user's personalized participle model.
In one embodiment of the invention, the mapping table establishing module 300 annotates each collected compound word with its overall part of speech, obtains the segmented words of the collected compound word and their parts of speech, and forms a part-of-speech sequence from the parts of speech of the segmented words. It then establishes the correspondence between the part-of-speech sequence and the overall part of speech to obtain the mapping table.
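A minimal sketch of building and using such a mapping table follows. The tag names (`"noun"`, `"verbal_noun"`, `"particle"`), the function names, and the majority-vote rule for conflicting annotations are assumptions for illustration; the source only states that a part-of-speech sequence is mapped to an overall part of speech.

```python
from collections import Counter, defaultdict

def build_mapping_table(annotated_compounds):
    """Map each part-of-speech sequence to its most frequent overall tag.

    annotated_compounds: iterable of (pos_sequence, overall_pos) pairs taken
    from manually annotated compound words.
    """
    votes = defaultdict(Counter)
    for pos_sequence, overall_pos in annotated_compounds:
        votes[tuple(pos_sequence)][overall_pos] += 1
    return {seq: tags.most_common(1)[0][0] for seq, tags in votes.items()}

def compound_pos(table, segment_pos):
    """Look up the overall tag for a compound's segment tags; None if unseen."""
    return table.get(tuple(segment_pos))

# Hypothetical annotations: two compounds built as noun + verbal noun.
annotations = [
    (["noun", "verbal_noun"], "noun"),
    (["noun", "verbal_noun"], "noun"),
    (["noun", "particle", "noun"], "noun"),
]
table = build_mapping_table(annotations)
tag = compound_pos(table, ["noun", "verbal_noun"])
```

Keying the table on a tuple of tags keeps the lookup independent of the surface words, so a compound never seen before can still receive an overall part of speech if its tag sequence has been seen.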
For a character string input by the user, "情報検索,1285,1285,2209,名詞,一般,*,*,*,*,情報検索,ジョウホウケンサク,ジョウホウケンサク", the compound word "情報検索" is formed from two elements, as shown in Table 2. For this compound word, the mapping table establishing module 300 establishes the mapping from the part-of-speech sequence of the elements "情報" and "検索" to the compound word. The mapping table that the mapping table establishing module 300 establishes between the components of the compound word and the compound word itself is shown in Figure 3.
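The comma-separated entry above has thirteen fields. The sketch below reads it into named fields, assuming the layout matches the CSV dictionary format used by MeCab's IPA dictionary (surface, left and right context IDs, cost, part-of-speech fields, base form, reading, pronunciation). The patent text does not name the format, so the field names in `FIELDS` are an assumption.

```python
import csv

# Assumed field layout (MeCab IPA-dictionary style); not stated in the source.
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_type", "conj_form",
    "base_form", "reading", "pronunciation",
]

def parse_entry(line):
    """Split one dictionary line into a dict keyed by the assumed field names."""
    row = next(csv.reader([line]))
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
    return dict(zip(FIELDS, row))

line = "情報検索,1285,1285,2209,名詞,一般,*,*,*,*,情報検索,ジョウホウケンサク,ジョウホウケンサク"
entry = parse_entry(line)
```

Under this reading, the compound is stored as a single general noun (名詞, 一般) with its own cost and katakana reading, rather than as two separate entries.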
The word segmentation model includes transition probabilities. Because the mapping associates the part of speech of a compound word with those of its elements, the compound word need not be split further; instead, the transition probabilities of the word segmentation model improve both the conversion accuracy from the typed kana to the best candidate compound word and the prediction accuracy. The transition probability is the probability of moving from an element to a compound word related to that element. For the element "情報", the transition probability between "情報" and the compound word "情報検索" is the probability that the user selects "情報検索" after typing "情報".
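One simple way to obtain such a transition probability is as a relative frequency over the user's past selections. The sketch below assumes the history is available as (typed element, selected word) pairs; the function name and the data are hypothetical, and a production model would typically smooth the estimate.

```python
from collections import Counter

def transition_probability(selections, element, compound):
    """Relative frequency of selecting `compound` after typing `element`.

    selections: iterable of (typed_element, selected_word) pairs.
    Returns 0.0 when the element has never been typed.
    """
    chosen = Counter(word for typed, word in selections if typed == element)
    total = sum(chosen.values())
    return chosen[compound] / total if total else 0.0

# Hypothetical history: after typing 情報 the user picked 情報検索 three times out of four.
history = [
    ("情報", "情報検索"),
    ("情報", "情報"),
    ("情報", "情報検索"),
    ("情報", "情報検索"),
]
p = transition_probability(history, "情報", "情報検索")
```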
Fig. 9 is a structural block diagram of a system for personalizing user information according to another embodiment of the invention. As shown in Fig. 9, the system for personalizing user information according to the embodiment of the invention further comprises: a second acquisition module 500 and an input method model generation module 600.
Specifically, the second acquisition module 500 acquires a second corpus, which is the training corpus of the input method model. The input method model generation module 600 re-segments the second corpus according to the personalized word segmentation model to obtain the user's personalized input method model.
In one embodiment of the invention, the second acquisition module 500 obtains, from user log files, compound words whose frequency of use exceeds a preset threshold and uses them as the second corpus. The input method model generation module 600 presents to the user the words corresponding to the input characters, according to the characters the user inputs and the personalized input method model; the words comprise at least one word. When the output probability of a first part is the largest for the input characters, the input method model generation module 600 presents the first part to the user as a single compound word, where the first part and its corresponding part of speech participate in the probability algorithm as a whole. In addition, when the second corpus contains a first part that consists of at least two words of the preset granularity and is present in the personalized word segmentation model, the input method model generation module 600 treats the first part as one compound word.
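The re-segmentation rule (a run of at least two fine-grained words that matches a known compound is kept as one unit) can be sketched with a greedy longest-match merge. The greedy strategy, the `max_parts` limit, and the function name `resegment` are assumptions for illustration; the source does not specify how the merge is carried out.

```python
def resegment(tokens, compounds, max_parts=4):
    """Merge adjacent tokens that together spell a known compound word.

    tokens: a sentence already split at the preset (fine) granularity.
    compounds: compound words stored in the personalized segmentation model.
    max_parts: longest run of tokens considered for one compound (assumed).
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest run first, down to runs of two tokens.
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            joined = "".join(tokens[i:i + n])
            if joined in compounds:
                merged.append(joined)
                i += n
                break
        else:
            # No compound starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

result = resegment(["情報", "検索", "の", "方法"], {"情報検索"})
```

After this step the compound contributes one token, with one part of speech, to the counts used to train the input method model.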
In one embodiment of the invention, the second acquisition module 500 periodically scans and counts user log files to determine new compound words. A word segmentation tool performs segmentation, part-of-speech tagging, and kana reading annotation on each new compound word to obtain the mapping table between the new compound word and its corresponding elements. The word segmentation model is updated according to the mapping table.
According to the system of the embodiments of the invention, the part of speech of a compound word is determined from the part of speech of each segmented word and the mapping table, yielding the user's personalized word segmentation model. This improves segmentation efficiency and meets the varied needs of different users.
It should be noted that the processing performed by the modules of the system of the invention corresponds to the method described above and is therefore not repeated here.
Although embodiments of the invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the invention without departing from its principles and spirit.

Claims (14)

1. A method for personalizing user information, characterized by comprising the following steps:
acquiring a compound word, the compound word being one that does not exist in the user's word segmentation model;
segmenting the compound word and determining the part of speech of each segmented word according to an existing first corpus, the first corpus being the training corpus of the word segmentation model, and the existing first corpus being determined according to the user's historical information;
determining the part of speech of the compound word according to the part of speech of each segmented word and a pre-established mapping table, the mapping table indicating the correspondence between part-of-speech sequences and overall parts of speech;
storing the compound word and the part of speech of the compound word correspondingly in the word segmentation model to obtain the user's personalized word segmentation model.
2. The method according to claim 1, characterized in that, after the user's personalized word segmentation model is obtained, the method further comprises:
acquiring a second corpus, the second corpus being the training corpus of an input method model;
re-segmenting the second corpus according to the personalized word segmentation model to obtain the user's personalized input method model.
3. The method according to claim 2, characterized in that, after the user's personalized input method model is obtained, the method further comprises:
receiving characters input by the user;
presenting to the user, according to the input characters and the personalized input method model, the words corresponding to the characters, the words comprising at least one word.
4. The method according to claim 1, characterized by further comprising:
collecting compound words, and annotating the collected compound words with their overall parts of speech;
obtaining the segmented words of the collected compound words and the parts of speech of the segmented words, and obtaining part-of-speech sequences formed from the parts of speech of the segmented words;
establishing the correspondence between the part-of-speech sequences and the overall parts of speech to obtain the mapping table.
5. The method according to claim 2, characterized in that acquiring the second corpus comprises:
obtaining, from user log files, compound words whose frequency of use exceeds a preset threshold as the second corpus.
6. The method according to claim 2, characterized in that re-segmenting the second corpus according to the personalized word segmentation model comprises:
if the second corpus contains a first part, the first part consisting of at least two words of a preset granularity, and the first part is in the personalized word segmentation model, treating the first part as one compound word.
7. The method according to claim 3, characterized in that presenting to the user, according to the input characters and the personalized input method model, the words corresponding to the characters comprises:
if, according to a preset probability algorithm, the output probability of the first part is the largest when the input is the characters, presenting the first part to the user as a single compound word, wherein the first part and the part of speech corresponding to the first part participate in the probability algorithm as a whole.
8. A system for personalizing user information, characterized by comprising:
a first acquisition module, configured to acquire a compound word that does not exist in the user's word segmentation model;
a part-of-speech determination module, configured to segment the compound word and determine the part of speech of each segmented word according to an existing first corpus, the first corpus being the training corpus of the word segmentation model, and the existing first corpus being determined according to the user's historical information;
a mapping table establishing module, configured to determine the part of speech of the compound word according to the part of speech of each segmented word and a pre-established mapping table, the mapping table indicating the correspondence between part-of-speech sequences and overall parts of speech;
a segmentation model establishing module, configured to store the compound word and the part of speech of the compound word correspondingly in the word segmentation model to obtain the user's personalized word segmentation model.
9. The system according to claim 8, characterized by further comprising:
a second acquisition module, configured to acquire a second corpus, the second corpus being the training corpus of an input method model;
an input method model generation module, configured to re-segment the second corpus according to the personalized word segmentation model to obtain the user's personalized input method model.
10. The system according to claim 9, characterized in that the input method model generation module is further configured to present to the user, according to characters input by the user and the personalized input method model, the words corresponding to the characters, the words comprising at least one word.
11. The system according to claim 8, characterized in that the mapping table establishing module is configured to annotate collected compound words with their overall parts of speech, obtain the segmented words of the collected compound words and the parts of speech of the segmented words, obtain part-of-speech sequences formed from the parts of speech of the segmented words, and establish the correspondence between the part-of-speech sequences and the overall parts of speech to obtain the mapping table.
12. The system according to claim 9, characterized in that the second acquisition module obtains, from user log files, compound words whose frequency of use exceeds a preset threshold as the second corpus.
13. The system according to claim 9, characterized in that, when the second corpus contains a first part, the first part consisting of at least two words of a preset granularity, and the first part is in the personalized word segmentation model, the input method model generation module treats the first part as one compound word.
14. The system according to claim 10, characterized in that, when the output probability of the first part is the largest when the input is the characters, the input method model generation module presents the first part to the user as a single compound word, wherein the first part and the part of speech corresponding to the first part participate in the probability algorithm as a whole.
CN201410400307.0A 2014-08-14 2014-08-14 Method and system for personalizing user information Active CN104182390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410400307.0A CN104182390B (en) 2014-08-14 2014-08-14 Method and system for personalizing user information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410400307.0A CN104182390B (en) 2014-08-14 2014-08-14 Method and system for personalizing user information

Publications (2)

Publication Number Publication Date
CN104182390A true CN104182390A (en) 2014-12-03
CN104182390B CN104182390B (en) 2017-08-18

Family

ID=51963450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410400307.0A Active CN104182390B (en) 2014-08-14 2014-08-14 Method and system for personalizing user information

Country Status (1)

Country Link
CN (1) CN104182390B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1271132A (en) * 1999-04-15 2000-10-25 松下电器产业株式会社 Chinese character converter using grammar information
JP2001249921A (en) * 2000-03-03 2001-09-14 Nippon Telegr & Teleph Corp <Ntt> Compound word analysis method and device and recording medium having compound word analysis program recorded thereon
CN102929864A (en) * 2011-08-05 2013-02-13 北京百度网讯科技有限公司 Syllable-to-character conversion method and device
CN103870472A (en) * 2012-12-11 2014-06-18 百度国际科技(深圳)有限公司 Digging method and device for compound words

Also Published As

Publication number Publication date
CN104182390B (en) 2017-08-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant