CN101420313A - Method and system for clustering customer terminal user group - Google Patents


Info

Publication number
CN101420313A
Authority
CN
China
Prior art keywords
user
information
thesaurus
input method
attribute information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101761781A
Other languages
Chinese (zh)
Other versions
CN101420313B (en
Inventor
苏雪峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN2007101761781A priority Critical patent/CN101420313B/en
Publication of CN101420313A publication Critical patent/CN101420313A/en
Application granted granted Critical
Publication of CN101420313B publication Critical patent/CN101420313B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method for clustering client user groups, comprising the steps of: collecting the user word banks of multiple input method client users and recording the correspondence between each user and that user's word bank, wherein the word bank comprises words and word frequencies; calculating characteristic parameters of each user from the user's word bank; and calculating the similarity between the users' characteristic parameters to complete the clustering of the user group. The invention further comprises a step of providing a personalized information service to a user according to the user's class information. By collecting the input information of multiple users (including input content and input habits) and identifying user classes through a more accurate clustering strategy, the invention improves the accuracy of clustering input method client user groups, so that personalized information services can be provided to users with assured accuracy.

Description

A method and system for clustering a client user group
Technical field
The present invention relates to the field of Internet information processing, and in particular to a method and system for clustering a client user group, and to a method and system for providing personalized information services to users based on the clustering result.
Background art
Providing users with various personalized information services is the next direction in the development of Internet information technology, for example personalized search and the personalized delivery of relevant information (such as news, entertainment information, and advertising). To provide a personalized information service, however, a large amount of a user's personal information must first be collected and analyzed to determine the user's category, and the personalized information service for that category is then provided to the user.
In many conventional implementations, however, limitations in methods and resources make the collected user personality information inaccurate and incomplete, so the subsequent analysis of user groups deviates considerably.
For example, the data sources generally used to collect user personality information include the following three:
(1) The user's search history
A user's search history may include the query terms used, the search results clicked, and the content and classification of the clicked documents. None of this information, however, accurately describes the user's personal interests, so it cannot be used to identify the user's category accurately. The reasons are as follows:
First, users generally search in order to find the correct answer to a question or to look up information about something. Obtaining such unknown information cannot fully reflect the user's interests; it captures only part of the user's profile and reflects the user's personal attributes from only one angle.
Second, the titles and summaries in a search result list do not accurately reflect the content of the results, so a large share of the clicks that follow are invalid, which greatly reduces the reference value of this data.
Therefore, analyzing the user's search history cannot accurately identify the user's category.
(2) Information revealed by the user while browsing web pages
This information mainly includes the web pages the user has browsed. This data source likewise fails to reflect the user's personal interests accurately, and therefore cannot be used to identify the user's category accurately.
First, while surfing the web and browsing pages, users are easily steered by portal sites: much of what they browse is the popular, trending content that the sites promote, and such news reflects mass behavior rather than personal interest. Second, the pages a user browses often contain content contributed by other netizens, such as forums, communities, and blogs. This content probably represents the views of other users rather than the personal attributes of this user (for example, among the users who read the same blog post, some strongly agree and some strongly disagree), so it introduces noise into the collection of user personality information. Furthermore, the information users search for and browse is easily disturbed by advertisements embedded in pages and by pop-up advertisements, which makes the user information obtained even less accurate.
(3) Personal information registered online by the user
In the current network environment, users rarely leave their real information online, out of concern for security and privacy. The personal information they register online is often fabricated, so registration information is of limited value for correctly identifying the user's category.
Therefore, a technical problem that those skilled in the art urgently need to solve is how to devise a novel method of collecting, analyzing, and processing user personality information so that user categories can be identified more accurately, thereby improving the degree of personalization and the accuracy of the personalized information services provided to users.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for clustering a client user group that can collect a user's input content and/or input habits and obtain more accurate and richer user personality information, thereby identifying user categories more accurately.
Correspondingly, the present invention also provides an input method system for collecting a user's personalized input content and/or input habits, and even for providing personalized information services.
Correspondingly, the present invention also provides a method and system for providing personalized information services to users based on the above clustering result, which can provide efficient and accurate personalized information services on the basis of accurate identification of user categories.
To solve the above problem, the present invention discloses a method for clustering a client user group, comprising: collecting the user thesauri of a plurality of input method client users and recording the correspondence between each user and that user's thesaurus, the user thesaurus comprising words and word frequencies; calculating each user's characteristic parameters from that user's thesaurus; and calculating the similarity between the users' characteristic parameters to complete the clustering of the users.
Preferably, the user thesaurus further comprises: application software and its usage information; and/or bigram or N-gram information characterizing the association between words; and/or user registration information; and/or implicit user attribute information obtained by analyzing the input history; and/or implicit user attribute information obtained by analyzing the user registration information.
Preferably, the calculation of the user characteristic parameters further comprises a step of preprocessing the user thesaurus. The preprocessing step may comprise: processing the information in the user thesaurus directly to obtain the required user attribute information; or obtaining other information from the input method client and processing it together with the user thesaurus to obtain the required user attribute information.
According to another embodiment of the present invention, a system for clustering a client user group is also disclosed, comprising:
A thesaurus storage module, configured to gather the user thesauri of a plurality of input method client users and record the correspondence between each user and that user's thesaurus, the user thesaurus comprising words and word frequencies;
A characteristic parameter calculation module, configured to calculate each user's characteristic parameters from that user's thesaurus;
A clustering module, configured to calculate the similarity between the users' characteristic parameters and complete the clustering of the users.
Preferably, the user thesaurus may further comprise: application software and its usage information; and/or bigram or N-gram information characterizing the association between words; and/or user registration information; and/or implicit user attribute information obtained by analyzing the input history; and/or implicit user attribute information obtained by analyzing the user registration information.
Preferably, the characteristic parameter calculation module further comprises a preprocessing submodule, configured to preprocess the user thesaurus. The preprocessing may comprise: processing the information in the user thesaurus directly to obtain the required user attribute information; or obtaining other information from the input method client and processing it together with the user thesaurus to obtain the required user attribute information.
According to another embodiment of the present invention, a method for providing personalized information services to a client user group is also disclosed, comprising: collecting the user thesauri of a plurality of input method client users and recording the correspondence between each user and that user's thesaurus, the user thesaurus comprising words and word frequencies; calculating each user's characteristic parameters from that user's thesaurus; calculating the similarity between the users' characteristic parameters to complete and record the clustering of the users; and providing a personalized information service to a user according to that user's category information.
Preferably, the user thesaurus may further comprise: application software and its usage information; and/or bigram or N-gram information characterizing the association between words; and/or user registration information; and/or implicit user attribute information obtained by analyzing the input history; and/or implicit user attribute information obtained by analyzing the user registration information.
Preferably, the calculation of the user characteristic parameters further comprises a step of preprocessing the user thesaurus. The preprocessing step may comprise: processing the information in the user thesaurus directly to obtain the required user attribute information; or obtaining other information from the input method client and processing it together with the user thesaurus to obtain the required user attribute information.
Preferably, the personalized information service may comprise: recommending an auxiliary thesaurus related to the category to which the current user belongs; and/or providing personalized search results; and/or recommending specific information related to the category to which the current user belongs.
When the personalized information service comprises providing personalized search results, the personalized search results may comprise: personalized result ranking and/or result filtering for the user; and/or search results of other information types for the user; and/or recommendations of related search elements for the user.
According to another embodiment of the present invention, a system for providing personalized information services to a client user group is also disclosed, comprising:
A thesaurus storage module, configured to gather the user thesauri of a plurality of input method client users and record the correspondence between each user and that user's thesaurus, the user thesaurus comprising words and word frequencies;
A characteristic parameter calculation module, configured to calculate each user's characteristic parameters from that user's thesaurus;
A clustering module, configured to calculate the similarity between the users' characteristic parameters, and to complete and record the clustering of the users;
A category information application module, configured to provide a personalized information service to a user according to that user's category information.
Preferably, the user thesaurus may further comprise: application software and its usage information; and/or bigram or N-gram information characterizing the association between words; and/or user registration information; and/or implicit user attribute information obtained by analyzing the input history; and/or implicit user attribute information obtained by analyzing the user registration information.
Preferably, the characteristic parameter calculation module further comprises a preprocessing submodule, configured to preprocess the user thesaurus. The preprocessing may comprise: processing the information in the user thesaurus directly to obtain the required user attribute information; or obtaining other information from the input method client and processing it together with the user thesaurus to obtain the required user attribute information.
Preferably, the personalized information service may comprise: recommending an auxiliary thesaurus related to the category to which the current user belongs; and/or providing personalized search results; and/or recommending specific information related to the category to which the current user belongs.
When the personalized information service comprises providing personalized search results, the personalized search results may comprise: personalized result ranking and/or result filtering for the user; and/or search results of other information types for the user; and/or recommendations of related search elements for the user.
According to another embodiment of the present invention, an input method system is also disclosed, comprising an input interface unit, a thesaurus, and a matching and display unit. The input method system may further comprise:
A recording unit, configured to record the user's input information, the input information comprising words and word frequencies;
A preprocessing unit, configured to analyze the user's input information to obtain the user's implicit attribute information;
A user thesaurus construction module, configured to generate a user thesaurus comprising words and word frequencies and the user's implicit attribute information;
A communication unit, configured to transmit the user's identifier and the user thesaurus to a server.
Preferably, the user thesaurus may further comprise: application software and its usage information; and/or user registration information.
Preferably, the input method system may further comprise:
A user category information storage unit, configured to obtain and store the current user's category information, which the server derives by analyzing a plurality of user thesauri;
A category information application unit, configured to provide personalized information services to the user according to the user's category information.
Preferably, the personalized information service may comprise: recommending an auxiliary thesaurus related to the category to which the current user belongs; and/or providing personalized search results; and/or recommending specific information related to the category to which the current user belongs.
Compared with the prior art, the present invention has the following advantages:
When using a computer for everyday document work, web browsing and chat, or games and entertainment, users frequently enter text through an input method in order to interact with the computer. This raw input text reveals, to a certain extent, personal information such as the user's interests, industry, and usage habits. Compared with the three information sources described in the background section, the information that the user actively enters reflects the user's personal characteristics more accurately and completely.
Therefore, by recording, analyzing, and extracting from the user's input information, the present invention can obtain accurate personal information about the user. Because this information is actively entered by the user rather than passively received, the personal information collected is more accurate and complete than that obtained by conventional means.
Furthermore, by gathering the input information of multiple users (including input content and/or input habits) and identifying user categories with a more accurate clustering strategy, the invention improves the accuracy of clustering input method client user groups, so that personalized information services can be provided to users with a reasonable degree of accuracy.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of an embodiment of a method for clustering a client user group according to the present invention;
Fig. 2 is a structural block diagram of an embodiment of a system for clustering a client user group according to the present invention;
Fig. 3 is a flowchart of the steps of an embodiment of a method for providing personalized information services to a client user group according to the present invention;
Fig. 4 is a structural block diagram of an embodiment of a system for providing personalized information services to a client user group according to the present invention;
Fig. 5 is a structural block diagram of an embodiment of an input method system according to the present invention;
Fig. 6 is a structural block diagram of another embodiment of an input method system according to the present invention.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
The personalized information service scheme of the present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
One of the core ideas of the present invention is that, as a user uses an input method, an input history gradually accumulates, and this raw input text reflects to a certain extent personal information such as the user's interests, industry, and usage habits. The present invention can therefore classify users automatically according to their user thesauri, dividing them into different groups. Users within the same group may share interests, common vocabulary, a similar style of language, and so on. Once a user's group information has been obtained, a group thesaurus can be recommended to the user and personalized information services such as personalized search can be provided, making the service more convenient for users.
Referring to Fig. 1, an embodiment of a method for clustering a client user group according to the present invention is shown, which may comprise the following steps:
Step 101: collect the user thesauri of a plurality of input method client users, and record the correspondence between each user and that user's thesaurus; the user thesaurus comprises words and word frequencies;
Step 102: calculate each user's characteristic parameters from that user's thesaurus;
Step 103: calculate the similarity between the users' characteristic parameters to complete the clustering of the users.
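As an illustration only, steps 101 to 103 can be sketched as follows. The patent does not prescribe a particular characteristic parameter or similarity measure; this sketch assumes a unit-length word-frequency vector as the characteristic parameter and cosine similarity as the measure. The user IDs, words, and function names are hypothetical.

```python
import math

def feature_vector(word_freq):
    """Step 102 (assumed form): scale raw counts to a unit-length vector."""
    norm = math.sqrt(sum(c * c for c in word_freq.values()))
    return {w: c / norm for w, c in word_freq.items()} if norm else {}

def similarity(a, b):
    """Step 103 (assumed measure): cosine similarity of two unit vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

# Step 101: hypothetical user IDs mapped to their collected thesauri.
thesauri = {
    "user_a": {"stock": 8, "fund": 5, "bond": 2},
    "user_b": {"stock": 6, "fund": 4, "goal": 1},
    "user_c": {"goal": 9, "league": 6, "coach": 3},
}

features = {user: feature_vector(wf) for user, wf in thesauri.items()}
sim_ab = similarity(features["user_a"], features["user_b"])
sim_ac = similarity(features["user_a"], features["user_c"])
```

In this toy data, user_a and user_b share most of their vocabulary and score close to 1, while user_a and user_c share no words and score 0, so a clustering step would place the first two users together.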
The term "word frequency" used in step 102 of this embodiment is common terminology in the input method field. Besides the input frequency of words, it also covers the input frequency of single characters. The input frequency may be an absolute value, a relative value, or another value that indirectly indicates input frequency after processing by some strategy or algorithm, for example weight ranking information.
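The three representations of input frequency mentioned above (absolute, relative, and a derived ranking) might look like this in practice. This is a minimal sketch; the helper names and sample counts are hypothetical, and the ranking rule (ties broken alphabetically) is an assumption.

```python
def relative_frequencies(counts):
    """Relative value: each count divided by the total number of inputs."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def frequency_ranks(counts):
    """Derived value: rank 1 is the most frequently typed entry.
    Ties are broken alphabetically (an assumption of this sketch)."""
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: rank for rank, w in enumerate(ordered, start=1)}

# Absolute values: raw counts, covering a single character as well as words.
counts = {"的": 30, "天气": 6, "股票": 4}

relative = relative_frequencies(counts)
ranks = frequency_ranks(counts)
```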
The user thesaurus of the present invention may be stored in any feasible form, such as a data table or a plain text file; in general, it needs to contain words and word frequency information.
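One possible in-memory shape for a user thesaurus is sketched below. Only the words and word frequencies are required by the text; the remaining fields mirror the optional contents listed in the summary. The class name, field names, and types are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class UserThesaurus:
    user_id: str                                   # identifier linking user and thesaurus
    word_freq: Dict[str, int] = field(default_factory=dict)   # required: words and counts
    # Optional contents (assumed representations):
    bigram_freq: Dict[Tuple[str, str], int] = field(default_factory=dict)
    app_usage: Dict[str, int] = field(default_factory=dict)   # application -> use count
    registration: Dict[str, str] = field(default_factory=dict)
    implicit_attributes: Dict[str, str] = field(default_factory=dict)

    def record(self, word: str, times: int = 1) -> None:
        """Add `times` occurrences of `word` to the frequency table."""
        self.word_freq[word] = self.word_freq.get(word, 0) + times

thesaurus = UserThesaurus(user_id="user_a")
thesaurus.record("长城")
thesaurus.record("长城", 2)
```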
The user thesaurus may be collected as follows: the input method client sends the user thesaurus, in real time or periodically, to a collecting computing device (preferably a server). That is, the input method client's computing device preferably has a module that sends the user thesaurus automatically.
The user thesaurus may also be collected as follows: input method users send their own thesauri to the collecting end periodically or occasionally, that is, the sending is initiated manually by the user. For example, each user sends his or her personal thesaurus to a common e-mail address or a common server for collection.
Moreover, for a networked input method (one that provides the user only with an input interface and a display interface and completes the whole input process by connecting to a server), collecting the user thesaurus is even simpler: the input method system the user is using is itself a server, so the user thesaurus can be stored directly on the server. In fact, any means of gathering the information is feasible for the present invention, and they are not enumerated here.
It should be noted that the server in the present invention is a logical concept and is not limited to a physical server. With existing technology, an ordinary computing terminal may also act logically as a server for transferring information, for example with P2P technology.
Recording the correspondence between a user and the user's thesaurus in step 101 requires user identification information. In general, the user's registration ID or the ID of the input method client can serve as the identifier; this is not described in detail here.
What step 103 performs is a clustering process. "Clustering" is a technical term in this field that generally means: when the samples carry no category labels, multiple sample elements are merged into several sets according to their correlation, based on the inherent structure of the sample data. Each set is called a class, and the elements in each class should share some commonality (which can be controlled by a parameter threshold). Within a class, the distance between individuals is small, whereas the distance between individuals in different classes is larger. The distance represents the degree of similarity: the smaller the distance, the more similar the individuals. For example, in automatic web page clustering, a distance function is commonly used to express the similarity between pages. Because clustering algorithms have been studied extensively in fields such as artificial intelligence and data mining, the clustering algorithm itself is not described in detail here.
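Since the text leaves the clustering algorithm open, the following is only one simple possibility: a greedy "leader" clustering in which a user joins the first class whose first member lies within a distance threshold. Cosine distance, the threshold values, and the sample data are assumptions of this sketch, not the method of the patent.

```python
import math

def cosine_distance(a, b):
    """Distance in [0, 1] for non-negative vectors; smaller means more similar."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def leader_clustering(vectors, threshold):
    """Assign each user to the first class whose leader (first member)
    is within `threshold`; otherwise start a new class."""
    clusters = []
    for user, vec in vectors.items():
        for members in clusters:
            if cosine_distance(vec, vectors[members[0]]) <= threshold:
                members.append(user)
                break
        else:
            clusters.append([user])
    return clusters

# Hypothetical word-frequency vectors for four users.
vectors = {
    "user_a": {"stock": 8, "fund": 5},
    "user_b": {"stock": 6, "fund": 4, "goal": 1},
    "user_c": {"goal": 9, "league": 6},
    "user_d": {"league": 5, "goal": 7},
}

fine = leader_clustering(vectors, threshold=0.3)
coarse = leader_clustering(vectors, threshold=1.0)
```

With the tighter threshold the four users fall into two classes; with the loosest threshold they collapse into one, which shows how the threshold parameter controls the commonality required within a class.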
Depending on the preset clustering strategy, step 103 can yield user categories of various granularities (that is, broader user classes or finer user groups) to meet the needs of various subsequent applications.
Besides words and word frequencies, the user thesaurus of this embodiment may also include application software and its usage information, for example the number of times an application has been used and its approximate usage time. How software is used also reflects the user's personal information to some extent. For example, a user who often uses office software is likely to be a business person, so promoting more business and financial information in that user's search results would meet a personalized need. A user who frequently uses Excel may be a clerical worker. People who often use instant messaging tools may be more inclined to take part in interactive content such as blogs, communities, and forums. People who often use music players may be interested in fashion-related content. Users who often use a web browser may care more about news, entertainment, and gossip.
In another preferred embodiment of the present invention, the user thesaurus may also include bigram information. Bigram information describes the adjacency between consecutive units of text and is generally obtained by bigram statistics, where "bi" refers specifically to counting pairs of adjacent units. For example, suppose the input is "不到长城非好汉" ("One is not a true man unless he comes to the Great Wall"). Taking the single character as the smallest unit, the input splits into the seven characters "不", "到", "长", "城", "非", "好", "汉", and its bigrams are "不到", "到长", "长城", "城非", "非好", "好汉". The word frequencies and bigram relations in the collected input information reflect the vocabulary and language style the user commonly uses in everyday input, so some of the user's personal information can be obtained by analysis.
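The character-level bigram statistics described above can be sketched in a few lines. The example uses the proverb 不到长城非好汉, which is the sentence rendered in English as "One is not a true man unless he comes to the Great Wall"; the function name is hypothetical, and the `n` parameter is included only to show that the same approach extends to longer n-grams.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Return every run of n adjacent characters, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "不到长城非好汉"
bigrams = char_ngrams(sentence)       # adjacent character pairs
bigram_counts = Counter(bigrams)      # frequency of each pair
trigrams = char_ngrams(sentence, n=3)
```

Accumulating `bigram_counts` over a user's whole input history would give the bigram part of that user's thesaurus.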
Of course, the present invention is not limited to collecting bigram information. In terms of effect, collecting n-gram information (n ≥ 2) would be better; collecting only bigram information is simply the preferred scheme given the computing power of current user terminals.
In another preferred embodiment of the present invention, the user thesaurus may also include user registration information, for example age, address, and occupation. This information can be used directly as one or more feature dimensions of the user thesaurus. Alternatively, the input method client can analyze it with a preset algorithm to obtain the user's implicit attribute information, and the implicit attributes obtained are then used as one or more feature dimensions of the user thesaurus.
In another preferred embodiment of the present invention, the user thesaurus may also include implicit user attribute information obtained by analyzing the input history. Put simply, the input history consists of words and word frequencies, and may further include bigram information. Of course, if the computing power and storage capacity of the client device allow, the user's entire input history text can be recorded so that the user's personalized attributes can be analyzed more fully. Even for an input history containing only words and word frequencies, analysis from various angles can yield multiple implicit attributes of the user.
For example, analyzing the emotion degree reveals whether the user tends toward positive or negative emotion. The emotion degree can be obtained by counting the distribution in the text of a preset set of emotionally colored words. Nouns in a text often carry different emotional colors: "sword" has a vigorous color, "dark night" an oppressive one, and so on, so the emotion a text expresses can be assessed by counting how such words are distributed in it. In poetry, for instance, "withered vines, old trees, crows at dusk; a small bridge, flowing water, a few houses" carries strong emotional color; a mention of "cuckoo", as in "Emperor Wang entrusted his heart's longing to the cuckoo", brings an air of sadness; words of weaponry and war convey fervor; and "willow" and "orchid boat" convey a graceful, restrained mood. Counting how such words are distributed identifies the emotion degree of each text, and from it the user's emotional tendency.
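A minimal sketch of this kind of emotion-degree statistic is given below. The preset emotion vocabulary, its labels, and the decision rule (compare the summed frequencies of positive and negative words) are all assumptions made for illustration; the text does not specify them.

```python
# Hypothetical preset vocabulary of emotionally colored words.
EMOTION_LEXICON = {
    "sunshine": "positive",
    "victory": "positive",
    "withered": "negative",
    "dusk": "negative",
    "cuckoo": "negative",
}

def emotion_tendency(word_freq, lexicon=EMOTION_LEXICON):
    """Compare the total frequency of positive and negative words
    in a user's word-frequency table."""
    positive = sum(f for w, f in word_freq.items() if lexicon.get(w) == "positive")
    negative = sum(f for w, f in word_freq.items() if lexicon.get(w) == "negative")
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"

user_words = {"withered": 3, "dusk": 2, "victory": 1, "bridge": 4}
tendency = emotion_tendency(user_words)
```

The resulting label could then be stored as one implicit attribute, that is, one feature dimension, of the user thesaurus.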
Similarly, other implicit attributes of the user, such as the user's category information and language style, can be obtained by counting word attributes from other angles.
The user thesaurus of the present invention has been described in detail above. As the content of the user thesaurus becomes richer, it in effect becomes a user model that characterizes the user along multiple feature dimensions. While the user thesaurus is being built, the computing device of the input method client may need to perform some data processing (for example, computing the user attribute information). In practice, this work can also be carried out by the server that collects the user thesauri. That is, the calculation of the user's characteristic parameters further comprises a step of preprocessing the user thesaurus.
The preprocessing falls into two cases. In the first, the information in the user thesaurus is processed directly to obtain the required feature dimensions, for example by deriving the user's implicit attribute information from the input history. In the second, additional information (such as the user's registration information) is obtained from the input method client and processed together with the user thesaurus to obtain the required feature dimensions.
That is, in the present invention the input method client may collect only the raw information and send it to the server, which preprocesses it and then calculates characteristic parameters over multiple dimensions. Alternatively, the input method client may both collect the information and perform some preprocessing, then send the resulting user thesaurus, which already contains user attribute information, to the server; the server then calculates the characteristic parameters of the multiple dimensions directly from that user thesaurus.
The process of clustering on the basis of user thesauri is described in detail below.
In this example, the basic data of the user thesaurus is the mapping from words to word frequencies; in addition it contains software usage information and implicit user attribute information such as the user's category information, language style, and emotional factors.
Table 1 shows the raw user thesaurus collected from one user.
Table 1
Figure A200710176178D00161
This information is then abstracted and discretized into the form of a feature vector; Table 2 shows the feature vector form after abstraction. For example, the following coding mappings may be used:
(1) word -> term id: Sohu -> t11, Zhou Jielun -> t18
(2) software -> software id: Word -> t21, msn -> t23
(3) category -> category id: entertainment -> t31, sports -> t32
(4) language style -> style id: martial arts -> t41, romance -> t42
(5) user emotion -> emotion id: happy -> t51, negative -> t52
After the coding mapping is complete, the original word frequencies, software usage counts, category labels, style labels, and emotional factors need to be converted into weight scores. For example, the form after conversion is:
Table 2
Figure A200710176178D00162
A user can then be represented as: (w11, w12, w13, ..., w21, w22, w23, ..., w31, w32, w33, ..., w41, w42, w43, ..., w51, w52, w53, ...)
The weights are normalized with different methods depending on the type of data:
The w1x series represents word frequency information and can be normalized with the tf-idf method commonly used in statistics. Here tf is the number of times a term (word) occurs in a document: the more often it occurs in the document, the larger the normalized weight. idf is the inverse of the total number of times the term occurs in a corpus: the more often it occurs in the corpus, the smaller the idf and the smaller the normalized weight. For example, "Beijing" is a more common word than "Sohu", so the idf of "Beijing" is smaller than that of "Sohu".
The w2x series represents software usage information and can be expressed directly as the number of times the software is used;
The w3x series represents category information, with 0 or 1 indicating whether the user belongs to the category;
The w4x series represents language style information, with 0 or 1 indicating whether the user has that language style;
The w5x series represents user emotion information, with 0 or 1 indicating whether the user has that emotional tendency.
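The weighting scheme listed above might be assembled into a feature vector roughly as follows. This is a sketch under assumptions: the dictionary layout of `user`, the helper names, and the id lists are hypothetical, and the idf here is the plain inverse count described in the text (practical systems usually use a log-scaled variant instead).

```python
def inverse_corpus_frequency(total_occurrences):
    """idf as described in the text: the inverse of a term's corpus count."""
    return 1.0 / total_occurrences

def feature_vector(user, word_ids, software_ids, flag_ids, idf):
    """Build (w1x..., w2x..., w3x-w5x...) for one user.

    user: {"word_freq": {id: count}, "software_use": {id: count}, "flags": set of ids}
    idf:  {word id: idf weight}
    """
    # w1x: term frequency scaled by idf
    vec = [user["word_freq"].get(t, 0) * idf.get(t, 0.0) for t in word_ids]
    # w2x: raw software usage counts
    vec += [user["software_use"].get(s, 0) for s in software_ids]
    # w3x, w4x, w5x: 0/1 membership flags (category, style, emotion)
    vec += [1 if f in user["flags"] else 0 for f in flag_ids]
    return vec
```

Because every user is encoded against the same ordered id lists, all vectors have the same length and can be compared dimension by dimension.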
Once the feature representation of the users has been obtained, a clustering method can be applied to it. Clustering methods can broadly be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, and so on. Different clustering requirements can be met by adjusting the clustering granularity.
The workflow of a clustering algorithm is described below using k-means, a representative partitioning method, as an example. The k-means algorithm is a typical dynamic clustering method based on the degree of similarity between samples, and it is an unsupervised learning method.
The algorithm proceeds as follows:
Input: the number of clusters k and n data objects.
Output: k clusters that satisfy the minimum-variance criterion.
Procedure:
(1) arbitrarily select k objects from the n data objects as the initial cluster centers;
(2) repeat steps (3) and (4) until no cluster changes any more;
(3) compute the distance from each object to each cluster center (the mean of the objects in that cluster), and reassign each object to the cluster whose center is nearest;
(4) recompute the mean (center) of each cluster that has changed.
The k-means algorithm works as follows. First, k objects are arbitrarily selected from the n data objects as the initial cluster centers. Each remaining object is then assigned, according to its similarity (distance) to these cluster centers, to the cluster (represented by its center) that it most resembles. Next, the center of each newly formed cluster (the mean of all objects in that cluster) is recomputed. This process is repeated until the criterion function converges; the mean squared error is generally used as the criterion function. The k clusters have the following properties: each cluster is as compact as possible, and the clusters are as well separated from one another as possible.
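The four steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: taking the first k objects as initial centers is one arbitrary choice among many, and the function name and `max_iter` guard are assumptions of the example.

```python
import math

def kmeans(points, k, max_iter=100):
    """Cluster equal-length numeric vectors; return (assignment, centers)."""
    # Step (1): pick k initial centers (here simply the first k objects).
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):  # Step (2): loop until nothing changes
        # Step (3): assign every object to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step (4): recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers
```

The result depends on the initial centers, so in practice the algorithm is often run several times with different starting points and the most compact clustering is kept.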
The basic operation in clustering is to decide whether two objects are similar, which is usually done by comparing similarity. First, a similarity formula is used to compute the similarity of each pair of objects: the larger the similarity, the more alike the two objects; the smaller the similarity, the less alike they are. Two objects whose similarity exceeds a given threshold can then be regarded as similar.
Suppose there are n user thesauri, each containing p term ids, and the feature vector matrix is
Figure A200710176178D00181
where $x_{ij}$ ($i = 1, \ldots, n$; $j = 1, \ldots, p$) is the weight of the j-th term of the i-th user thesaurus. The feature vector of the i-th thesaurus $X_i$ is the i-th row of matrix X, so the similarity between two user thesauri $X_K$ and $X_L$ can be computed from the similarity of rows K and L of matrix X.
Many distance and similarity formulas are in common use, for example:
Euclidean distance: the smaller the Euclidean distance, the more similar the two objects

$$d_{ij}(2) = \left( \sum_{a=1}^{p} (x_{ia} - x_{ja})^2 \right)^{1/2}$$

Cosine formula: the larger the cosine value, the more similar the two objects

$$\cos\theta_{ij} = \frac{\sum_{a=1}^{p} x_{ia} x_{ja}}{\sqrt{\sum_{a=1}^{p} x_{ia}^2 \cdot \sum_{a=1}^{p} x_{ja}^2}}, \qquad -1 \le \cos\theta_{ij} \le 1$$
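Both measures translate directly into code. The sketch below assumes each row of the matrix X is given as an equal-length sequence of numbers; the function names are illustrative, and returning 0.0 for an all-zero vector is a convention chosen for the example, since the cosine is undefined in that case.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Dot product divided by the product of the norms; larger means more similar."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm if norm else 0.0
```

Note that the two measures point in opposite directions: a threshold test reads "distance below threshold" for the first and "similarity above threshold" for the second.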
Finally, a concrete example is given here to further illustrate the clustering process.
Step 1: collect the raw user thesauri uploaded by the users.
Xiao Zhang is a member of the Sohu automobile friends' club, so his thesaurus contains many words related to automobiles, as shown in Table 3.
Table 3
Xiao Li is a member of the Sina automobile friends' club, and his thesaurus also contains some words related to automobiles, as shown in Table 4.
Table 4
Figure A200710176178D00192
Xiao Wang is an online writer who is not very interested in automobiles; he cares more about gossip news, as shown in Table 5.
Table 5
Figure A200710176178D00193
Step 2: convert the uploaded raw user thesauri into feature form, as shown in Table 6.
T11 -> Sohu, T12 -> automobile friends' club, T13 -> Renault, T14 -> Zhou Jielun, T15 -> Liu Yifei
T21 -> Word, T22 -> msn
T31 -> entertainment, T32 -> sports
T41 -> martial arts, T42 -> romance
T51 -> happy, T52 -> negative
Table 6
User        T11  T12  T13  T14  T15  T21  T22  T31  T32  T41  T42  T51  T52
Xiao Zhang  8.5  7.3  0    0    0    1    2    0    0    0    0    1    0
Xiao Li     0    7.3  7.5  0    0    2    2    0    1    0    0    1    0
Xiao Wang   0    0    0    7.5  7.2  2    2    1    0    0    1    1    1
This gives the feature vectors of the three users:
X1 (Xiao Zhang) (8.5,7.3,0,0,0,1,2,0,0,0,0,1,0);
X2 (Xiao Li) (0,7.3,7.5,0,0,2,2,0,1,0,0,1,0);
X3 (Xiao Wang) (0,0,0,7.5,7.2,2,2,1,0,0,1,1,1);
Step 3: cluster, using the Euclidean distance as an example.
The distance between Xiao Zhang and Xiao Li is D12 = sqrt[(8.5-0)^2 + (7.3-7.3)^2 + (0-7.5)^2 + (0-0)^2 + (0-0)^2 + (1-2)^2 + (2-2)^2 + (0-0)^2 + (0-1)^2 + (0-0)^2 + (0-0)^2 + (1-1)^2 + (0-0)^2] = sqrt(130.5) ≈ 11.4
The distance between Xiao Zhang and Xiao Wang is D13 = sqrt[(8.5-0)^2 + (7.3-0)^2 + (0-0)^2 + (0-7.5)^2 + (0-7.2)^2 + (1-2)^2 + (2-2)^2 + (0-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + (1-1)^2 + (0-1)^2] = sqrt(236.63) ≈ 15.4
This shows that the distance between Xiao Zhang and Xiao Li is smaller, so the two are more similar. In this way similar groups of people can be identified and gathered together by the clustering algorithm.
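The arithmetic of this example can be checked with a short script. The vectors are copied from Table 6; the helper name `pairwise_distances` is an assumption of the sketch, and `math.dist` (Python 3.8 or later) computes the Euclidean distance.

```python
import math
from itertools import combinations

# Feature vectors from Table 6, in the order T11..T15, T21, T22, T31, T32, T41, T42, T51, T52.
USERS = {
    "Xiao Zhang": (8.5, 7.3, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 0),
    "Xiao Li": (0, 7.3, 7.5, 0, 0, 2, 2, 0, 1, 0, 0, 1, 0),
    "Xiao Wang": (0, 0, 0, 7.5, 7.2, 2, 2, 1, 0, 0, 1, 1, 1),
}

def pairwise_distances(users):
    """Euclidean distance for every unordered pair of users."""
    return {
        (a, b): math.dist(users[a], users[b])
        for a, b in combinations(users, 2)
    }

distances = pairwise_distances(USERS)
closest_pair = min(distances, key=distances.get)  # the most similar two users
for pair, d in distances.items():
    print(pair, round(d, 1))
```

Rounded to one decimal place, the Zhang/Li and Zhang/Wang distances reproduce the 11.4 and 15.4 given above, and the closest pair is Xiao Zhang and Xiao Li.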
Referring to Fig. 2, a system embodiment for clustering a client user group is shown, comprising:
A dictionary memory module 201, used to collect the user thesauri of a plurality of input method client users and to record the correspondence between each user and that user's thesaurus; the user thesaurus includes words and word frequencies;
A characteristic-parameter calculation module 202, used to calculate each user's characteristic parameters from that user's thesaurus;
A cluster module 203, used to calculate the similarity between the users' characteristic parameters and thereby complete the clustering of the users.
In another preferred embodiment of the present invention, the user thesaurus may also include: application software and its usage information; and/or bigram or N-gram information characterizing the association between words; and/or the user's registration information; and/or implicit user attribute information obtained by analyzing the input history; and/or implicit user attribute information obtained by analyzing the user's registration information. That is, the user thesaurus may include any one of the above kinds of additional information or any combination of them; it may of course also include other types of additional information, which this specification cannot enumerate one by one.
When the input method client only provides raw information, the characteristic-parameter calculation module 202 further comprises a preprocessing submodule, used to preprocess the user thesaurus; the preprocessed thesaurus is then used to calculate the characteristic parameters. For example, the raw information in the user thesaurus is processed directly to obtain the user's implicit attribute information, from which the required characteristic parameters are calculated; or additional information (such as the user's registration information) is obtained from the input method client and processed together with the user thesaurus to obtain the required characteristic parameters.
Referring to Fig. 3, a method embodiment for providing a personalized information service to a client user group is shown, which may comprise the following steps:
Step 301: collect the user thesauri of a plurality of input method client users, and record the correspondence between each user and that user's thesaurus; the user thesaurus includes words and word frequencies;
Step 302: for each user's thesaurus, calculate that user's characteristic parameters;
Step 303: calculate the similarity between the users' characteristic parameters, complete the clustering of the users, and record the result;
Step 304: according to a user's category information, provide a personalized information service to that user.
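Steps 303 and 304 can be pictured with the deliberately simplified sketch below. It assumes steps 301 and 302 have already produced one feature vector per user, replaces k-means with a greedy distance-threshold grouping to keep the example short, and uses hypothetical names (`cluster_by_threshold`, `recommend`, the `services` table) that do not come from the source.

```python
import math

def cluster_by_threshold(features, max_distance):
    """Step 303 (simplified): a user joins the first cluster whose seed
    vector lies within max_distance, otherwise starts a new cluster."""
    clusters = []  # list of (seed_vector, [user_ids])
    for user_id, vec in features.items():
        for seed, members in clusters:
            if math.dist(seed, vec) <= max_distance:
                members.append(user_id)
                break
        else:
            clusters.append((vec, [user_id]))
    return [members for _, members in clusters]

def recommend(user_id, clusters, services):
    """Step 304: return the service registered for the user's cluster index."""
    for index, members in enumerate(clusters):
        if user_id in members:
            return services.get(index)
    return None

# Feature vectors taken from Table 6 of the worked example.
features = {
    "Xiao Zhang": (8.5, 7.3, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 0),
    "Xiao Li": (0, 7.3, 7.5, 0, 0, 2, 2, 0, 1, 0, 0, 1, 0),
    "Xiao Wang": (0, 0, 0, 7.5, 7.2, 2, 2, 1, 0, 0, 1, 1, 1),
}
clusters = cluster_by_threshold(features, max_distance=12.0)
services = {0: "automobile auxiliary lexicon", 1: "entertainment auxiliary lexicon"}
```

With a threshold of 12.0, Xiao Zhang and Xiao Li fall into one cluster and Xiao Wang into another, so the two groups would receive different services.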
To obtain a more accurate user clustering result, the user thesaurus may also include: application software and its usage information; and/or bigram or N-gram information characterizing the association between words; and/or the user's registration information; and/or implicit user attribute information obtained by analyzing the input history; and/or implicit user attribute information obtained by analyzing the user's registration information. That is, the user thesaurus may include any one of the above kinds of additional information or any combination of them.
When the input method client only provides raw information, step 302 may further comprise a step of preprocessing the user thesaurus; the preprocessed thesaurus is then used to calculate the characteristic parameters. For example, the word and word frequency information in the user thesaurus is processed directly to obtain the required feature dimensions; or additional information (such as the user's registration information) is obtained from the input method client and processed together with the user thesaurus to obtain the required feature dimensions.
Because the clustering process has already been described in detail earlier in this specification, it is not repeated here; the provision of the personalized information service is explained below. In fact, because there are so many types of information service, the present invention need not limit them: any personalized information service provided on the basis of clustering users according to the foregoing embodiments falls within the core idea of the present invention.
Generally, the personalized information service may include recommending an auxiliary lexicon related to the category to which the current user belongs. The auxiliary lexicon may be produced manually or with any feasible existing technique. When the clustering result of the present invention shows that a current user belongs to a certain category, an auxiliary lexicon related to that category can be recommended to the user to improve the user's input efficiency.
Because the auxiliary lexicon records the vocabulary input habits of many users in that category, many words that the current user has never typed can still be obtained from it, which improves the user's input efficiency, especially for words entered for the first time.
The auxiliary lexicon of the present invention may include various lexicons, for example specialized lexicons. A specialized lexicon is customized for each type of user on the basis of differences in the words, word collocations, word frequency information, and/or sentence structures used by users in different professional fields; lexicons can generally be divided by subject area into a medical lexicon, an electrical lexicon, an IT lexicon, and so on. Users can of course also create, edit, and use lexicons of their own as needed. The cell lexicon mentioned in Chinese patent application No. 200710099474.6, entitled "A method of character input, an input method system, and a method of lexicon updating", can serve as another feasible kind of auxiliary lexicon. A cell lexicon is a lexicon with some common feature that is used by a particular group, a particular individual, or certain people (that is, the words in each cell lexicon share at least one attribute), for example: a lexicon of recent films, a lexicon of the latest song titles, a World of Warcraft lexicon, a biology lexicon, a lexicon of the names of everyone at Tsinghua University, a lexicon of the names of everyone at a given company, a lexicon of place names in Haidian District, and so on. Cell lexicons can be offered through a cell lexicon website for users to create, edit, search, and download, which allows a higher degree of personalization.
An auxiliary lexicon may contain entry information and may also contain the word frequency or word order information of the entries. Word frequency information expresses how likely the user is to use an entry, and its relative magnitude can represent word order. Word order information expresses the relative importance of an entry and is usually reflected in the entry's position among the candidates. In some cases, the auxiliary lexicon may also directly specify the position (or position range) of an entry among the candidates. For Chinese pinyin input methods, an entry in the auxiliary lexicon is usually associated with its corresponding pinyin information, but it can also be associated directly with a letter sequence, as with the "user-defined phrases" in the Sogou Pinyin input method.
Furthermore, besides the auxiliary lexicon of the category to which the current user belongs, the auxiliary lexicons related to that category may also include the lexicons of similar groups. For example, a user who belongs to the Super Girl fan group can be recommended the Super Boy auxiliary lexicon (for example, an auxiliary lexicon containing the names of the Super Boy contestants).
In other cases, the personalized information service may include recommending specific information related to the category to which the current user belongs. The specific information may be any personalized information that an information provider wishes to deliver to the user, for example news, entertainment information, advertisements, hot topics, stock information, related articles, or related products. This information may be published in various forms, such as pictures, text, audio, video, or any combination of these.
The specific information may be published in the following ways: publishing the relevant information to the end user's virtual space on the network, where the virtual space includes a personal website, a blog space, an e-mail inbox, and the like; or publishing it on the end user's computing device through the input method client (the published information may of course be obtained from the server).
The present invention does not limit the manner of presentation. Preferably, the relevant information may be presented on the local computing device through a browser window, for example as pop-up advertisements, floating advertisements, fade-in/fade-out advertisements, or horizontally or vertically sliding advertisements; these presentation techniques are all known in the art. Desktop widgets may of course also be used to present the relevant information at any position on the desktop of the computing device, for example in a row, a column, or a corner of the desktop.
As a further extension, the relevant information may also be presented at various positions of the input method platform itself, for example in the input method's candidate word window, in its status bar, or in their surrounding areas. The appearance of the input method platform, its "skin", may also be used to present relevant information: different colors, patterns, and types of skin present different relevant information. That is, the skin of the input method platform can not only be set by the user to a preferred style but can also adjust automatically according to the relevant information to be presented, to strengthen the presentation effect.
A link address may also be embedded in the relevant information, and those skilled in the art can readily improve the presentation methods according to user or commercial needs, so that relevant information is published without harming the user experience.
In other cases, the personalized information service may also include providing personalized search results. To provide personalized search results, one approach is to integrate a search interface into the input method client and link it to a remote search engine; another is for the input method client and the search engine to share the same user identification system.
The search engine involved here (taking web search as an example) can be understood as a system that collects information on the Internet according to some strategy, organizes and processes it, and then provides a retrieval service to users. For example, the search engine provides a page containing a search box; after the user enters a word in the search box and submits it through the browser, the search engine returns a list of information related to the user's input. The personalized search that the present invention aims to achieve can be understood as follows: based on personal information such as the user's identity, preferences, and usage habits, the search engine provides retrieval results better tailored to each user. Personalized search makes search results more accurate, reduces the user's search time, and better satisfies the user's search needs.
The personalized search results may take various forms; that is, various adjustment strategies can be set. Some possible forms are briefly described below:
(1) The personalized search results include personalized ranking and/or result filtering for the user. That is, according to the user's personal information, the information best suited to the user is ranked first. Furthermore, search results that do not match the user's personal information can be removed, or the search results can be grouped or summarized, and so on.
(2) The personalized search results include search results of other information types for the user.
For example, if the user model shows that the user's picture attribute and music attribute are very high, then in addition to ordinary web page results, some relevant picture and music search results are inserted into the personalized search results. In other words, although the user searched only through the search engine's web search interface, search results of other types, such as picture or music search results, can be returned. To some extent this realizes an integrated search interface for the user and spares the user the trouble of running separate category searches.
(3) The personalized search results include recommendations of related search elements for the user.
For example, suppose Xiao Zhang's game attribute is high: when his query is "Demi-Gods and Semi-Devils", related search keywords such as "Demi-Gods and Semi-Devils download" and "Demi-Gods and Semi-Devils strategy guide" can be shown. Xiao Liu's literature attribute is high: when he searches for "Demi-Gods and Semi-Devils", related search keywords such as "Demi-Gods and Semi-Devils novel" and "Demi-Gods and Semi-Devils e-book download" can be shown instead.
Of course, the related search elements described here include not only keywords in text form but also pictures, videos, and other forms of related search elements.
Only a few ways of personalizing search results are listed above; those skilled in the art can also devise other feasible ways according to actual needs.
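Form (1), personalized ranking with optional filtering, might look like the following sketch. Everything here is an assumption made for illustration: the tuple layout of `results`, the multiplicative boost, and the `drop_below` cut-off are not specified by the source.

```python
def personalize(results, user_interests, drop_below=None):
    """Re-rank generic results for one user.

    results:        list of (title, topic, base_score) tuples
    user_interests: topic -> interest weight taken from the user model
    drop_below:     optional score under which a result is filtered out
    """
    scored = [
        (title, base * (1.0 + user_interests.get(topic, 0.0)))
        for title, topic, base in results
    ]
    if drop_below is not None:
        scored = [item for item in scored if item[1] >= drop_below]
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [title for title, _ in ranked]
```

For a user with a high game attribute, a game-related page can overtake a literature page that had the higher generic score; with `drop_below` set, weakly matching results disappear from the list altogether.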
Generally, personalized search results can be obtained by adjusting generic search results. For example, the search engine server needs no modification and still provides uniform generic search results, which the input method client then adjusts. Alternatively, the server both provides the generic search results and performs the personalization adjustment.
In another embodiment of the present invention, personalized search results can also be obtained by adjusting the search process itself, for example by applying a user-specific search strategy. For example, when the user's "female" attribute is known to be strong, information that women seldom pay attention to can be filtered out directly during the search. The filtering can be implemented by topic: when the topic of a web document does not meet a preset condition, the page is skipped and not searched in detail. The filtering can also be implemented by checking whether certain attribute values in an information model meet a preset condition: if the attribute value of a document in the information model does not meet the condition, the document is skipped and not searched in detail. This filtering avoids further word segmentation and retrieval for such information and thus saves system resources to some extent.
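The topic-based skip can be sketched as a pre-check inside the search loop. The document layout (`id`, `topic`, `text`), the substring match standing in for detailed retrieval, and the function name are assumptions of this example, not details from the source.

```python
def search_with_topic_filter(documents, query, excluded_topics):
    """Return ids of matching documents, skipping excluded topics up front.

    documents: list of {"id": ..., "topic": ..., "text": ...}
    """
    hits = []
    for doc in documents:
        if doc["topic"] in excluded_topics:
            continue  # skipped before any detailed (costly) matching
        if query in doc["text"]:  # stand-in for segmentation and retrieval
            hits.append(doc["id"])
    return hits
```

Because the topic test runs first, excluded documents never reach the matching step, which is where the resource saving described above comes from.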
Referring to Fig. 4, a system embodiment for providing a personalized information service to a client user group is shown, which may specifically comprise the following components:
A dictionary memory module 401, used to collect the user thesauri of a plurality of input method client users and to record the correspondence between each user and that user's thesaurus; the user thesaurus includes words and word frequencies;
A characteristic-parameter calculation module 402, used to calculate each user's characteristic parameters from that user's thesaurus;
A cluster module 403, used to calculate the similarity between the users' characteristic parameters, complete the clustering of the users, and record the result;
A classification information application module 404, used to provide a personalized information service to a user according to that user's category information.
The personalized information service may include: recommending an auxiliary lexicon related to the category to which the current user belongs; and/or providing personalized search results; and/or recommending specific information related to the category to which the current user belongs; and so on.
Taking search as an example, the following briefly describes how the classification information application module 404 is used to provide a personalized information service to the user.
As mentioned above, one way to provide a personalized information service to the user is as follows: the classification information application module 404 is located on the search engine server, a search interface linked to the remote search engine is integrated into the input method client, and the personalized search results are displayed by the input method client or a web browser. The classification information application module 404 may of course also be located in the input method client, for example to perform the personalization adjustment of generic search results. The personalized search results may include: personalized result ranking and/or result filtering for the user; and/or search results of other information types for the user; and/or recommendations of related search elements for the user.
In another preferred embodiment of the present invention, the input method client and the search engine share the same user identification system, so that personalized information services can be provided to the user over a wider scope: any search engine that shares the user identification can provide personalized information services by linking to the classification information application module 404.
When the classification information application module 404 is used to recommend to the user an auxiliary lexicon related to the category to which the current user belongs, the module may be located on the server; it obtains the corresponding auxiliary lexicon by querying the set of auxiliary lexicons and sends it to the input method client. The recommendation of relevant specific information can be carried out in the same way and is not described in detail here.
It should be noted that the classification information application module 404 may also be located in the input method client. By sending the category to which the current user belongs to the corresponding application servers, it obtains various personalized information services for the current user, for example by requesting a recommended auxiliary lexicon from an auxiliary lexicon application server, or by requesting recommended specific information from an information delivery server.
In a preferred embodiment of the present invention, the user thesaurus may also include: application software and its usage information; and/or bigram or N-gram information characterizing the association between words; and/or the user's registration information; and/or implicit user attribute information obtained by analyzing the input history; and/or implicit user attribute information obtained by analyzing the user's registration information.
In another embodiment of the present invention, the characteristic-parameter calculation module 402 further comprises a preprocessing submodule, used to preprocess the user thesaurus. This scheme applies when the user thesaurus from the input method client provides only raw information and has not been analyzed further to obtain user attribute information. The preprocessing may be: processing the information in the user thesaurus directly to obtain the required user attribute information; or obtaining other information from the input method client and processing it together with the user thesaurus to obtain the required user attribute information.
For example, users Xiao Wang, Xiao Li and Xiao Zhang are members of an automobile enthusiast group AA. Within this circle each member has a unique, personalized id, for example "Citroen L1" or "Chery ME", and members address one another by these ids when chatting on forums (that is, they frequently type these words). Outside the circle, few people know what these ids mean, let alone use them frequently. Therefore, from the usage frequency of certain particular words, the present invention can distinguish this automobile enthusiast group from the general public and automatically cluster the members of AA into one group. The following personalized information services can then be provided to each member of the user group:
(1) recommending an auxiliary lexicon tailored to this group for its members to use;
(2) personalized search: for example, mixing more results related to automobiles and engines into the search results, delivering automobile advertisements, and so on;
(3) recommending automobile-related hot topics, hot news, promotional information and the like to users within this group.
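The grouping of users by the usage frequency of particular words can be sketched as follows. This is a minimal illustration, assuming cosine similarity over word-frequency vectors and a fixed similarity threshold with single-link grouping; this passage does not prescribe either choice, and the function names, the threshold value and the sample data (including the short ids "L1" and "ME") are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(freq * b[word] for word, freq in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_users(thesauri, threshold=0.5):
    """Group users linked, directly or transitively, by similarity >= threshold."""
    users = sorted(thesauri)
    parent = {u: u for u in users}

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if cosine(thesauri[u], thesauri[v]) >= threshold:
                parent[find(u)] = find(v)

    groups = {}
    for u in users:
        groups.setdefault(find(u), []).append(u)
    return sorted(groups.values())


# Hypothetical user thesauri: word -> word frequency.
thesauri = {
    "xiao_wang": {"L1": 40, "ME": 35, "engine": 10},
    "xiao_li": {"L1": 30, "ME": 50, "forum": 5},
    "xiao_zhang": {"ME": 20, "L1": 25},
    "other_user": {"recipe": 30, "engine": 1},
}
clusters = cluster_users(thesauri)
print(clusters)
```

Because the three group members share the rare ids with high frequency, their vectors are nearly parallel, while a user who only shares a common word such as "engine" falls well below the threshold.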
With reference to Fig. 5, an embodiment of an input method system is shown, comprising an input interface unit 501, a system dictionary 502 and a matching display unit 503, characterized in that the input method system further comprises:
a recording unit 504, configured to record the user's input information, the input information comprising words and word frequencies; the input information may also comprise bigram information and, preferably, N-gram information;
a preprocessing unit 505, configured to analyze the user's input information to obtain the user's implicit attribute information;
a user thesaurus construction unit 506, configured to generate a user thesaurus, the user thesaurus comprising words and word frequencies and the user's implicit attribute information;
a communication unit 507, configured to transmit the user's identifier and the user thesaurus to the server.
The input method system embodiment shown in Fig. 5 is applicable to various languages, for example Chinese, Japanese, Korean and English. Because the application flow of the present invention is similar across languages, for convenience only its application to Chinese is described below.
The input modes that the input method system embodiment shown in Fig. 5 can adopt include keyboard input, handwriting input, voice input and so on. Because the information conversion used in these input modes is known technology, it is not described in detail here. The input method system can be applied in a variety of computing devices, for example a PC or a mobile phone.
The user thesaurus produced by the user thesaurus construction unit 506 may further comprise: application software and its usage information; and/or user registration information; and so on. Alternatively, this information may be kept separate from the user thesaurus and sent to the server by the communication unit 507; the server then processes it together with the information in the user thesaurus to obtain the feature parameters of the user.
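The cooperation of the recording unit, the user thesaurus construction unit and the communication unit can be sketched as below. This is only a hedged illustration: the class name `InputRecorder`, the field names and the JSON payload layout are assumptions, since the specification does not define a storage or transmission format.

```python
import json
from collections import Counter


class InputRecorder:
    """Hypothetical client-side recorder for words, word frequencies and bigrams."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.word_freq = Counter()
        self.bigrams = Counter()
        self._previous = None

    def record(self, word):
        """Record one committed word and the bigram it forms with the previous word."""
        self.word_freq[word] += 1
        if self._previous is not None:
            self.bigrams[(self._previous, word)] += 1
        self._previous = word

    def build_thesaurus(self, implicit_attributes=None):
        """Assemble the user thesaurus together with the user identifier."""
        return {
            "user_id": self.user_id,
            "words": dict(self.word_freq),
            "bigrams": [[a, b, n] for (a, b), n in sorted(self.bigrams.items())],
            "implicit_attributes": implicit_attributes or {},
        }


recorder = InputRecorder("user-001")
for word in ["L1", "engine", "L1", "engine", "forum"]:
    recorder.record(word)

# The serialized payload is what a communication unit might send to the server.
payload = json.dumps(recorder.build_thesaurus({"interest": "automobiles"}), sort_keys=True)
print(payload)
```

Keeping the implicit attributes as a separate field reflects the option described above of either embedding such information in the user thesaurus or sending it alongside.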
With reference to Fig. 6, another embodiment of an input method system is shown, comprising an input interface unit 601, a system dictionary 602 and a matching display unit 603, characterized in that the input method system further comprises:
a recording unit 604, configured to record the user's input information, the input information comprising words and word frequencies; the input information may also comprise bigram information;
a preprocessing unit 605, configured to analyze the user's input information to obtain the user's implicit attribute information;
a user thesaurus construction unit 606, configured to generate a user thesaurus, the user thesaurus comprising words and word frequencies and the user's implicit attribute information;
a communication unit 607, configured to transmit the user's identifier and the user thesaurus to the server;
a user category information storage unit 608, configured to obtain and store the current user's category information, which the server obtains by analyzing a plurality of user thesauri;
a category information application unit 609, configured to provide personalized information services to the user according to the user's category information.
The personalized information services may comprise: recommending an auxiliary lexicon related to the category to which the current user belongs; and/or providing personalized search results; and/or recommending specific information related to the category to which the current user belongs. Because this content has been described in detail above, it is not repeated here.
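One way the category information might be applied to provide personalized search results is sketched below. It assumes that each category is associated with a list of characteristic terms and that results carry a base relevance score; the function `rerank`, the additive boost and the sample data are hypothetical and are not taken from the specification.

```python
def rerank(results, category_terms, boost=1.0):
    """Re-rank (title, base_score) results, boosting titles that match category terms."""

    def score(item):
        title, base = item
        text = title.lower()
        hits = sum(1 for term in category_terms if term.lower() in text)
        return base + boost * hits

    # sorted() is stable, so results with equal scores keep their original order.
    return sorted(results, key=score, reverse=True)


results = [
    ("Pasta recipes", 0.9),
    ("Engine tuning guide", 0.6),
    ("Automobile sales promotion", 0.5),
]
ranked = rerank(results, ["automobile", "engine"])
print([title for title, _ in ranked])
```

For a user clustered into an automobile-related category, results mentioning automobiles or engines move ahead of an otherwise higher-scoring unrelated result.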
The input method system shown in Fig. 5 can be used to collect the user's input information and transmit it to the server in the form of a user thesaurus; the input method system shown in Fig. 6 can additionally be used to provide personalized information services to the user.
The embodiments in this specification are described in a progressive manner: each embodiment emphasizes its differences from the others, and the same or similar parts of the embodiments may be understood by cross-reference. Because the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiments.
The foregoing has described in detail the method and system for clustering a client user group, the method and system for providing personalized information services to users based on the clustering result, and the input method system provided by the present invention. Specific examples have been used herein to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (24)

1. A method for clustering a client user group, characterized by comprising:
collecting the user thesauri of a plurality of input method client users, and recording the correspondence between each user and that user's thesaurus, wherein the user thesaurus comprises words and word frequencies;
calculating, for each user's user thesaurus, the feature parameters of that user;
calculating the similarity between the feature parameters of the users to complete the clustering of the users.
2. The method of claim 1, characterized in that the user thesaurus further comprises:
application software and its usage information;
and/or bigram or N-gram information identifying the association relationships between words;
and/or user registration information;
and/or implicit user attribute information obtained by analyzing the input history;
and/or implicit user attribute information obtained by analyzing the user registration information.
3. The method of claim 1, characterized in that the calculation of the user's feature parameters further comprises a step of preprocessing the user thesaurus.
4. The method of claim 3, characterized in that the preprocessing step comprises:
processing the information in the user thesaurus directly to obtain the required user attribute information;
or obtaining other information from the input method client and processing it together with the user thesaurus to obtain the required user attribute information.
5. A system for clustering a client user group, characterized by comprising:
a thesaurus storage module, configured to collect the user thesauri of a plurality of input method client users and to record the correspondence between each user and that user's thesaurus, wherein the user thesaurus comprises words and word frequencies;
a feature parameter calculation module, configured to calculate, for each user's user thesaurus, the feature parameters of that user;
a clustering module, configured to calculate the similarity between the feature parameters of the users to complete the clustering of the users.
6. The system of claim 5, characterized in that the user thesaurus further comprises:
application software and its usage information;
and/or bigram or N-gram information identifying the association relationships between words;
and/or user registration information;
and/or implicit user attribute information obtained by analyzing the input history;
and/or implicit user attribute information obtained by analyzing the user registration information.
7. The system of claim 5, characterized in that the feature parameter calculation module further comprises a preprocessing submodule, configured to preprocess the user thesaurus.
8. The system of claim 7, characterized in that the preprocessing comprises:
processing the information in the user thesaurus directly to obtain the required user attribute information;
or obtaining other information from the input method client and processing it together with the user thesaurus to obtain the required user attribute information.
9. A method for providing personalized information services to a client user group, characterized by comprising:
collecting the user thesauri of a plurality of input method client users, and recording the correspondence between each user and that user's thesaurus, wherein the user thesaurus comprises words and word frequencies;
calculating, for each user's user thesaurus, the feature parameters of that user;
calculating the similarity between the feature parameters of the users to complete the clustering of the users, and recording the result;
providing personalized information services to a user according to that user's category information.
10. The method of claim 9, characterized in that the user thesaurus further comprises:
application software and its usage information;
and/or bigram or N-gram information identifying the association relationships between words;
and/or user registration information;
and/or implicit user attribute information obtained by analyzing the input history;
and/or implicit user attribute information obtained by analyzing the user registration information.
11. The method of claim 9, characterized in that the calculation of the user's feature parameters further comprises a step of preprocessing the user thesaurus.
12. The method of claim 11, characterized in that the preprocessing step comprises:
processing the information in the user thesaurus directly to obtain the required user attribute information;
or obtaining other information from the input method client and processing it together with the user thesaurus to obtain the required user attribute information.
13. The method of claim 9, characterized in that the personalized information services comprise:
recommending an auxiliary lexicon related to the category to which the current user belongs;
and/or providing personalized search results;
and/or recommending specific information related to the category to which the current user belongs.
14. The method of claim 9, characterized in that the personalized information services comprise providing personalized search results, and the personalized search results comprise:
personalized result ranking and/or result filtering for the user;
and/or search results of other information types for the user;
and/or recommendation of related search elements for the user.
15. A system for providing personalized information services to a client user group, characterized by comprising:
a thesaurus storage module, configured to collect the user thesauri of a plurality of input method client users and to record the correspondence between each user and that user's thesaurus, wherein the user thesaurus comprises words and word frequencies;
a feature parameter calculation module, configured to calculate, for each user's user thesaurus, the feature parameters of that user;
a clustering module, configured to calculate the similarity between the feature parameters of the users to complete the clustering of the users, and to record the result;
a category information application module, configured to provide personalized information services to a user according to that user's category information.
16. The system of claim 15, characterized in that the user thesaurus further comprises:
application software and its usage information;
and/or bigram or N-gram information identifying the association relationships between words;
and/or user registration information;
and/or implicit user attribute information obtained by analyzing the input history;
and/or implicit user attribute information obtained by analyzing the user registration information.
17. The system of claim 15, characterized in that the feature parameter calculation module further comprises a preprocessing submodule, configured to preprocess the user thesaurus.
18. The system of claim 17, characterized in that the preprocessing comprises:
processing the information in the user thesaurus directly to obtain the required user attribute information;
or obtaining other information from the input method client and processing it together with the user thesaurus to obtain the required user attribute information.
19. The system of claim 15, characterized in that the personalized information services comprise:
recommending an auxiliary lexicon related to the category to which the current user belongs;
and/or providing personalized search results;
and/or recommending specific information related to the category to which the current user belongs.
20. The system of claim 15, characterized in that the personalized information services comprise providing personalized search results, and the personalized search results comprise:
personalized result ranking and/or result filtering for the user;
and/or search results of other information types for the user;
and/or recommendation of related search elements for the user.
21. An input method system comprising an input interface unit, a dictionary and a matching display unit, characterized in that the input method system further comprises:
a recording unit, configured to record the user's input information, the input information comprising words and word frequencies;
a preprocessing unit, configured to analyze the user's input information to obtain the user's implicit attribute information;
a user thesaurus construction module, configured to generate a user thesaurus, the user thesaurus comprising words and word frequencies and the user's implicit attribute information;
a communication unit, configured to transmit the user's identifier and the user thesaurus to the server.
22. The input method system of claim 21, characterized in that the user thesaurus further comprises: application software and its usage information; and/or user registration information.
23. The input method system of claim 21, characterized by further comprising:
a user category information storage unit, configured to obtain and store the current user's category information, which the server obtains by analyzing a plurality of user thesauri;
a category information application unit, configured to provide personalized information services to the user according to the user's category information.
24. The input method system of claim 23, characterized in that the personalized information services comprise:
recommending an auxiliary lexicon related to the category to which the current user belongs;
and/or providing personalized search results;
and/or recommending specific information related to the category to which the current user belongs.
CN2007101761781A 2007-10-22 2007-10-22 Method and system for clustering customer terminal user group Active CN101420313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101761781A CN101420313B (en) 2007-10-22 2007-10-22 Method and system for clustering customer terminal user group

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101761781A CN101420313B (en) 2007-10-22 2007-10-22 Method and system for clustering customer terminal user group

Publications (2)

Publication Number Publication Date
CN101420313A true CN101420313A (en) 2009-04-29
CN101420313B CN101420313B (en) 2011-01-12

Family

ID=40630938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101761781A Active CN101420313B (en) 2007-10-22 2007-10-22 Method and system for clustering customer terminal user group

Country Status (1)

Country Link
CN (1) CN101420313B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060106769A1 (en) * 2004-11-12 2006-05-18 Gibbs Kevin A Method and system for autocompletion for languages having ideographs and phonetic characters
CN101000627B (en) * 2007-01-15 2010-05-19 北京搜狗科技发展有限公司 Method and device for issuing correlation information
CN100464308C (en) * 2007-04-20 2009-02-25 北京搜狗科技发展有限公司 Method and system for updating user vocabulary synchronouslly

Also Published As

Publication number Publication date
CN101420313B (en) 2011-01-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant