CN104317784A - Cross-platform user identification method and cross-platform user identification system

Info

Publication number: CN104317784A
Application number: CN201410521299.5A
Authority: CN
Inventors: 李寿山; 黄磊; 周国栋; 王红玲
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2014-09-30
Filing date: 2014-09-30
Publication date: 2015-01-28

Abstract

The invention discloses a cross-platform user identification method and system that take full account of the importance of user messages on social platforms. Whether two accounts on different platforms belong to the same user is identified from the similarity of the personalized information (such as the user's knowledge, interests, preferences, writing style and wording habits) reflected by the messages posted in the two accounts within a corresponding period of time. Specifically, the method obtains the message contents released within a preset period of time in the two accounts of the different platforms, performs word segmentation and feature extraction on the message contents of the two accounts, and on this basis uses the similarity between the segmented-word features of the two accounts' messages to identify whether the two accounts belong to the same user. The method and system thus solve the problem of identifying the same user on different social platforms and further provide support for cross-platform analysis of the same user's data.

Description

Cross-platform user identification method and system

Technical field

The invention belongs to the fields of natural language processing and social networks, and in particular relates to a cross-platform user identification method and system.

Background art

In recent years, with the rapid development of social networks, various microblogging (micro-blog) platforms, such as Sina Weibo, Tencent Weibo, Twitter and Facebook, have become increasingly popular with users.

Because microblogs have the characteristics of both broadcast media and social networks, they have attracted many researchers to analyze microblog data. At present, more and more users hold microblog accounts on several platforms at once, for example both a Sina account and a Tencent account. Studying the microblog data (such as microblog messages) of the same user on different platforms together makes it easier to analyze the user's interests and preferences comprehensively and in depth, which in turn helps enterprises formulate personalized marketing strategies and place advertisements accurately. It also makes it easier to compare the same user's motivations and habits across platforms, providing a better reference for operating social networks or developing new social network products.

However, research on identifying the same user across social platforms is currently almost nonexistent, and it is not possible to identify whether accounts on different platforms belong to the same user. Identifying the same user on different social platforms has therefore become a problem in urgent need of a solution.

Summary of the invention

In view of this, the object of the present invention is to provide a cross-platform user identification method and system that solve the problem of identifying the same user on different social platforms, and thereby provide support for cross-platform analysis of the same user's data.

To this end, the present invention discloses the following technical scheme:

A cross-platform user identification method, comprising:

Obtaining a first message segment of a first user account on a first platform and a second message segment of a second user account on a second platform, wherein the first message segment consists of all messages in the first user account whose issuing time falls within a first preset time period, and the second message segment consists of all messages in the second user account whose issuing time falls within the first preset time period;

Performing word segmentation on the first message segment and the second message segment respectively, to obtain the first message segment in segmented form and the second message segment in segmented form;

Performing feature extraction on the first and second message segments in segmented form based on preset segmented-word features, and, on the basis of the feature extraction, obtaining feature similarity values of the first message segment and the second message segment;

Judging whether the feature similarity values fall within a preset similarity reference range;

If the judgment result is yes, determining that the first user account and the second user account belong to the same user.

Preferably, in the above method, performing feature extraction on the first and second message segments in segmented form based on the preset segmented-word features, and obtaining the feature similarity values of the first message segment and the second message segment on the basis of the feature extraction, comprises:

Extracting trigram features (a trigram being three consecutive segmented words) from the first and second message segments in segmented form respectively, and obtaining a word inclusion similarity value of the two based on the number of identical trigrams contained in the first message segment and the second message segment;

Extracting high-frequency word features from the first and second message segments in segmented form respectively, and obtaining a high-frequency word similarity value of the two based on the number of identical high-frequency words contained in the first message segment and the second message segment;

Extracting single-character occurrence probabilities from the first and second message segments in segmented form respectively, and obtaining a character distribution similarity value of the two based on the occurrence probabilities of the identical single characters contained in the first message segment and the second message segment;

Extracting the latent topics of the first and second message segments in segmented form respectively, and obtaining a topic similarity value of the two based on the number of identical topics contained in the first message segment and the second message segment.

Preferably, before feature extraction is performed on the first and second message segments in segmented form, the above method further comprises filtering the first message segment in segmented form and the second message segment in segmented form respectively, the filtering comprising:

Removing stop words and low-frequency words from the first message segment in segmented form;

Removing stop words and low-frequency words from the second message segment in segmented form.

Preferably, the above method further comprises:

Using a preset number of message segment sample pairs in advance, and based on the feature similarity of each message segment sample pair, training a maximum entropy classification method for cross-platform user identification to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user, wherein:

The two message segments in each message segment sample pair belong to two accounts on different platforms; the two accounts are either accounts of the same user or accounts of different users; and the issuing time of the messages in each message segment sample pair falls within a second preset time period;

The feature similarity comprises word inclusion similarity, high-frequency word similarity, character distribution similarity and topic similarity.

Preferably, in the above method, the character distribution similarity value of the two is obtained by calculating the relative entropy D(p||q) of the first message segment and the second message segment:

$$D(p\|q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$

Wherein p and q represent the first message segment and the second message segment respectively, p(x) and q(x) represent the probabilities with which an identical single character x occurs in the first message segment and the second message segment respectively, and X represents the set of single characters common to the first message segment and the second message segment.

Preferably, in the above method, the document topic generation model LDA is used to extract the latent topics of the first message segment in segmented form and the second message segment in segmented form.

A cross-platform user identification system, comprising:

A message acquisition module, for obtaining a first message segment of a first user account on a first platform and a second message segment of a second user account on a second platform, wherein the first message segment consists of all messages in the first user account whose issuing time falls within a first preset time period, and the second message segment consists of all messages in the second user account whose issuing time falls within the first preset time period;

A word segmentation module, for performing word segmentation on the first message segment and the second message segment respectively, to obtain the first message segment in segmented form and the second message segment in segmented form;

A feature extraction module, for performing feature extraction on the first and second message segments in segmented form based on preset segmented-word features, and, on the basis of the feature extraction, obtaining feature similarity values of the first message segment and the second message segment;

A judgment module, for judging whether the feature similarity values fall within a preset similarity reference range;

An identification module, for identifying, when the judgment result is yes, that the first user account and the second user account belong to the same user.

Preferably, in the above system, the feature extraction module comprises:

A first extraction unit, for extracting trigram features from the first and second message segments in segmented form respectively, and obtaining a word inclusion similarity value of the two based on the number of identical trigrams contained in the first message segment and the second message segment;

A second extraction unit, for extracting high-frequency word features from the first and second message segments in segmented form respectively, and obtaining a high-frequency word similarity value of the two based on the number of identical high-frequency words contained in the first message segment and the second message segment;

A third extraction unit, for extracting single-character occurrence probabilities from the first and second message segments in segmented form respectively, and obtaining a character distribution similarity value of the two based on the occurrence probabilities of the identical single characters contained in the first message segment and the second message segment;

A fourth extraction unit, for extracting the latent topics of the first and second message segments in segmented form respectively, and obtaining a topic similarity value of the two based on the number of identical topics contained in the first message segment and the second message segment.

Preferably, the above system further comprises a filtering module for filtering the first message segment in segmented form and the second message segment in segmented form respectively, the filtering module comprising:

A first filtering unit, for removing stop words and low-frequency words from the first message segment in segmented form;

A second filtering unit, for removing stop words and low-frequency words from the second message segment in segmented form.

Preferably, the above system further comprises:

A preprocessing module, for using a preset number of message segment sample pairs in advance, and based on the feature similarity of each message segment sample pair, training a maximum entropy classification method for cross-platform user identification to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user.

As can be seen from the above scheme, the cross-platform user identification method and system disclosed by the invention take full account of the importance of user messages on social platforms, and identify whether two accounts belong to the same user from the similarity of the personalized information (such as the user's knowledge, interests, preferences, writing style and wording habits) reflected by the messages in two accounts on different platforms within a corresponding time period. Specifically, the method obtains the message contents whose issuing time falls within a preset time period in the two accounts of the different platforms, performs word segmentation and feature extraction on the message contents of the two accounts, and on this basis uses the similarity of the segmented-word features of the two accounts' messages to identify whether the two accounts belong to the same user. The invention thus solves the problem of identifying the same user on different social platforms, and thereby provides support for cross-platform analysis of the same user's data.

Brief description of the drawings

To illustrate the technical schemes of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the cross-platform user identification method disclosed in embodiment one of the present invention;

Fig. 2 is another flowchart of the cross-platform user identification method, disclosed in embodiment two of the present invention;

Fig. 3 is yet another flowchart of the cross-platform user identification method, disclosed in embodiment three of the present invention;

Fig. 4 is a schematic structural diagram of the cross-platform user identification system disclosed in embodiment four of the present invention;

Fig. 5 is another schematic structural diagram of the cross-platform user identification system disclosed in embodiment four of the present invention;

Fig. 6 is yet another schematic structural diagram of the cross-platform user identification system disclosed in embodiment four of the present invention.

Detailed description of the embodiments

The technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment one

Embodiment one discloses a cross-platform user identification method. With reference to Fig. 1, the method may comprise the following steps:

S101: obtain a first message segment of a first user account on a first platform and a second message segment of a second user account on a second platform, wherein the first message segment consists of all messages in the first user account whose issuing time falls within a first preset time period, and the second message segment consists of all messages in the second user account whose issuing time falls within the first preset time period.

This embodiment illustrates the method by taking as an example the identification of whether a Sina Weibo account and a Tencent Weibo account belong to the same user.

Specifically, the API (Application Programming Interface) provided by Sina Weibo can be used to capture, from a designated Sina Weibo account, the Sina user messages whose issuing time falls within the preset time period, and the API provided by Tencent Weibo can be used to capture, from a designated Tencent Weibo account, the Tencent user messages whose issuing time falls within the preset time period. For example, all message texts posted or forwarded by the user in the most recent three months are captured from Sina Weibo account userid1 to form text segment 1, and all message texts posted or forwarded by the user in the most recent three months are captured from Tencent Weibo account userid2 to form text segment 2.
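The construction of a message segment from already-captured messages can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `build_segment`, the `(posted_at, text)` tuple layout and the use of 90 days to stand in for "three months" are all assumptions, and the platform API calls themselves are not shown.

```python
from datetime import datetime, timedelta


def build_segment(messages, now, days=90):
    """Join the texts of all messages posted within the last `days` days.

    `messages` is a list of (posted_at, text) tuples; this layout is an
    assumption made for the sketch, not a format defined by either platform.
    """
    cutoff = now - timedelta(days=days)
    return "\n".join(
        text for posted_at, text in messages if cutoff <= posted_at <= now
    )


# Hypothetical captured messages for one account.
now = datetime(2014, 9, 30)
messages = [
    (datetime(2014, 9, 1), "recent post"),
    (datetime(2014, 1, 1), "old post"),
    (datetime(2014, 7, 15), "forwarded post"),
]
segment = build_segment(messages, now)
```

The same function would be applied once per account, giving text segment 1 and text segment 2.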

S102: perform word segmentation on the first message segment and the second message segment respectively, to obtain the first message segment in segmented form and the second message segment in segmented form.

Word segmentation means dividing a Chinese sentence into a sequence of words; for example, the sentence "I love China" becomes the word sequence "I / love / China" after segmentation.

In this step, the word segmentation software FudanNLP is used to segment the message segments obtained from the two accounts on different platforms (this application specifically uses text segments); for example, the text segment 1 of Sina Weibo account userid1 and the text segment 2 of Tencent Weibo account userid2 are segmented.

S103: perform feature extraction on the first and second message segments in segmented form based on preset segmented-word features, and obtain the feature similarity values of the first message segment and the second message segment on the basis of the feature extraction.

Step S103 comprises:

a. Extract trigram features from the first and second message segments in segmented form respectively, and obtain the word inclusion similarity value of the two based on the number of identical trigrams contained in the first message segment and the second message segment.

This step judges how similar the two text segments are from the number of identical trigrams they contain: the more identical trigrams they share, the more similar they are considered to be.

A trigram is a structure made up of three adjacent segmented words in a message text.

Step a extracts every trigram contained in the two segmented text segments. For example, suppose that the segmented text A is "have a holiday or vacation and do not know what does" and the segmented text B is "do not know what does". Four trigrams can then be extracted from text A: (1) have a holiday or vacation not only not; (2) but also do not know that; (3) do not know doing; (4) knows what does. Three trigrams can be extracted from text B: (1) does not know that doing; (2) knows doing what; (3) does what.

Afterwards, the word inclusion similarity value of the two text segments is determined from the number of identical trigrams they contain.

For example, if the two text segments contain 0 identical trigrams, their word inclusion similarity value is 0; if the number of identical trigrams is in the range 0 ~ 50, the value is 1; if it is in the range 50 ~ 100, the value is 2; if it is greater than 100, the value is 3.
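The trigram extraction and the mapping from shared-trigram count to a similarity level can be sketched as follows. The function names are hypothetical, the tokens `w1` to `w7` are placeholders for segmented words, and the treatment of the boundary counts 50 and 100 (which the ranges above leave ambiguous) is an assumption.

```python
def trigrams(tokens):
    """Return the set of trigrams (three consecutive tokens) in a token list."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def inclusion_similarity(tokens_a, tokens_b):
    """Map the number of shared trigrams to a level from 0 to 3."""
    shared = len(trigrams(tokens_a) & trigrams(tokens_b))
    if shared == 0:
        return 0
    if shared <= 50:    # assumption: a count of exactly 50 falls in level 1
        return 1
    if shared <= 100:   # assumption: a count of exactly 100 falls in level 2
        return 2
    return 3


# Placeholder tokens: a six-word text and a five-word text.
text_a = ["w1", "w2", "w3", "w4", "w5", "w6"]
text_b = ["w3", "w4", "w5", "w6", "w7"]
level = inclusion_similarity(text_a, text_b)
```

As in the example above, a six-word text yields four trigrams and a five-word text yields three; here two of them coincide, so the level is 1.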

b. Extract high-frequency word features from the first and second message segments in segmented form respectively, and obtain the high-frequency word similarity value of the two based on the number of identical high-frequency words contained in the first message segment and the second message segment.

The more identical high-frequency words two texts contain, the more similar the two texts are.

This step first sorts the segmented words of each text segment in descending order of frequency; for example, the segmented text "I / am / I" yields the sorted sequence "I, am".

Afterwards, the number of identical words among a preset number of leading high-frequency words in the two sorted sequences is counted (for example, among the top 100 high-frequency words of each sequence), and the high-frequency word similarity value of the two text segments is determined from that number.

For example, it can be specified that when the number of identical high-frequency words is 0, the high-frequency word similarity value is 0; when the number is in the range 0 ~ 20, the value is 1; when the number is in the range 20 ~ 50, the value is 2; when the number is in the range 50 ~ 100, the value is 3.
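A minimal sketch of the high-frequency word comparison follows. The function names are hypothetical, and the handling of the boundary counts 20 and 50 is an assumption, since the ranges above overlap at their endpoints.

```python
from collections import Counter


def top_words(tokens, n=100):
    """Return the n most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def high_frequency_similarity(tokens_a, tokens_b, n=100):
    """Map the number of shared top-n words to a level from 0 to 3."""
    shared = len(set(top_words(tokens_a, n)) & set(top_words(tokens_b, n)))
    if shared == 0:
        return 0
    if shared <= 20:   # assumption: a count of exactly 20 falls in level 1
        return 1
    if shared <= 50:   # assumption: a count of exactly 50 falls in level 2
        return 2
    return 3


ranked = top_words(["I", "am", "I"])
level = high_frequency_similarity(["I", "am", "I"], ["I", "like", "tea"])
```

With the three-token example from the text, `ranked` lists "I" before "am" because "I" occurs twice.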

c. Extract single-character occurrence probabilities from the first and second message segments in segmented form respectively, and obtain the character distribution similarity value of the two based on the occurrence probabilities of the identical single characters contained in the first message segment and the second message segment.

This step obtains the character distribution similarity value of the two by calculating the relative entropy D(p||q) of the two text segments:

$$D(p\|q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$

Here p and q represent the first message segment and the second message segment respectively, p(x) and q(x) represent the probabilities with which an identical single character x occurs in the first message segment and the second message segment respectively, and X represents the set of single characters common to the first message segment and the second message segment.

The relative entropy is also called the KL (Kullback-Leibler divergence) distance. The smaller the KL distance between two passages, the smaller the difference between their character distributions, that is, the more similar the two passages are in their distribution of characters.

For example, let p be the text "I am I" and q be the text "I am Chinese". In text p, the probability of the character 'I' is p(I) = 2/3 and the probability of the character 'am' is p(am) = 1/3; in text q, every character has probability 1/5. By the formula, the KL distance between texts p and q is:

$$D(p\|q) = p(\text{I}) \log \frac{p(\text{I})}{q(\text{I})} + p(\text{am}) \log \frac{p(\text{am})}{q(\text{am})}$$

On this basis, the character distribution similarity value of the two text segments is obtained from their KL distance.
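The relative entropy computation can be sketched as below, restricted to the characters the two texts share, as in the definition of X. The function names are hypothetical, the natural logarithm is an assumption (the text does not state a base), and the strings "aba" and "abcde" are stand-ins that reproduce the probabilities of the worked example (2/3 and 1/3 in the first text, 1/5 each in the second).

```python
import math
from collections import Counter


def char_probabilities(text):
    """Return the occurrence probability of each character in the text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: count / total for ch, count in counts.items()}


def relative_entropy(text_p, text_q):
    """Compute D(p||q) over the characters common to both texts."""
    p = char_probabilities(text_p)
    q = char_probabilities(text_q)
    shared = set(p) & set(q)
    # Natural logarithm is an assumption; the source does not specify a base.
    return sum(p[x] * math.log(p[x] / q[x]) for x in shared)


# Stand-in texts: 'a' plays the role of 'I', 'b' the role of 'am'.
distance = relative_entropy("aba", "abcde")
```

With natural logarithms the stand-in example evaluates to roughly 0.973; identical texts give a distance of 0.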

d. Extract the latent topics of the first and second message segments in segmented form respectively, and obtain the topic similarity value of the two based on the number of identical topics contained in the first message segment and the second message segment.

The microblog message texts that the same user posts on different social media are highly similar. On this basis, the more identical topics two text segments contain, the more likely it is that the two text segments come from the same user.

This step analyzes the content of the two text segments and uses LDA (Latent Dirichlet Allocation, a document topic generation model) to extract their latent topics. Suppose, for example, that game and military information often appear in the microblogs of text segment A, while entertainment information and Taobao shopping often appear in the microblogs of text segment B; after applying the LDA algorithm, the latent topics of text segment A are games, military affairs and so on, and the topics of text segment B are entertainment, online shopping and so on.

The topic similarity value of the two text segments is then determined from the number of identical topics in the two text segments. For example, if the number of identical topics is 0, the topic similarity value is 0; if it is 1, the value is 1; if it is 2, the value is 2; if it is 3, the value is 3; and so on.

The values 0, 1, 2, 3 and so on taken by the word inclusion similarity value, the high-frequency word similarity value and the topic similarity value merely represent different degrees of similarity between the two text segments; the larger the value, the higher the degree of similarity.
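Once the latent topics have been extracted, the topic similarity value is simply the number of topics the two segments share. The sketch below assumes the topics are already available as labels (the LDA extraction itself is not shown), and the function name is hypothetical.

```python
def topic_similarity(topics_a, topics_b):
    """Return the number of topics the two text segments have in common."""
    return len(set(topics_a) & set(topics_b))


# Topic labels taken from the example above; assumed to come from LDA.
segment_a_topics = ["games", "military"]
segment_b_topics = ["entertainment", "online shopping"]
score = topic_similarity(segment_a_topics, segment_b_topics)
```

For the example segments the score is 0, since they share no topic.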

S104: judge whether the feature similarity values fall within the preset similarity reference range.

S105: if the judgment result is yes, the first user account and the second user account belong to the same user.

This step compares each similarity value actually obtained for the two text segments (the word inclusion similarity value, the high-frequency word similarity value, the character distribution similarity value and the topic similarity value) with prespecified reference data, identifies whether the two text segments (which come from two accounts on different platforms) belong to the same user, and thereby achieves cross-platform user identification.

For example, suppose the prespecified reference data for two accounts on different platforms belonging to the same user is that every feature similarity value is greater than 2.

Then only when every feature similarity value actually obtained is greater than 2 do the two text segments belong to the same user, in which case the two accounts are identified as belonging to the same user; otherwise, when the reference data is not satisfied, the two accounts belong to different users.
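The example reference rule can be written as a one-line check. The dictionary keys and the function name below are hypothetical, and the sample values are illustrative only.

```python
def same_user(similarities, threshold=2):
    """Return True only if every feature similarity value exceeds the threshold."""
    return all(value > threshold for value in similarities.values())


# Illustrative feature similarity values for one pair of accounts.
pair_features = {
    "word_inclusion": 3,
    "high_frequency": 3,
    "character_distribution": 3,
    "topic": 3,
}
decision = same_user(pair_features)
```

A single value at or below the threshold is enough for the pair to be treated as different users under this rule.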

As can be seen from the above scheme, the cross-platform user identification method disclosed by the invention takes full account of the importance of user messages on social platforms, and identifies whether two accounts belong to the same user from the similarity of the personalized information (such as the user's knowledge, interests, preferences, writing style and wording habits) reflected by the messages in two accounts on different platforms within a corresponding time period. Specifically, the method obtains the message contents whose issuing time falls within a preset time period in the two accounts of the different platforms, performs word segmentation and feature extraction on the message contents of the two accounts, and on this basis uses the similarity of the segmented-word features of the two accounts' messages to identify whether the two accounts belong to the same user. The invention thus solves the problem of identifying the same user on different social platforms, and thereby provides support for cross-platform analysis of the same user's data.

Embodiment two

In embodiment two, with reference to Fig. 2, the cross-platform user identification method may further comprise the following step between steps S102 and S103:

S106: filter the first message segment in segmented form and the second message segment in segmented form respectively.

This step comprises:

Removing stop words and low-frequency words from the first message segment in segmented form; removing stop words and low-frequency words from the second message segment in segmented form.

Specifically, users of social platforms tend to post many messages and to post frequently (microblog users, for example, post a large number of microblog messages), so the collected message text can become very large. To speed up cross-platform user identification, this embodiment removes stop words and low-frequency words (for example, filtering out segmented words whose frequency is less than 3) from the text segments of the two accounts on different platforms. Removing these words of relatively low reference value reduces the dimension of the feature vector and speeds up identification without materially affecting identification accuracy.
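The filtering step can be sketched as follows. The function name and the tiny stop-word set are hypothetical; the minimum frequency of 3 follows the example in the text.

```python
from collections import Counter


def filter_tokens(tokens, stop_words, min_count=3):
    """Drop stop words and tokens that occur fewer than `min_count` times."""
    counts = Counter(tokens)
    return [
        token for token in tokens
        if token not in stop_words and counts[token] >= min_count
    ]


# Illustrative segmented text and stop-word list.
tokens = ["the", "cat", "cat", "cat", "dog", "the", "the"]
filtered = filter_tokens(tokens, stop_words={"the"})
```

Here "the" is removed as a stop word and "dog" is removed because it occurs only once.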

Embodiment three

In embodiment three, with reference to Fig. 3, the cross-platform user identification method may further comprise:

S107: using a preset number of message segment sample pairs in advance, and based on the feature similarity of each message segment sample pair, train a maximum entropy classification method for cross-platform user identification to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user.

The two message segments in each message segment sample pair belong to two accounts on different platforms; the two accounts are either accounts of the same user or accounts of different users; and the issuing time of the messages in each message segment sample pair falls within a second preset time period. The feature similarity comprises word inclusion similarity, high-frequency word similarity, character distribution similarity and topic similarity.

To improve the accuracy of cross-platform user identification, this step uses message segment sample pairs of a preset scale as training samples in advance and trains the maximum entropy classification method for cross-platform user identification, obtaining a maximum entropy classifier.

The construction of the maximum entropy classifier is described below through a concrete example.

Collect described two kinds of platform accounts (i.e. userid) that 1200 have each user in the user of Sina's microblogging account and Tengxun's microblogging account simultaneously, obtain 1200 Sina microblogging userid and 1200 the Tengxun microblogging userid of 1200 users; Collect the Sina microblogging userid of each user in the user of 1200 Ge Jinyou Sina microblogging accounts, collect the Tengxun microblogging userid of each user in the user of 1200 Ge Jinyou Tengxun microblogging accounts.And by different according to platform for the userid collected, be built into two account list: Sina's account list and Tengxun's account list.

On this basis, using the API interfaces provided by Sina Weibo and Tencent Weibo respectively, all microblog messages posted in the last three months by each user in the account lists are crawled, yielding the message text section of each userid. The word segmentation software FudanNLP is then used to perform word segmentation on the text section of each userid, and each segmented text section is associated with the corresponding userid in the account list, where each row represents the message text section (in word-segmented form) of one account.

The two text sections belonging to the same user's accounts on the two platforms are organized into a pair, and the remaining text sections are paired across platforms, giving a total of 1200 same-user text section sample pairs and 1200 different-user text section sample pairs.

Word segmentation, stop word removal and low-frequency word removal are performed on each text section sample pair. Then 1000 same-user text section sample pairs and 1000 different-user text section sample pairs are selected, and feature extraction and feature similarity calculation are performed on them to form the training samples; at the same time, feature extraction and feature similarity calculation are performed on the remaining 200 same-user text section sample pairs and 200 different-user text section sample pairs to form the test samples. Here, the feature similarity values comprise the word inclusion similarity value, the high-frequency word similarity value, the word distribution similarity value and the topic similarity value. For the word segmentation, filtering, feature extraction and feature similarity calculation in this step, refer to the description of embodiment one; they are not described in detail again here.
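The pairing and splitting described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the label encoding (1 for same user, 0 for different users) and the seeded random shuffle used to select the training pairs are all assumptions, since the text only states how many pairs are chosen.

```python
import random


def build_sample_pairs(both, sina_only, tencent_only):
    """Build labelled text section sample pairs (hypothetical helper).

    both: list of (sina_text, tencent_text) for users with accounts on both platforms
    sina_only / tencent_only: text sections of users with an account on one platform only
    Label 1 = same user, label 0 = different users.
    """
    same = [((s, t), 1) for s, t in both]
    # Pair the remaining text sections across platforms: different users.
    diff = [((s, t), 0) for s, t in zip(sina_only, tencent_only)]
    return same, diff


def split_train_test(same, diff, n_train_per_class, seed=0):
    """Select n_train_per_class pairs of each class for training; the rest is the test set.

    The shuffle is an assumption; the source does not say how pairs are selected.
    """
    rng = random.Random(seed)
    same, diff = same[:], diff[:]
    rng.shuffle(same)
    rng.shuffle(diff)
    train = same[:n_train_per_class] + diff[:n_train_per_class]
    test = same[n_train_per_class:] + diff[n_train_per_class:]
    return train, test
```

With 1200 pairs of each class and `n_train_per_class=1000`, this yields 2000 training pairs and 400 test pairs, matching the counts given in the example.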

On this basis, based on the feature similarity of each training sample pair, the training samples are used to train the maximum entropy classification method for cross-platform user identification (two text sections belonging to the same user is one class, and two text sections not belonging to the same user is the other class), thereby building the maximum entropy classifier.

The maximum entropy classification method is based on maximum entropy information theory. Its basic idea is to build a model from all known factors while excluding all unknown factors; that is, to find a probability distribution that satisfies all known facts while leaving the unknown factors as random as possible. Compared with the Naive Bayes method, the most notable feature of this method is that it does not require conditional independence between features; the method is therefore suitable for combining a variety of different features without having to consider the interactions among them.
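As an illustration of how such a classifier might be trained, the sketch below fits a two-class maximum entropy model, which for two classes is equivalent to logistic regression, by batch gradient ascent on the log-likelihood. The source does not specify the training algorithm; the optimizer, the learning rate, the number of epochs, the use of the four similarity values directly as real-valued features, and the names `train_maxent` and `predict_proba` are all assumptions.

```python
import math


def predict_proba(w, b, x):
    """Probability that the sample pair with feature vector x belongs to the same user."""
    z = b + sum(wk * xk for wk, xk in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(samples, labels, epochs=500, lr=0.5):
    """Fit a two-class maximum entropy model by batch gradient ascent (assumed optimizer).

    samples: list of feature vectors, e.g. the four similarity values of a sample pair
    labels:  1 = same user, 0 = different users
    Returns the learned weights w and bias b.
    """
    n_feat = len(samples[0])
    n = len(samples)
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = y - predict_proba(w, b, x)  # gradient of the log-likelihood
            for k in range(n_feat):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wk + lr * g / n for wk, g in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b
```

A pair would then be assigned to the "same user" class when `predict_proba` exceeds 0.5; that decision threshold is likewise an assumption.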

Under the maximum entropy model, the conditional probability P(c|D) is predicted by the following formula:

P(c_{i} \mid D) = \frac{1}{Z(D)} \exp\left( \sum_{k} \lambda_{k,c} F_{k,c}(D, c_{i}) \right) \qquad (1)

Here, Z(D) is the normalization factor; λ_{k,c} is the weight of the feature function F_{k,c}, and the values of λ_{k,c} are obtained in the process of building the base classifier; the value of i is 1 or 0; k indexes the features in the feature space (in this application, each feature similarity), and ranges from 1 to the size of the feature space; F_{k,c} is the feature function, defined as:

F_{k,c}(D, c') = \begin{cases} 1, & n_{k}(d) > 0 \text{ and } c' = c \\ 0, & \text{otherwise} \end{cases} \qquad (2)

Here, n_{k}(d) represents the length of the features contained in the sample pair to be identified; in this application, n_{k}(d) is always greater than 0. c represents the true result of whether the sample pair to be identified belongs to the same user, and c' represents the result of classification (identification) by the classifier. If the result identified by the classifier is the same as the true result, the value of F_{k,c} is 1; if the identified result is inconsistent with the true result, the value of F_{k,c} is 0.

For example, take the word inclusion, high-frequency word, word distribution and topic similarities above as features 1, 2, 3 and 4 respectively. For each sample to be classified, all four features are present (only the feature values differ), so n_{k}(d) = 4 and hence n_{k}(d) > 0.
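Formulas (1) and (2) can be evaluated directly, as in the sketch below. It is an illustration only: the weight values passed in would come from training, and the dictionary layout (`weights[(k, c)]` for λ_{k,c}, `n_k[k]` for n_{k}(d)) and the function names are hypothetical.

```python
import math


def feature_fn(k, c, n_k, c_prime):
    """Formula (2): 1 if feature k is present (n_k > 0) and c' equals c, otherwise 0."""
    return 1 if n_k[k] > 0 and c_prime == c else 0


def maxent_prob(weights, n_k, classes=(0, 1)):
    """Formula (1): P(c_i | D) for every class c_i.

    weights: dict mapping (k, c) to the weight lambda_{k,c} (hypothetical layout)
    n_k:     dict mapping feature index k to n_k(d) for the sample pair D
    """
    scores = {}
    for c_i in classes:
        total = sum(lam * feature_fn(k, c, n_k, c_i) for (k, c), lam in weights.items())
        scores[c_i] = math.exp(total)
    z = sum(scores.values())  # normalization factor Z(D)
    return {c_i: s / z for c_i, s in scores.items()}
```

Because Z(D) sums the exponentiated scores over both classes, the two returned probabilities always sum to 1.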

Subsequently, the classification performance of the constructed classifier is tested using the above test samples. The applicant has verified with actual test data that the constructed classifier has high classification accuracy, and that the accuracy of cross-platform user identification based on the classifier is significantly improved compared with identification that does not use the classifier.

Embodiment four

Embodiment four of the present invention discloses a cross-platform user identification system, which corresponds to the cross-platform user identification method disclosed in embodiments one to three.

First, corresponding to embodiment one and with reference to Fig. 4, the system comprises a message acquisition module 100, a word segmentation module 200, a feature extraction module 300, a judging module 400 and an identification module 500.

The message acquisition module 100 is configured to obtain a first message section of a first user account on a first platform and a second message section of a second user account on a second platform, where the first message section consists of all messages in the first user account whose issuing times fall within a first preset time period, and the second message section consists of all messages in the second user account whose issuing times fall within the first preset time period.

The word segmentation module 200 is configured to perform word segmentation on the first message section and the second message section respectively, obtaining the first message section in word-segmented form and the second message section in word-segmented form.

The feature extraction module 300 is configured to perform feature extraction on the first message section in word-segmented form and the second message section in word-segmented form based on preset word segmentation features, and, on the basis of the feature extraction, to obtain the feature similarity values of the first message section and the second message section.

Here, the feature extraction module 300 comprises a first extraction unit, a second extraction unit, a third extraction unit and a fourth extraction unit.

The judging module 400 is configured to judge whether the feature similarity values fall within a preset similarity value reference range.

The identification module 500 is configured to identify, when the judgment result is yes, that the first user account and the second user account belong to the same user.
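The interaction of the judging module and the identification module can be sketched as follows. This is an illustrative reading rather than the disclosed implementation: the feature names, the reference ranges, the inclusive bounds, and the requirement that all four similarity values lie within their ranges are assumptions.

```python
def judge(similarities, reference_ranges):
    """Judging module sketch: True only if every similarity value lies within
    its preset reference range (inclusive bounds are an assumption)."""
    return all(
        reference_ranges[name][0] <= value <= reference_ranges[name][1]
        for name, value in similarities.items()
    )


def identify(similarities, reference_ranges):
    """Identification module sketch: True means the two accounts are identified
    as belonging to the same user."""
    return judge(similarities, reference_ranges)
```

For example, with hypothetical ranges such as `{"word_inclusion": (0.6, 1.0), "topic": (0.7, 1.0)}`, a pair is identified as the same user only when each of its similarity values falls inside the corresponding range.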

Corresponding to embodiment two and with reference to Fig. 5, the system further comprises a filtering module 600 for filtering the first message section in word-segmented form and the second message section in word-segmented form respectively; this module comprises a first filtering unit and a second filtering unit.

Corresponding to embodiment three and with reference to Fig. 6, the system further comprises a preprocessing module 700. This module is configured to use a preset number of message section sample pairs in advance and, based on the feature similarity of each message section sample pair, to perform cross-platform user identification training on the maximum entropy classification method to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user, wherein:

Because the cross-platform user identification system disclosed in embodiment four of the present invention corresponds to the cross-platform user identification method disclosed in embodiments one to three, its description is relatively brief; for related or similar details, refer to the description of the cross-platform user identification method in embodiments one to three, which is not repeated here.

In summary, the present invention takes full account of the importance of user messages in social platforms. It identifies whether two accounts belong to the same user according to the similarity of the personalized information, such as user knowledge, interests, preferences, writing style and wording habits, reflected by the user messages in two accounts on different platforms within the corresponding time period, and it improves the accuracy of cross-platform user identification by building a maximum entropy classifier in advance. The invention thus solves the problem of identifying the same user on different social platforms and provides support for cross-platform data analysis of the same user.

It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are identical or similar among the embodiments, the embodiments may be referred to one another.

For convenience of description, the above system is described in terms of various modules or units divided by function. Of course, when implementing the present application, the functions of the units may be realized in one or more pieces of software and/or hardware.

Finally, it should also be noted that, in this document, relational terms such as first, second, third and fourth are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device that comprises the element.

The above is only a preferred embodiment of the present invention. It should be pointed out that those skilled in the art may make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A cross-platform user identification method, characterized by comprising:

2. The method according to claim 1, characterized in that performing feature extraction on the first message section in word-segmented form and the second message section in word-segmented form based on the preset word segmentation features, and obtaining, on the basis of the feature extraction, the feature similarity values of the first message section and the second message section, comprises:

3. The method according to claim 1, characterized in that, before feature extraction is performed on the first message section in word-segmented form and the second message section in word-segmented form, the method further comprises: filtering the first message section in word-segmented form and the second message section in word-segmented form respectively, the filtering comprising:

4. The method according to claim 1, characterized by further comprising:

5. The method according to claim 2, characterized in that the word distribution similarity value of the first message section and the second message section is obtained by calculating the relative entropy D(p||q) of the two message sections;

6. The method according to claim 2, characterized in that the document topic generation model LDA is used to extract the latent topics of the first message section in word-segmented form and the second message section in word-segmented form.

7. A cross-platform user identification system, characterized by comprising:

8. The system according to claim 7, characterized in that the feature extraction module comprises:

9. The system according to claim 7, characterized by further comprising: a filtering module for filtering the first message section in word-segmented form and the second message section in word-segmented form respectively, the filtering module comprising:

10. The system according to claim 7, characterized by further comprising: