CN104317784A - Cross-platform user identification method and cross-platform user identification system - Google Patents

Info

Publication number
CN104317784A
Authority
CN
China
Prior art keywords
message section
message
word
point
word form
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410521299.5A
Other languages
Chinese (zh)
Inventor
李寿山
黄磊
周国栋
王红玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Application filed by Suzhou University
Priority to CN201410521299.5A
Publication of CN104317784A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-platform user identification method and a cross-platform user identification system. They take full account of the importance of user messages on social platforms and identify whether two accounts on different platforms belong to the same user according to the similarity of the personalized information, such as user knowledge, interests, preferences, writing style and wording habits, reflected by the user messages in the two accounts within a corresponding period of time. Specifically, the method obtains the message content published within a preset period of time in the two accounts of the different platforms, performs word segmentation and feature extraction on the message content of the two accounts, and on this basis uses the similarity between the segmented-word features of the two accounts' messages to identify whether the two accounts belong to the same user. The method and system thus solve the problem of identifying the same user on different social platforms and in turn support cross-platform data analysis for that user.

Description

A cross-platform user identification method and system
Technical field
The invention belongs to the fields of natural language processing and social networks, and in particular relates to a cross-platform user identification method and system.
Background art
In recent years, with the rapid development of social networks, various microblogging services (micro-blogs), such as Sina Weibo, Tencent Weibo, Twitter and Facebook, have become increasingly popular with users.
Because microblogs have the characteristics of both a broadcast medium and a social network, they have attracted many researchers to analyze microblog data. At present, more and more users hold microblog accounts on several platforms at once, for example a Sina account and a Tencent account. Studying the microblog data (such as microblog messages) of the same user on different platforms together makes it easier to analyze the user's interests and preferences comprehensively and in depth, which in turn helps enterprises formulate personalized marketing strategies and place advertisements accurately. It also makes it easier to compare the same user's motivations and habits across platforms, which provides a useful reference for operating social networks or developing new social network products.
However, research on identifying the same user across social platforms is still almost blank, and it is not yet possible to identify whether accounts on different platforms belong to the same user. Identifying the same user on different social platforms has therefore become a problem in urgent need of a solution.
Summary of the invention
In view of this, the object of the present invention is to provide a cross-platform user identification method and system that solve the problem of identifying the same user on different social platforms and thereby support cross-platform data analysis for that user.
To this end, the present invention discloses the following technical solutions:
A cross-platform user identification method, comprising:
Obtaining a first message section of a first user account on a first platform and a second message section of a second user account on a second platform, wherein the first message section consists of all messages in the first user account whose publication time falls within a first preset time period, and the second message section consists of all messages in the second user account whose publication time falls within the first preset time period;
Performing word segmentation on the first message section and the second message section respectively, to obtain a word-segmented first message section and a word-segmented second message section;
Performing feature extraction on the word-segmented first message section and the word-segmented second message section based on preset segmented-word features, and obtaining, on the basis of the feature extraction, characteristic similarity values of the first message section and the second message section;
Judging whether the characteristic similarity values are within a preset similarity reference range;
If so, determining that the first user account and the second user account belong to the same user.
In the above method, preferably, performing feature extraction on the word-segmented first message section and the word-segmented second message section based on the preset segmented-word features, and obtaining the characteristic similarity values of the first message section and the second message section on the basis of the feature extraction, comprises:
Extracting word trigram features from the word-segmented first message section and the word-segmented second message section respectively, and obtaining a word inclusion similarity value of the two based on the number of identical word trigrams contained in the first message section and the second message section;
Extracting high-frequency word features from the word-segmented first message section and the word-segmented second message section respectively, and obtaining a high-frequency word similarity value of the two based on the number of identical high-frequency words contained in the first message section and the second message section;
Extracting single-character occurrence probabilities from the word-segmented first message section and the word-segmented second message section respectively, and obtaining a character distribution similarity value of the two based on the occurrence probabilities of the identical single characters contained in the first message section and the second message section;
Extracting the latent topics of the word-segmented first message section and the word-segmented second message section respectively, and obtaining a topic similarity value of the two based on the number of identical topics contained in the first message section and the second message section.
In the above method, preferably, before feature extraction is performed on the word-segmented first message section and the word-segmented second message section, the method further comprises filtering the word-segmented first message section and the word-segmented second message section respectively, the filtering comprising:
Removing stop words and low-frequency words from the word-segmented first message section;
Removing stop words and low-frequency words from the word-segmented second message section.
In the above method, preferably, the method further comprises:
Using a preset number of message section sample pairs in advance, and based on the characteristic similarities of each message section sample pair, training a maximum entropy classification method for cross-platform user identification to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user, wherein:
The two message sections in a message section sample pair belong to two accounts on different platforms, the two accounts being either accounts of the same user or accounts of different users, and the publication time of the messages in the message section sample pair falls within a second preset time period;
The characteristic similarities comprise the word inclusion similarity, the high-frequency word similarity, the character distribution similarity and the topic similarity.
In the above method, preferably, the character distribution similarity value of the two message sections is obtained by calculating their relative entropy $D(p\|q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$;
Here p and q denote the first message section and the second message section respectively, p(x) and q(x) denote the probabilities with which the identical single character x occurs in the first message section and the second message section respectively, and X denotes the set of single characters common to the first message section and the second message section.
In the above method, preferably, the document topic generation model LDA is used to extract the latent topics of the word-segmented first message section and the word-segmented second message section.
A cross-platform user identification system, comprising:
A message acquisition module, configured to obtain a first message section of a first user account on a first platform and a second message section of a second user account on a second platform, wherein the first message section consists of all messages in the first user account whose publication time falls within a first preset time period, and the second message section consists of all messages in the second user account whose publication time falls within the first preset time period;
A word segmentation module, configured to perform word segmentation on the first message section and the second message section respectively, to obtain a word-segmented first message section and a word-segmented second message section;
A feature extraction module, configured to perform feature extraction on the word-segmented first message section and the word-segmented second message section based on preset segmented-word features, and to obtain, on the basis of the feature extraction, characteristic similarity values of the first message section and the second message section;
A judging module, configured to judge whether the characteristic similarity values are within a preset similarity reference range;
An identification module, configured to identify, when the judgment result is yes, that the first user account and the second user account belong to the same user.
In the above system, preferably, the feature extraction module comprises:
A first extraction unit, configured to extract word trigram features from the word-segmented first message section and the word-segmented second message section respectively, and to obtain a word inclusion similarity value of the two based on the number of identical word trigrams contained in the first message section and the second message section;
A second extraction unit, configured to extract high-frequency word features from the word-segmented first message section and the word-segmented second message section respectively, and to obtain a high-frequency word similarity value of the two based on the number of identical high-frequency words contained in the first message section and the second message section;
A third extraction unit, configured to extract single-character occurrence probabilities from the word-segmented first message section and the word-segmented second message section respectively, and to obtain a character distribution similarity value of the two based on the occurrence probabilities of the identical single characters contained in the first message section and the second message section;
A fourth extraction unit, configured to extract the latent topics of the word-segmented first message section and the word-segmented second message section respectively, and to obtain a topic similarity value of the two based on the number of identical topics contained in the first message section and the second message section.
In the above system, preferably, the system further comprises a filtering module configured to filter the word-segmented first message section and the word-segmented second message section respectively, the filtering module comprising:
A first filtering unit, configured to remove stop words and low-frequency words from the word-segmented first message section;
A second filtering unit, configured to remove stop words and low-frequency words from the word-segmented second message section.
In the above system, preferably, the system further comprises:
A preprocessing module, configured to use a preset number of message section sample pairs in advance and, based on the characteristic similarities of each message section sample pair, to train a maximum entropy classification method for cross-platform user identification to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user, wherein:
The two message sections in a message section sample pair belong to two accounts on different platforms, the two accounts being either accounts of the same user or accounts of different users, and the publication time of the messages in the message section sample pair falls within a second preset time period;
The characteristic similarities comprise the word inclusion similarity, the high-frequency word similarity, the character distribution similarity and the topic similarity.
As can be seen from the above solutions, the cross-platform user identification method and system disclosed by the invention take full account of the importance of user messages on social platforms and identify whether two accounts belong to the same user according to the similarity of the personalized information, such as user knowledge, interests, preferences, writing style and wording habits, reflected by the user messages in two accounts on different platforms within the corresponding time period. Specifically, the method obtains the message content published within a preset time period in two accounts on different platforms, performs word segmentation and feature extraction on the message content of the two accounts, and on this basis uses the similarity of the segmented-word features of the two accounts' messages to identify whether the two accounts on different platforms belong to the same user. The invention thus solves the problem of identifying the same user on different social platforms and in turn supports cross-platform data analysis for that user.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the cross-platform user identification method disclosed in embodiment one of the present invention;
Fig. 2 is another flowchart of the cross-platform user identification method disclosed in embodiment two of the present invention;
Fig. 3 is a further flowchart of the cross-platform user identification method disclosed in embodiment three of the present invention;
Fig. 4 is a structural diagram of the cross-platform user identification system disclosed in embodiment four of the present invention;
Fig. 5 is another structural diagram of the cross-platform user identification system disclosed in embodiment four of the present invention;
Fig. 6 is a further structural diagram of the cross-platform user identification system disclosed in embodiment four of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment one
Embodiment one discloses a cross-platform user identification method. Referring to Fig. 1, the method may comprise the following steps:
S101: obtain a first message section of a first user account on a first platform and a second message section of a second user account on a second platform, wherein the first message section consists of all messages in the first user account whose publication time falls within a first preset time period, and the second message section consists of all messages in the second user account whose publication time falls within the first preset time period.
This embodiment illustrates the method by taking as an example the identification of whether a Sina Weibo account and a Tencent Weibo account belong to the same user.
Specifically, the API (Application Programming Interface) provided by Sina Weibo can be used to capture, from a specified Sina Weibo account, the user messages published within the preset time period, and the API provided by Tencent Weibo can be used to capture, from a specified Tencent Weibo account, the user messages published within the preset time period. For example, all message texts posted or forwarded by the user in the last three months are captured from Sina Weibo account userid1 to form text segment 1, and all message texts posted or forwarded by the user in the last three months are captured from Tencent Weibo account userid2 to form text segment 2.
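The construction of a message section in step S101 can be sketched as follows. This is a minimal illustration that assumes the messages have already been downloaded as (timestamp, text) pairs; the function name `build_message_section`, the 90-day window and the sample data are hypothetical, and no real Sina Weibo or Tencent Weibo API call is shown.

```python
from datetime import datetime, timedelta

def build_message_section(messages, end, days=90):
    """Join the texts of all messages published in [end - days, end].

    `messages` is a list of (timestamp, text) pairs; the 90-day default
    is an assumed stand-in for "the last three months".
    """
    start = end - timedelta(days=days)
    return "\n".join(text for ts, text in messages if start <= ts <= end)

# Hypothetical messages of one account (posted or forwarded).
msgs = [
    (datetime(2014, 9, 1), "recent post"),
    (datetime(2014, 1, 1), "old post"),
    (datetime(2014, 8, 15), "reposted item"),
]
section = build_message_section(msgs, end=datetime(2014, 9, 30))
print(section)
```

Applying the same function with the same window to the account on the second platform would yield the second message section.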
S102: perform word segmentation on the first message section and the second message section respectively, to obtain a word-segmented first message section and a word-segmented second message section.
Word segmentation means dividing a Chinese sentence into a sequence of words; for example, the sentence "I love China" becomes the word sequence "I / love / China" after segmentation.
This step uses the word segmentation software FudanNLP to segment the message sections (in this application, text segments) obtained from the two accounts on different platforms, for example text segment 1 of Sina Weibo account userid1 and text segment 2 of Tencent Weibo account userid2.
S103: perform feature extraction on the word-segmented first message section and the word-segmented second message section based on preset segmented-word features, and obtain the characteristic similarity values of the first message section and the second message section on the basis of the feature extraction.
Step S103 comprises:
A. Extract word trigram features from the word-segmented first message section and the word-segmented second message section respectively, and obtain the word inclusion similarity value of the two based on the number of identical word trigrams contained in the first message section and the second message section.
This step judges how similar two text segments are from the number of identical word trigrams they contain: the more identical trigrams they share, the more similar they are considered to be.
A word trigram (ternary word) is a structure made up of three consecutive segmented words in a message text.
Step A extracts every word trigram contained in the two segmented text segments. For example, suppose text A after segmentation is "on holiday / again / not / know / do / what" and text B is "not / know / do / what". Four trigrams can then be extracted from text A: (1) "on holiday, again, not"; (2) "again, not, know"; (3) "not, know, do"; (4) "know, do, what". Three trigrams can be extracted from text B: (1) "not, know, do"; (2) "know, do, what"; (3) "do, what".
The word inclusion similarity value of the two text segments is then determined from the number of identical trigrams they contain.
For example, if the two text segments share 0 identical trigrams, their word inclusion similarity value is 0; if the number of identical trigrams is between 0 and 50, the value is 1; if it is between 50 and 100, the value is 2; and if it is greater than 100, the value is 3.
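Step A can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names are hypothetical, identical trigrams are counted as distinct shared trigrams (a set intersection), and the bucket boundaries at exactly 50 and 100, which the text leaves ambiguous, are assumed to belong to the lower bucket.

```python
def word_trigrams(tokens):
    """Return the set of all runs of three consecutive words."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def inclusion_similarity(tokens_a, tokens_b):
    """Map the number of shared trigrams to the values 0, 1, 2 or 3."""
    shared = len(word_trigrams(tokens_a) & word_trigrams(tokens_b))
    if shared == 0:
        return 0
    if shared <= 50:   # assumed: boundary value 50 falls in bucket 1
        return 1
    if shared <= 100:  # assumed: boundary value 100 falls in bucket 2
        return 2
    return 3

# Illustrative segmented texts.
text_a = ["on holiday", "again", "not", "know", "do", "what"]
text_b = ["not", "know", "do", "what"]
print(len(word_trigrams(text_a)), inclusion_similarity(text_a, text_b))
```

A text with fewer than three words yields an empty trigram set, so its similarity value with any other text is 0.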
B. Extract high-frequency word features from the word-segmented first message section and the word-segmented second message section respectively, and obtain the high-frequency word similarity value of the two based on the number of identical high-frequency words contained in the first message section and the second message section.
The more identical high-frequency words two texts contain, the more similar the two texts are.
This step first sorts the words of each text segment in descending order of frequency; for example, for a text whose segmented words are "I / am / I", the sorted word sequence is "I, am".
It then counts the identical words among a preset number of high-frequency words at the head of the two sorted sequences, for example among the first 100 high-frequency words of each sequence, and determines the high-frequency word similarity value of the two text segments from that count.
For example, it can be specified that when the number of identical high-frequency words is 0, the high-frequency word similarity value is 0; when the number is between 0 and 20, the value is 1; when it is between 20 and 50, the value is 2; and when it is between 50 and 100, the value is 3.
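Step B can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the top 100 words are compared and that the boundary values 20 and 50 belong to the lower bucket, which the text does not specify.

```python
from collections import Counter

def top_words(tokens, n=100):
    """Return the n most frequent words of a segmented text as a set."""
    return {word for word, _ in Counter(tokens).most_common(n)}

def high_frequency_similarity(tokens_a, tokens_b, n=100):
    """Map the number of shared high-frequency words to 0, 1, 2 or 3."""
    shared = len(top_words(tokens_a, n) & top_words(tokens_b, n))
    if shared == 0:
        return 0
    if shared <= 20:  # assumed: boundary value 20 falls in bucket 1
        return 1
    if shared <= 50:  # assumed: boundary value 50 falls in bucket 2
        return 2
    return 3

# Illustrative segmented texts.
tokens_p = ["I", "am", "I"]
tokens_q = ["I", "am", "Chinese"]
ranked = [word for word, _ in Counter(tokens_p).most_common()]
print(ranked, high_frequency_similarity(tokens_p, tokens_q))
```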
C. Extract single-character occurrence probabilities from the word-segmented first message section and the word-segmented second message section respectively, and obtain the character distribution similarity value of the two based on the occurrence probabilities of the identical single characters contained in the first message section and the second message section.
This step obtains the character distribution similarity value of the two text segments by calculating their relative entropy $D(p\|q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$, where p and q denote the first message section and the second message section respectively, p(x) and q(x) denote the probabilities with which the identical single character x occurs in the first message section and the second message section respectively, and X denotes the set of single characters common to the first message section and the second message section.
The relative entropy is also called the KL (Kullback-Leibler) distance or divergence. The smaller the KL distance between two texts, the smaller the difference between their character distributions, that is, the more similar the two texts are in how their characters are distributed.
For example, let text p be "I am me" (three characters in the original Chinese, in which the character for "I" appears twice) and text q be "I am Chinese" (five distinct characters in the original Chinese). In text p the probability of the character "I" is p(I) = 2/3 and the probability of the character "am" is p(am) = 1/3; in text q every character has probability 1/5. By the formula, the KL distance between p and q is $D(p\|q) = p(\text{I}) \log \frac{p(\text{I})}{q(\text{I})} + p(\text{am}) \log \frac{p(\text{am})}{q(\text{am})}$.
On this basis, the character distribution similarity value of the two text segments is obtained from their KL distance.
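The relative entropy of step C can be sketched as follows. This is a minimal illustration with hypothetical function names: the sum runs only over the characters common to both texts, as the definition of X above states, and the natural logarithm is an assumption because the text does not name a base. The strings "aba" and "abcde" are stand-ins that reproduce the probabilities of the example above (2/3 and 1/3 in p, 1/5 for every character in q).

```python
import math
from collections import Counter

def char_distribution(text):
    """Return the occurrence probability of every character in text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

def relative_entropy(text_p, text_q):
    """KL distance D(p||q), summed over characters shared by both texts."""
    p = char_distribution(text_p)
    q = char_distribution(text_q)
    shared = p.keys() & q.keys()
    # Natural logarithm assumed; the source does not specify the base.
    return sum(p[x] * math.log(p[x] / q[x]) for x in shared)

# Stand-in texts: 'a' plays the role of "I", 'b' the role of "am".
kl = relative_entropy("aba", "abcde")
print(kl)
```

How the KL distance is mapped to a discrete similarity value is not specified in the text, so that mapping is left out of the sketch.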
D. Extract the latent topics of the word-segmented first message section and the word-segmented second message section respectively, and obtain the topic similarity value of the two based on the number of identical topics contained in the first message section and the second message section.
The microblog messages that the same user posts on different social media are highly similar. On this basis, the more identical topics two text segments contain, the more likely it is that the two text segments come from the same user.
This step analyzes the content of the two text segments and uses LDA (Latent Dirichlet Allocation, a document topic generation model) to extract their latent topics. Suppose, for example, that game and military news often appear in the microblogs of text segment A, while entertainment news and Taobao shopping often appear in the microblogs of text segment B. After the LDA algorithm is applied, the latent topics of text segment A are games, military affairs and so on, and the topics of text segment B are entertainment, online shopping and so on.
The topic similarity value of the two text segments is then determined from the number of identical topics in the two text segments. For example, if the number of identical topics is 0, the topic similarity value is 0; if the number is 1, the value is 1; if the number is 2, the value is 2; if the number is 3, the value is 3; and so on.
The values 0, 1, 2, 3 and so on taken by the word inclusion similarity value, the high-frequency word similarity value and the topic similarity value only indicate different degrees of similarity between the two text segments: the larger the value, the higher the degree of similarity.
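The counting part of step D can be sketched as follows. This minimal illustration assumes that the latent topics of each text segment have already been extracted by some LDA implementation and reduced to comparable labels; the LDA step itself is not shown, and the function name and topic labels are hypothetical.

```python
def topic_similarity(topics_a, topics_b):
    """Return the number of topics that both text segments share."""
    return len(set(topics_a) & set(topics_b))

# Hypothetical topic labels for three text segments.
topics_a = ["games", "military"]
topics_b = ["entertainment", "online shopping"]
topics_c = ["games", "entertainment", "military"]
print(topic_similarity(topics_a, topics_b), topic_similarity(topics_a, topics_c))
```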
S104: judge whether the characteristic similarity values are within the preset similarity reference range.
S105: if the judgment result is yes, the first user account and the second user account belong to the same user.
This step compares each similarity value actually obtained for the two text segments, namely the word inclusion similarity value, the high-frequency word similarity value, the character distribution similarity value and the topic similarity value, with predefined reference data to identify whether the two text segments (which come from two accounts on different platforms) belong to the same user, thereby achieving cross-platform user identification.
For example, suppose the predefined reference data for two accounts on different platforms belonging to the same user is that every characteristic similarity value is greater than 2.
Then the two text segments belong to the same user, and the two accounts are identified as the same user, only when every characteristic similarity value actually obtained is greater than 2; otherwise, when the reference data is not satisfied, the two accounts belong to different users.
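The example rule of steps S104 and S105 can be sketched as follows. This minimal illustration assumes that all four similarity values, including the character distribution similarity, have already been mapped to comparable numbers; the function name and dictionary keys are hypothetical.

```python
def same_user(similarities, threshold=2):
    """Return True only if every similarity value exceeds the threshold."""
    return all(value > threshold for value in similarities.values())

# Hypothetical similarity values for two account pairs.
pair_1 = {"word_inclusion": 3, "high_frequency": 3,
          "char_distribution": 3, "topic": 3}
pair_2 = {"word_inclusion": 3, "high_frequency": 3,
          "char_distribution": 3, "topic": 2}
print(same_user(pair_1), same_user(pair_2))
```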
As can be seen from the above solution, the cross-platform user identification method disclosed by the invention takes full account of the importance of user messages on social platforms and identifies whether two accounts belong to the same user according to the similarity of the personalized information, such as user knowledge, interests, preferences, writing style and wording habits, reflected by the user messages in two accounts on different platforms within the corresponding time period. Specifically, the method obtains the message content published within a preset time period in two accounts on different platforms, performs word segmentation and feature extraction on the message content of the two accounts, and on this basis uses the similarity of the segmented-word features of the two accounts' messages to identify whether the two accounts on different platforms belong to the same user. The invention thus solves the problem of identifying the same user on different social platforms and in turn supports cross-platform data analysis for that user.
Embodiment two
In embodiment two, referring to Fig. 2, the cross-platform user identification method may further comprise the following step between steps S102 and S103:
S106: filter the word-segmented first message section and the word-segmented second message section respectively.
This step comprises:
Removing stop words and low-frequency words from the word-segmented first message section; removing stop words and low-frequency words from the word-segmented second message section.
Specifically, users of social platforms tend to post many messages and to post frequently (microblog users, for example, publish a large number of microblog messages), so the collected message text can be very large. To speed up cross-platform user identification, this embodiment removes stop words and low-frequency words (for example, words whose frequency is less than 3) from the text segments of the two accounts on different platforms. Removing words of relatively low reference value reduces the dimension of the feature vector and speeds up identification with relatively little effect on identification accuracy.
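The filtering of step S106 can be sketched as follows. This is a minimal illustration with a hypothetical function name and stop-word list; it keeps a word only if it is not a stop word and occurs at least three times in the text segment, matching the example threshold above.

```python
from collections import Counter

def filter_tokens(tokens, stop_words, min_count=3):
    """Drop stop words and words occurring fewer than min_count times."""
    counts = Counter(tokens)
    return [t for t in tokens
            if t not in stop_words and counts[t] >= min_count]

# Hypothetical segmented text and stop-word list.
tokens = ["the", "game", "game", "game", "the", "the", "war"]
filtered = filter_tokens(tokens, stop_words={"the"})
print(filtered)
```

Here "the" is removed as a stop word although it occurs three times, and "war" is removed because it occurs only once.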
Embodiment three
In this third embodiment, with reference to Fig. 3, the cross-platform user identification method may further comprise:
S107: Using a preset number of message section sample pairs, and based on the feature similarity of each message section sample pair, train a maximum entropy classification method for cross-platform user identification in advance to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user.
The two message sections in a message section sample pair belong to two accounts on different platforms, the two accounts being either accounts of the same user or accounts of different users, and the publication time of the messages contained in the message section sample pair falls within a second preset time period. The feature similarity comprises word inclusion similarity, high-frequency word similarity, word distribution similarity and topic similarity.
To improve the accuracy of cross-platform user identification, this step uses message section sample pairs of a preset scale as training samples in advance to train the maximum entropy classification method for cross-platform user identification, yielding a maximum entropy classifier.
The construction of the maximum entropy classifier is described below through a specific example.
Collect the accounts (i.e. userid) on both platforms for each of 1,200 users who have both a Sina Weibo account and a Tencent Weibo account, yielding 1,200 Sina Weibo userids and 1,200 Tencent Weibo userids for these 1,200 users. Also collect the Sina Weibo userid of each of 1,200 users who have only a Sina Weibo account, and the Tencent Weibo userid of each of 1,200 users who have only a Tencent Weibo account. The collected userids are then grouped by platform into two account lists: a Sina account list and a Tencent account list.
On this basis, the API interfaces provided by Sina Weibo and Tencent Weibo are used to crawl, according to the user lists, all microblog messages that each user posted in the last three months, yielding a message text section for each userid. The word segmentation software FudanNLP is then used to segment the text section of each userid, and each segmented text section is associated with its userid in the account list, where each row represents the message text section (in word-segmented form) of one account.
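Building a message text section from the messages of one account can be sketched as follows. This is a minimal illustration only: the message record layout (`text` and `published` keys), the function name and the 90-day window standing in for "the last three months" are assumptions, and the platform APIs and the FudanNLP segmentation step are deliberately left out.

```python
from datetime import datetime, timedelta

def build_message_section(messages, end, days=90):
    """Join the texts of all messages published within `days` days before `end`."""
    start = end - timedelta(days=days)
    in_window = [m for m in messages if start <= m["published"] <= end]
    in_window.sort(key=lambda m: m["published"])  # oldest message first
    return "\n".join(m["text"] for m in in_window)

# Illustrative messages of one account (hypothetical data).
end = datetime(2014, 9, 30)
messages = [
    {"text": "old post", "published": datetime(2014, 1, 1)},
    {"text": "recent post", "published": datetime(2014, 9, 1)},
    {"text": "earlier recent post", "published": datetime(2014, 8, 1)},
]
section = build_message_section(messages, end)
```

The resulting section would then be passed to a word segmenter to obtain the word-segmented form described above.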
The two text sections that belong to the same user under the two platform accounts are paired with each other, and the remaining text sections are paired across platforms, yielding 1,200 same-user text section sample pairs and 1,200 different-user text section sample pairs in total.
Word segmentation, stop-word removal and low-frequency-word removal are performed on each text section sample pair. Then 1,000 same-user text section sample pairs and 1,000 different-user text section sample pairs are selected, and feature extraction and feature similarity calculation are performed on them to form the training samples. The remaining 200 same-user pairs and 200 different-user pairs undergo the same feature extraction and feature similarity calculation to form the test samples. The feature similarity values comprise the word inclusion similarity value, the high-frequency word similarity value, the word distribution similarity value and the topic similarity value. For the word segmentation, filtering, feature extraction and feature similarity calculation in this step, refer to the description of Embodiment one; they are not described again here.
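The pairing and the training/test split described above can be sketched in Python. The dictionaries keyed by a person identifier, the function name `build_pairs` and the placeholder texts are hypothetical; only the counts (1,200 pairs of each kind, split 1,000/200) come from the example.

```python
import random

def build_pairs(platform_a, platform_b, seed=0):
    """Build labelled sample pairs from two {person_id: message_section} maps.

    People present on both platforms give same-user pairs (label 1); people
    present on only one platform are paired across platforms (label 0).
    """
    rng = random.Random(seed)
    shared = sorted(set(platform_a) & set(platform_b))
    same = [(platform_a[u], platform_b[u], 1) for u in shared]
    only_a = sorted(set(platform_a) - set(platform_b))
    only_b = sorted(set(platform_b) - set(platform_a))
    rng.shuffle(only_b)
    diff = [(platform_a[a], platform_b[b], 0) for a, b in zip(only_a, only_b)]
    return same, diff

# Placeholder data with the sizes used in the example.
sina = {f"both{i}": f"sina text {i}" for i in range(1200)}
sina.update({f"sina_only{i}": f"sina solo {i}" for i in range(1200)})
tencent = {f"both{i}": f"tencent text {i}" for i in range(1200)}
tencent.update({f"tencent_only{i}": f"tencent solo {i}" for i in range(1200)})

same, diff = build_pairs(sina, tencent)
train = same[:1000] + diff[:1000]   # 2,000 training pairs
test = same[1000:] + diff[1000:]    # 400 test pairs
```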
On this basis, based on the feature similarity of each training sample pair, the training samples are used to train the maximum entropy classification method for cross-platform user identification (two text sections belonging to the same user form one class, and two text sections not belonging to the same user form the other class), thereby building the maximum entropy classifier.
The maximum entropy classification method is based on maximum entropy information theory. Its basic idea is to build a model for all known factors and to exclude all unknown factors; that is, to find a probability distribution that satisfies all known facts while leaving the unknown factors as random as possible. Compared with the naive Bayes method, the main advantage of this method is that it does not require conditional independence between features. It is therefore suitable for combining a variety of different features without having to consider the interactions between them.
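As a small illustration of "leaving the unknown factors as random as possible", the following sketch compares the Shannon entropy of two distributions over two classes. It is only a numerical illustration of the principle, not part of the disclosed method; the function and variable names are chosen here for the example.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

uniform = [0.5, 0.5]   # no preference between the two classes
skewed = [0.9, 0.1]    # strong preference not justified by any known fact

# With no constraints, the uniform distribution has the larger entropy,
# so it is the one the maximum entropy principle would select.
uniform_is_preferred = entropy(uniform) > entropy(skewed)
```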
Under the maximum entropy model, the conditional probability P(c|D) is predicted as follows:
$$P(c_i \mid D) = \frac{1}{Z(D)} \exp\left( \sum_{k} \lambda_{k,c} F_{k,c}(D, c_i) \right) \qquad (1)$$
Here, $Z(D)$ is the normalization factor; $\lambda_{k,c}$ is the weight of the feature function $F_{k,c}$, and its value is obtained in the process of building the classifier; $i$ takes the value 1 or 0; $k$ indexes the features in the feature space (in this application, each feature similarity) and ranges from 1 to the size of the feature space; $F_{k,c}$ is the feature function, defined as:
$$F_{k,c}(D, c') = \begin{cases} 1, & n_k(d) > 0 \text{ and } c' = c \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
Here, $n_k(d)$ denotes the number of features contained in the sample pair to be identified; in this application, $n_k(d)$ is always greater than 0. $c$ denotes the true result of whether the sample pair to be identified belongs to the same user, and $c'$ denotes the result of the classifier's classification (identification). If the classifier's result agrees with the true result, the value of $F_{k,c}$ is 1; if they disagree, the value of $F_{k,c}$ is 0.
For example, taking the word inclusion, high-frequency word, word distribution and topic similarities above as features 1, 2, 3 and 4, all four features are present for every sample to be classified (only their values differ), so $n_k(d) = 4$ and hence $n_k(d) > 0$.
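Equations (1) and (2) can be evaluated with a short Python sketch. It follows the two equations literally, with indicator feature functions; the weights are hypothetical values chosen for illustration (in practice they would come from training), features are indexed from 0 rather than 1, and how the similarity values themselves enter the feature functions is not specified in this section and is not modelled here.

```python
import math

def maxent_probabilities(feature_counts, weights, classes=(0, 1)):
    """Return P(c | D) for each class c, following equations (1) and (2).

    feature_counts[k] plays the role of n_k(d); weights[(k, c)] is lambda_{k,c}.
    """
    scores = {}
    for c in classes:
        # F_{k,c}(D, c') is 1 only when n_k(d) > 0 and c' == c,
        # so only the weights attached to class c contribute to its score.
        scores[c] = sum(weights[(k, c)]
                        for k, n in enumerate(feature_counts) if n > 0)
    z = sum(math.exp(s) for s in scores.values())  # normalization factor Z(D)
    return {c: math.exp(s) / z for c, s in scores.items()}

# Hypothetical weights for the four similarity features and the two classes
# (1 = same user, 0 = different users).
weights = {(k, 1): 0.5 for k in range(4)}
weights.update({(k, 0): -0.5 for k in range(4)})
probs = maxent_probabilities([4, 4, 4, 4], weights)
```

The class with the larger probability in `probs` would be taken as the classifier's decision.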
The classification performance of the constructed classifier was subsequently tested with the above test samples. The applicant verified with actual test data that the constructed classifier has relatively high classification accuracy, and that classifier-based cross-platform user identification is markedly more accurate than identification without a classifier.
Embodiment four
Embodiment four of the present invention discloses a cross-platform user identification system, which corresponds to the cross-platform user identification method disclosed in Embodiments one to three.
First, corresponding to Embodiment one and with reference to Fig. 4, the system comprises a message capturing module 100, a word segmentation processing module 200, a feature extraction module 300, a judging module 400 and an identification module 500.
The message capturing module 100 is configured to obtain a first message section of a first user account on a first platform and a second message section of a second user account on a second platform, wherein the first message section consists of all messages in the first user account whose publication time falls within a first preset time period, and the second message section consists of all messages in the second user account whose publication time falls within the first preset time period.
The word segmentation processing module 200 is configured to perform word segmentation on the first message section and the second message section respectively, obtaining the first message section in word-segmented form and the second message section in word-segmented form.
The feature extraction module 300 is configured to perform feature extraction on the first message section in word-segmented form and the second message section in word-segmented form based on preset word-segmentation features and, on the basis of the extracted features, to obtain the feature similarity values of the first message section and the second message section.
The feature extraction module 300 comprises a first extraction unit, a second extraction unit, a third extraction unit and a fourth extraction unit.
The first extraction unit is configured to extract trigram features from the first message section in word-segmented form and the second message section in word-segmented form respectively, and to obtain the word inclusion similarity value of the two based on the number of identical trigrams contained in the first message section and the second message section;
The second extraction unit is configured to extract high-frequency word features from the first message section in word-segmented form and the second message section in word-segmented form respectively, and to obtain the high-frequency word similarity value of the two based on the number of identical high-frequency words contained in the first message section and the second message section;
The third extraction unit is configured to extract single-character occurrence probabilities from the first message section in word-segmented form and the second message section in word-segmented form respectively, and to obtain the word distribution similarity value of the two based on the occurrence probabilities of the identical single characters contained in the first message section and the second message section;
The fourth extraction unit is configured to extract the latent topics of the first message section in word-segmented form and the second message section in word-segmented form respectively, and to obtain the topic similarity value of the two based on the number of identical topics contained in the first message section and the second message section.
The judging module 400 is configured to judge whether the feature similarity values fall within a preset similarity reference range.
The identification module 500 is configured to identify, when the judgment result is yes, that the first user account and the second user account belong to the same user.
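The feature extraction, judging and identification modules can be sketched together in Python. This is a minimal sketch under stated assumptions rather than the disclosed implementation: trigrams are taken to be sequences of three consecutive segmented words, "high-frequency words" are taken to be the `top_n` most frequent words, the word distribution similarity is computed as a relative entropy restricted to shared characters, the reference ranges are hypothetical, and topic extraction is omitted because it needs a topic-model library.

```python
import math
from collections import Counter

def trigram_overlap(tokens_a, tokens_b):
    """Number of distinct word trigrams that occur in both token lists."""
    def trigrams(tokens):
        return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    return len(trigrams(tokens_a) & trigrams(tokens_b))

def high_freq_overlap(tokens_a, tokens_b, top_n=100):
    """Number of words shared by the top_n most frequent words of each list."""
    def top_words(tokens):
        return {word for word, _ in Counter(tokens).most_common(top_n)}
    return len(top_words(tokens_a) & top_words(tokens_b))

def char_relative_entropy(tokens_a, tokens_b):
    """Relative entropy summed over the single characters common to both sections."""
    chars_a, chars_b = Counter("".join(tokens_a)), Counter("".join(tokens_b))
    total_a, total_b = sum(chars_a.values()), sum(chars_b.values())
    result = 0.0
    for ch in set(chars_a) & set(chars_b):
        p, q = chars_a[ch] / total_a, chars_b[ch] / total_b
        result += p * math.log(p / q)
    return result

def is_same_user(similarities, reference_ranges):
    """True when every similarity value lies within its preset reference range."""
    return all(low <= similarities[name] <= high
               for name, (low, high) in reference_ranges.items())

# Two short message sections in word-segmented form (illustrative data).
first = ["i", "like", "green", "tea", "every", "morning"]
second = ["we", "like", "green", "tea", "every", "day"]
similarities = {
    "word_inclusion": trigram_overlap(first, second),
    "high_frequency": high_freq_overlap(first, second, top_n=10),
    "word_distribution": char_relative_entropy(first, second),
}
# Hypothetical reference ranges; the source does not give concrete values.
reference_ranges = {
    "word_inclusion": (1, float("inf")),
    "high_frequency": (3, float("inf")),
    "word_distribution": (0.0, 0.5),
}
same_user = is_same_user(similarities, reference_ranges)
```

Note that for the relative entropy a smaller value means more similar character distributions, which is why its hypothetical range is bounded from above while the count-based features are bounded from below.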
Corresponding to Embodiment two and with reference to Fig. 5, the system further comprises a filtering module 600 configured to filter the first message section in word-segmented form and the second message section in word-segmented form respectively. This module comprises a first filtering unit and a second filtering unit.
The first filtering unit is configured to remove stop words and low-frequency words from the first message section in word-segmented form;
The second filtering unit is configured to remove stop words and low-frequency words from the second message section in word-segmented form.
Corresponding to Embodiment three and with reference to Fig. 6, the system further comprises a preprocessing module 700. This module is configured to use a preset number of message section sample pairs and, based on the feature similarity of each message section sample pair, to train a maximum entropy classification method for cross-platform user identification in advance to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user, wherein:
The two message sections in a message section sample pair belong to two accounts on different platforms, the two accounts being either accounts of the same user or accounts of different users, and the publication time of the messages contained in the message section sample pair falls within a second preset time period;
The feature similarity comprises word inclusion similarity, high-frequency word similarity, word distribution similarity and topic similarity.
Because the cross-platform user identification system disclosed in Embodiment four of the present invention corresponds to the cross-platform user identification method disclosed in Embodiments one to three, its description is relatively brief. For related details, refer to the description of the cross-platform user identification method in Embodiments one to three, which is not repeated here.
In summary, the present invention takes full account of the importance of user messages on social platforms. It identifies whether two accounts on different platforms belong to the same user according to the similarity of the personalized information reflected by the user messages of the two accounts in a corresponding time period, such as the user's knowledge, interests, preferences, writing style and wording habits, and it improves the accuracy of cross-platform user identification by building a maximum entropy classifier in advance. It thereby solves the problem of identifying the same user across different social platforms and supports cross-platform data analysis for that user.
It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.
For convenience of description, the above system is described in terms of modules or units divided by function. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Finally, it should also be noted that, in this document, relational terms such as first, second, third and fourth are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A cross-platform user identification method, characterized by comprising:
Obtaining a first message section of a first user account on a first platform and a second message section of a second user account on a second platform, wherein the first message section consists of all messages in the first user account whose publication time falls within a first preset time period, and the second message section consists of all messages in the second user account whose publication time falls within the first preset time period;
Performing word segmentation on the first message section and the second message section respectively, to obtain the first message section in word-segmented form and the second message section in word-segmented form;
Performing feature extraction on the first message section in word-segmented form and the second message section in word-segmented form based on preset word-segmentation features and, on the basis of the extracted features, obtaining feature similarity values of the first message section and the second message section;
Judging whether the feature similarity values fall within a preset similarity reference range;
If the judgment result is yes, determining that the first user account and the second user account belong to the same user.
2. The method according to claim 1, characterized in that performing feature extraction on the first message section in word-segmented form and the second message section in word-segmented form based on preset word-segmentation features and, on the basis of the extracted features, obtaining the feature similarity values of the first message section and the second message section comprises:
Extracting trigram features from the first message section in word-segmented form and the second message section in word-segmented form respectively, and obtaining the word inclusion similarity value of the two based on the number of identical trigrams contained in the first message section and the second message section;
Extracting high-frequency word features from the first message section in word-segmented form and the second message section in word-segmented form respectively, and obtaining the high-frequency word similarity value of the two based on the number of identical high-frequency words contained in the first message section and the second message section;
Extracting single-character occurrence probabilities from the first message section in word-segmented form and the second message section in word-segmented form respectively, and obtaining the word distribution similarity value of the two based on the occurrence probabilities of the identical single characters contained in the first message section and the second message section;
Extracting the latent topics of the first message section in word-segmented form and the second message section in word-segmented form respectively, and obtaining the topic similarity value of the two based on the number of identical topics contained in the first message section and the second message section.
3. The method according to claim 1, characterized in that, before feature extraction is performed on the first message section in word-segmented form and the second message section in word-segmented form, the method further comprises: filtering the first message section in word-segmented form and the second message section in word-segmented form respectively, the filtering comprising:
Removing stop words and low-frequency words from the first message section in word-segmented form;
Removing stop words and low-frequency words from the second message section in word-segmented form.
4. The method according to claim 1, characterized by further comprising:
Using a preset number of message section sample pairs and, based on the feature similarity of each message section sample pair, training a maximum entropy classification method for cross-platform user identification in advance to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user, wherein:
The two message sections in a message section sample pair belong to two accounts on different platforms, the two accounts being either accounts of the same user or accounts of different users, and the publication time of the messages contained in the message section sample pair falls within a second preset time period;
The feature similarity comprises word inclusion similarity, high-frequency word similarity, word distribution similarity and topic similarity.
5. The method according to claim 2, characterized in that the word distribution similarity value of the two is obtained by calculating the relative entropy D(p||q) of the first message section and the second message section:
$$D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$
Here, p and q denote the first message section and the second message section respectively, p(x) and q(x) denote the probabilities that the identical single character x occurs in the first message section and the second message section respectively, and X denotes the set of single characters common to the first message section and the second message section.
6. The method according to claim 2, characterized in that the document topic generation model LDA is used to extract the latent topics of the first message section in word-segmented form and the second message section in word-segmented form.
7. A cross-platform user identification system, characterized by comprising:
A message capturing module, configured to obtain a first message section of a first user account on a first platform and a second message section of a second user account on a second platform, wherein the first message section consists of all messages in the first user account whose publication time falls within a first preset time period, and the second message section consists of all messages in the second user account whose publication time falls within the first preset time period;
A word segmentation processing module, configured to perform word segmentation on the first message section and the second message section respectively, to obtain the first message section in word-segmented form and the second message section in word-segmented form;
A feature extraction module, configured to perform feature extraction on the first message section in word-segmented form and the second message section in word-segmented form based on preset word-segmentation features and, on the basis of the extracted features, to obtain feature similarity values of the first message section and the second message section;
A judging module, configured to judge whether the feature similarity values fall within a preset similarity reference range;
An identification module, configured to identify, when the judgment result is yes, that the first user account and the second user account belong to the same user.
8. The system according to claim 7, characterized in that the feature extraction module comprises:
A first extraction unit, configured to extract trigram features from the first message section in word-segmented form and the second message section in word-segmented form respectively, and to obtain the word inclusion similarity value of the two based on the number of identical trigrams contained in the first message section and the second message section;
A second extraction unit, configured to extract high-frequency word features from the first message section in word-segmented form and the second message section in word-segmented form respectively, and to obtain the high-frequency word similarity value of the two based on the number of identical high-frequency words contained in the first message section and the second message section;
A third extraction unit, configured to extract single-character occurrence probabilities from the first message section in word-segmented form and the second message section in word-segmented form respectively, and to obtain the word distribution similarity value of the two based on the occurrence probabilities of the identical single characters contained in the first message section and the second message section;
A fourth extraction unit, configured to extract the latent topics of the first message section in word-segmented form and the second message section in word-segmented form respectively, and to obtain the topic similarity value of the two based on the number of identical topics contained in the first message section and the second message section.
9. The system according to claim 7, characterized by further comprising: a filtering module configured to filter the first message section in word-segmented form and the second message section in word-segmented form respectively, the filtering module comprising:
A first filtering unit, configured to remove stop words and low-frequency words from the first message section in word-segmented form;
A second filtering unit, configured to remove stop words and low-frequency words from the second message section in word-segmented form.
10. The system according to claim 7, characterized by further comprising:
A preprocessing module, configured to use a preset number of message section sample pairs and, based on the feature similarity of each message section sample pair, to train a maximum entropy classification method for cross-platform user identification in advance to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user, wherein:
The two message sections in a message section sample pair belong to two accounts on different platforms, the two accounts being either accounts of the same user or accounts of different users, and the publication time of the messages contained in the message section sample pair falls within a second preset time period;
The feature similarity comprises word inclusion similarity, high-frequency word similarity, word distribution similarity and topic similarity.
CN201410521299.5A 2014-09-30 2014-09-30 Cross-platform user identification method and cross-platform user identification system Pending CN104317784A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410521299.5A CN104317784A (en) 2014-09-30 2014-09-30 Cross-platform user identification method and cross-platform user identification system

Publications (1)

Publication Number Publication Date
CN104317784A true CN104317784A (en) 2015-01-28

Family

ID=52373017

Country Status (1)

Country Link
CN (1) CN104317784A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778388A (en) * 2015-05-04 2015-07-15 苏州大学 Method and system for identifying same user under two different platforms
CN105183806A (en) * 2015-08-26 2015-12-23 苏州大学张家港工业技术研究院 Method and system for identifying same user among different platforms
CN106034149A (en) * 2015-03-13 2016-10-19 阿里巴巴集团控股有限公司 Account identification method and device
CN106126654A (en) * 2016-06-27 2016-11-16 中国科学院信息工程研究所 A kind of inter-network station based on user name similarity user-association method
CN106161406A (en) * 2015-04-22 2016-11-23 深圳市腾讯计算机系统有限公司 The method and apparatus obtaining user account
CN107465718A (en) * 2017-06-20 2017-12-12 晶赞广告(上海)有限公司 Across the ID recognition methods of application and device, storage medium, terminal
CN107832783A (en) * 2017-10-25 2018-03-23 平安科技(深圳)有限公司 Across social platform user matching method, data processing equipment and readable storage medium storing program for executing
CN109145529A (en) * 2018-09-12 2019-01-04 重庆工业职业技术学院 A kind of text similarity analysis method and system for copyright authentication
CN110222790A (en) * 2019-06-17 2019-09-10 南京中孚信息技术有限公司 Method for identifying ID, device and server
CN110324278A (en) * 2018-03-29 2019-10-11 北大方正集团有限公司 Account main body consistency detecting method, device and equipment
CN110826605A (en) * 2019-10-24 2020-02-21 北京明略软件系统有限公司 Method and device for identifying user in cross-platform manner
CN110838995A (en) * 2019-10-16 2020-02-25 西安交通大学 Blind self-adaptive multi-user detection method based on generalized maximum correlation entropy criterion
CN111767438A (en) * 2020-06-16 2020-10-13 上海同犀智能科技有限公司 Identity recognition method based on Hash combined integral
CN111881304A (en) * 2020-07-21 2020-11-03 百度在线网络技术(北京)有限公司 Author identification method, device, equipment and storage medium
CN112418294A (en) * 2020-11-18 2021-02-26 青岛海尔科技有限公司 Method, device, storage medium and electronic device for determining account type

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101840514A (en) * 2009-03-19 2010-09-22 株式会社理光 Image object classification device and method
CN101980196A (en) * 2010-10-25 2011-02-23 中国农业大学 Article comparison method and device
CN103729474A (en) * 2014-01-23 2014-04-16 中国科学院计算技术研究所 Method and system for identifying vest account numbers of forum users
CN103778260A (en) * 2014-03-03 2014-05-07 哈尔滨工业大学 Individualized microblog information recommending system and method
CN103793481A (en) * 2014-01-16 2014-05-14 中国科学院软件研究所 Microblog word cloud generating method based on user interest mining and accessing supporting system
CN103838789A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Text similarity computing method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101840514A (en) * 2009-03-19 2010-09-22 株式会社理光 Image object classification device and method
CN101980196A (en) * 2010-10-25 2011-02-23 中国农业大学 Article comparison method and device
CN103838789A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Text similarity computing method
CN103793481A (en) * 2014-01-16 2014-05-14 中国科学院软件研究所 Microblog word cloud generating method based on user interest mining and accessing supporting system
CN103729474A (en) * 2014-01-23 2014-04-16 中国科学院计算技术研究所 Method and system for identifying vest account numbers of forum users
CN103778260A (en) * 2014-03-03 2014-05-07 哈尔滨工业大学 Individualized microblog information recommending system and method

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106034149A (en) * 2015-03-13 2016-10-19 阿里巴巴集团控股有限公司 Account identification method and device
CN106034149B (en) * 2015-03-13 2019-06-18 阿里巴巴集团控股有限公司 A kind of account recognition methods and device
US10462257B2 (en) 2015-04-22 2019-10-29 Tencent Technology (Shenzhen) Company Limited Method and apparatus for obtaining user account
CN106161406A (en) * 2015-04-22 2016-11-23 深圳市腾讯计算机系统有限公司 The method and apparatus obtaining user account
CN106161406B (en) * 2015-04-22 2019-12-03 深圳市腾讯计算机系统有限公司 The method and apparatus for obtaining user account
CN104778388A (en) * 2015-05-04 2015-07-15 苏州大学 Method and system for identifying same user under two different platforms
CN105183806A (en) * 2015-08-26 2015-12-23 苏州大学张家港工业技术研究院 Method and system for identifying same user among different platforms
CN106126654A (en) * 2016-06-27 2016-11-16 中国科学院信息工程研究所 Cross-website user association method based on user name similarity
CN106126654B (en) * 2016-06-27 2019-10-18 中国科学院信息工程研究所 Cross-website user association method based on user name similarity
CN107465718A (en) * 2017-06-20 2017-12-12 晶赞广告(上海)有限公司 Cross-application ID identification method and device, storage medium and terminal
CN107465718B (en) * 2017-06-20 2020-12-22 晶赞广告(上海)有限公司 Cross-application ID identification method and device, storage medium and terminal
CN107832783A (en) * 2017-10-25 2018-03-23 平安科技(深圳)有限公司 Cross-social-platform user matching method, data processing device and readable storage medium
CN110324278A (en) * 2018-03-29 2019-10-11 北大方正集团有限公司 Account main body consistency detecting method, device and equipment
CN109145529A (en) * 2018-09-12 2019-01-04 重庆工业职业技术学院 Text similarity analysis method and system for copyright authentication
CN110222790A (en) * 2019-06-17 2019-09-10 南京中孚信息技术有限公司 User identity identification method, device and server
CN110222790B (en) * 2019-06-17 2021-05-25 南京中孚信息技术有限公司 User identity identification method and device and server
CN110838995A (en) * 2019-10-16 2020-02-25 西安交通大学 Blind self-adaptive multi-user detection method based on generalized maximum correlation entropy criterion
CN110838995B (en) * 2019-10-16 2020-10-27 西安交通大学 Blind self-adaptive multi-user detection method based on generalized maximum correlation entropy criterion
CN110826605A (en) * 2019-10-24 2020-02-21 北京明略软件系统有限公司 Method and device for identifying user in cross-platform manner
CN111767438A (en) * 2020-06-16 2020-10-13 上海同犀智能科技有限公司 Identity recognition method based on Hash combined integral
CN111881304A (en) * 2020-07-21 2020-11-03 百度在线网络技术(北京)有限公司 Author identification method, device, equipment and storage medium
CN111881304B (en) * 2020-07-21 2024-04-26 百度在线网络技术(北京)有限公司 Author identification method, device, equipment and storage medium
CN112418294A (en) * 2020-11-18 2021-02-26 青岛海尔科技有限公司 Method, device, storage medium and electronic device for determining account type

Similar Documents

Publication Publication Date Title
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
Alberto et al. Tubespam: Comment spam filtering on youtube
KR101536520B1 (en) Method and server for extracting topic and evaluating compatibility of the extracted topic
CN105787025B (en) Network platform public account classification method and device
CN103336766A (en) Short text garbage identification and modeling method and device
CN104484343B (en) Method for topic discovery and tracking on microblogs
CN106339502A (en) Modeling recommendation method based on user behavior data fragmentation cluster
CN106886567B (en) Microblogging incident detection method and device based on semantic extension
CN106156372B (en) Classification method and device for internet sites
CN106250513A (en) Personalized event classification method and system based on event modeling
CN104966031A (en) Method for identifying permission-irrelevant private data in Android application program
CN104408093A (en) News event element extracting method and device
CN104281653A (en) Viewpoint mining method for ten million microblog texts
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN102663139A (en) Method and system for constructing emotional dictionary
Riadi Detection of cyberbullying on social media using data mining techniques
CN104424308A (en) Web page classification standard acquisition method and device and web page classification method and device
CN105138653A (en) Exercise recommendation method and device based on typical degree and difficulty
CN105843796A (en) Microblog emotional tendency analysis method and device
CN103164698A (en) Method and device of generating fingerprint database and method and device of fingerprint matching of text to be tested
CN109933648B (en) Real user comment distinguishing method and device
CN110990676A (en) Social media hotspot topic extraction method and system
CN104598632A (en) Hot event detection method and device
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
CN105224955A (en) Method for acquiring network service state based on microblog big data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150128